BIDS-collaborative / destress

Helping @peparedes with text analysis of livejournal data
ISC License

Not getting all <event> tags #24

Closed by coryschillaci 9 years ago

coryschillaci commented 9 years ago

This post,

<post><user>ababalamaha</user><itemid><int>269</int></itemid><subject><base64>0J7QsdGK0Y/QstC70LXQvdC40LUg0LTQu9GPINGE0YDQtdC90LTQvtCy</base64></subject><event><base64>0JTRgNGD0LfRjNGPLCDRjyDQt9Cw0LrRgNGL0LLQsNGOINGN0YLQvtGCINC20YPRgNC90LDQuywg0LzQvtC20LXRgtC1INC+0YLRhNGA0LXQvdC00LjQstCw0YLRjNGB0Y8g0YHQviDRgdC/0L7QutC+0LnQvdC+0Lkg0LTRg9GI0L7RjiA6KQ0K0J7QvSDQsdGL0Lsg0LTQu9GPINC80LXQvdGPINC30LDQsdCw0LLQvdGL0Lwg0Lgg0LHQtdC30LDQu9Cw0LHQtdGA0L3Ri9C8LCDQutCw0Log0Lgg0LXQs9C+INC90LDQt9Cy0LDQvdC40LUsINGC0LDQutC40LwsINGB0L7QsdGB0YLQstC10L3QvdC+LCDQuCDQvtGB0YLQsNC90LXRgtGB0Y8g0LIg0L/QsNC80Y/RgtC4IDopDQrQmtC+0LzQvNC10L3RgtCw0YDQuNC4INGB0LrRgNGL0LLQsNGO0YLRgdGPINC/0L4g0YPQvNC+0LvRh9Cw0L3QuNGOLg==</base64></event><ditemid><int>68873</int></ditemid><eventtime><string>2008-10-05 10:30:00</string></eventtime><props><opt_screening><string>A</string></opt_screening><commentalter><int>1300680462</int></commentalter><personifi_tags><string>nterms:yes</string></personifi_tags><revnum><int>4</int></revnum><personifi_lang><string>nil</string></personifi_lang><hasscreened><int>1</int></hasscreened><revtime><int>1223196482</int></revtime></props><logtime><string>2008-10-05 07:36:07</string></logtime><anum><int>9</int></anum><url><string>http://ababalamaha.livejournal.com/68873.html</string></url><event_timestamp><int>1223202600</int></event_timestamp><reply_count><int>0</int></reply_count></post>

gets tokenized as

<user>,ababalamaha,</user>,<itemid>,<int>,</int>,</itemid>,<subject>,</event>,<ditemid>,<int>,</int>,</ditemid>,<eventtime>,<string>,:,:,</string>,</eventtime>,<props>,<opt_screening>,<string>,a,</string>,</opt_screening>,<commentalter>,<int>,</int>,</commentalter>,<personifi_tags>,<string>,nterms,:,yes,</string>,</personifi_tags>,<revnum>,<int>,</int>,</revnum>,<personifi_lang>,<string>,nil,</string>,</personifi_lang>,<hasscreened>,<int>,</int>,</hasscreened>,<revtime>,<int>,</int>,</revtime>,</props>,<logtime>,<string>,:,:,</string>,</logtime>,<anum>,<int>,</int>,</anum>,<url>,<string>,http,:,ababalamaha,.,livejournal,.,com,html,</string>,</url>,<event_timestamp>,<int>,</int>,</event_timestamp>,<reply_count>,<int>,</int>,</reply_count>

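It looks like xmltweet is consuming everything from the first <base64> tag to the last </base64> tag instead of handling each <base64>...</base64> pair separately. In this case </base64></subject><event><base64> is dropped along with both base64 payloads, so the token stream jumps straight from <subject> to </event>.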
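One way to confirm which tags went missing is to compare the tags in the raw post against the tags in the token stream. The sketch below is an illustration only and is not part of xmltweet: the helper `missing_tags`, the shortened sample post, and its placeholder payloads are all hypothetical, and the tokens are assumed to be available as a plain list of strings.

```python
import re
from collections import Counter

# Matches simple opening and closing tags such as <event> or </base64>.
TAG = re.compile(r"</?\w+>")

def missing_tags(raw_xml, tokens):
    """Count tags that occur in raw_xml but not in the token list."""
    expected = Counter(TAG.findall(raw_xml))
    seen = Counter(t for t in tokens if TAG.fullmatch(t))
    # Counter subtraction keeps only tags with a positive shortfall.
    return expected - seen

# Shortened stand-in for the post above (payloads are placeholders).
raw = "<subject><base64>QQ==</base64></subject><event><base64>Qg==</base64></event>"
# Token stream with the same shape as the reported output.
tokens = ["<subject>", "</event>"]

for tag, count in sorted(missing_tags(raw, tokens).items()):
    print(tag, count)
```

For the shortened sample, the report lists both <base64> tags, both </base64> tags, </subject>, and <event> as missing, which matches the pattern described above.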

lambdaloop commented 9 years ago

Ah right, I forgot I can't parse greedily here. I pushed a fix and it seems to work. Could you test it now?
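The greedy versus non-greedy difference can be sketched with Python's `re` module. This is only an illustration of the general behavior, not the actual fix pushed to xmltweet: the sample post is shortened with placeholder payloads, and `strip_base64` is a hypothetical helper.

```python
import re

# Shortened stand-in for the post in this issue (payloads are placeholders).
post = "<subject><base64>QQ==</base64></subject><event><base64>Qg==</base64></event>"

# Greedy: .* runs to the LAST </base64>, swallowing the tags in between.
GREEDY = re.compile(r"<base64>.*</base64>", re.DOTALL)
# Non-greedy: .*? stops at the FIRST </base64> after each opening tag.
LAZY = re.compile(r"<base64>.*?</base64>", re.DOTALL)

def strip_base64(text, pattern):
    """Remove the base64 blocks matched by pattern, keeping everything else."""
    return pattern.sub("", text)

greedy_out = strip_base64(post, GREEDY)
lazy_out = strip_base64(post, LAZY)

print(greedy_out)  # <subject></event>
print(lazy_out)    # <subject></subject><event></event>
```

The greedy pattern reproduces the symptom in this issue, where <subject> is followed directly by </event>. The non-greedy pattern matches each block separately, so </subject> and <event> survive.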

coryschillaci commented 9 years ago

Great, thanks! The only issue was that you need the new version of BIDMach so that the C++ files are all correct; I had to recompile.

lambdaloop commented 9 years ago

Oh whoops, I used the one in /opt on mercury. I guess it's out of date...