BIDS-collaborative / destress

Helping @peparedes with text analysis of livejournal data
ISC License

Modify flex file to ignore contents of <base64></base64> tags #20

coryschillaci closed this issue 9 years ago

coryschillaci commented 9 years ago

Needed for #3

coryschillaci commented 9 years ago

Solved by @lambdaloop in 7b17ae500ca47faf1adb71fabc5fb44353b89e34

coryschillaci commented 9 years ago

@lambdaloop I tried compiling xmltweet with your new flex file. When I run it on a test file containing:

<reply_count><int>0</int></reply_count></post><post><user>000000000000000</user><itemid><int>595</int><\
/itemid><subject><base64>0KDQsNC30LPQvtCy0L7RgA==</base64></subject><event><base64>PGJsb2NrcXVvdGUgc2l0ZT0iaHR0c\
DovL2FiYy5saXZlam91cm5hbC5jb20vIj7QoNCw0LfQs9C+0LLQvtGAINC90LAgPGEgaHJlZj0iaHR0cDovL2FiYy5saXZlam91cm5hbC5jb20vN\
zg0OTAuaHRtbCI+0L/RgNC40L3Rg9C00LjRgtC10LvRjNC90YvRhSDRgNCw0LHQvtGC0LDRhTwvYT4g0LzQtdC20LTRgyDQstC+0YDQvtC8INCyI\
NC30LDQutC+0L3QtSDCq9C+0YLRgNC40YbQsNC70L7QucK7INC4INC90LDRh9Cw0LvRjNGB0YLQstC+0LwuDQoNCtCd0LDRh9Cw0LvRjNC90LjQu\
jog4oCUINCd0YMg0YfRgtC+LCDQsdGD0LTQtdGI0Ywg0YDRg9Cx0LjRgtGMINC70LXRgT8NCtCS0L7RgDog4oCUINCa0L7QvNCw0L3QtNC40YAsI\
NC90LUg0Y8g0LXQs9C+INGB0LDQttCw0LssINC90LUg0LzQvdC1INC10LPQviDQuCDRgNGD0LHQuNGC0YwhPC9ibG9ja3F1b3RlPg0KPGxqLXJlc\
G9zdCBidXR0b249IisxIiAvPg==</base64></event>

The output (with number tokens stripped) is:

<reply_count>,<int>,</int>,</reply_count>,</post>,<post>,<user>,</user>,<itemid>,<int>,</int>,
</itemid>,<subject>,</subject>,<event>,base,pgjsb,nrcxvvdgugc,l,zt,iahr,c,dovl,fiyy,saxzlam,cm,hbc,jb,vij,qoncw,lfqs,c,
llqvtgainc,lagpgegahjlzj,iahr,cdovl,fiyy,saxzlam,cm,hbc,jb,vn,zg,otauahrtbci,l,rgnc,l,rg,c,
ljrgtc,lvrjnc,yvrhsdrgncw,lhqvtgc,ldrhtwvyt,g,lzqtdc,ltrgydqstc,ydqvtc,incyi,nc,ldqutc,l,qtsdcq,c,
ylrgnc,ybqsnc,l,quck,inc,inc,ldrh,cw,lvrjngb,ylqstc,lwudqonctcd,ldrh,cw,lvrjnc,ljqu,jog,ocuincd,ymg,
yfrgtc,lcdqsdgd,ltqtdgi,ywg,ydrg,cx,ljrgtgminc,lxrgt,nctcs,l,rgdog,ocuinca,l,qvncw,l,qtnc,yasi,nc,lug,
y,g,lxqs,c,ingb,ldqttcw,lssinc,lug,lzqvdc,inc,lpqvidqucdrgngd,lhqungc,ywhpc,ibg,ja,f,b,rlpg,kpgxqlxjlc,
g,zdcbidxr,b,iisxiiavpg,base,</event>

Any idea what's up?

coryschillaci commented 9 years ago

It looks like this happens whenever there is a newline anywhere inside the base64 tag. Hopefully that doesn't happen in the data.
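
For anyone trying to reproduce this, here is a rough Python sketch of the kind of failure described above. It is only an analogy and not the rule from the actual flex file, which is not shown in this thread. The assumption is that the pattern relies on `.`, which matches any character except a newline in flex and, by default, in Python's `re` as well. The names `SINGLE_LINE`, `MULTI_LINE`, and `strip_base64` are made up for illustration.

```python
import re

# Hypothetical single-line pattern: "." does not match "\n" by default,
# which mirrors how "." behaves in a flex rule.
SINGLE_LINE = re.compile(r"<base64>.*?</base64>")

# Newline-tolerant variant: base64 text never contains "<", so a negated
# character class can span line breaks without relying on ".".
MULTI_LINE = re.compile(r"<base64>[^<]*</base64>")


def strip_base64(text, pattern):
    """Drop the payload between the tags but keep the tags themselves."""
    return pattern.sub("<base64></base64>", text)


one_line = "<subject><base64>0KDQsNC3</base64></subject>"
wrapped = "<event><base64>PGJsb2Nr\ncXVvdGUg</base64></event>"

# Payload on one line: the single-line pattern removes it.
print(strip_base64(one_line, SINGLE_LINE))

# Payload containing a newline: the single-line pattern never matches,
# so the base64 text is left in place for the tokenizer to chew on.
print(strip_base64(wrapped, SINGLE_LINE))

# The negated character class handles the wrapped payload.
print(strip_base64(wrapped, MULTI_LINE))
```

If the real rule does depend on `.`, the equivalent change in flex would presumably be a character class such as `[^<]*` (or an exclusive start condition that swallows everything up to `</base64>`), but that is a guess until someone checks the rule in the commit.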