BIDS-collaborative / destress

Helping @peparedes with text analysis of livejournal data
ISC License

xmltweet doesn't parse all the </string> occurrences as xml tags #18

Closed anasrferreira closed 9 years ago

anasrferreira commented 9 years ago

The number of <string> and </string> tokens in dictionary.imat (or file.xml.imat) is not always the same. To reproduce, run:

/opt/BIDMach_1.0.0-full-linux-x86_64/bidmach utils.scala examples.scala
newdict.count("<string>")
res20: Double = 98953.0
newdict.count("</string>")
res21: Double = 96029.0
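
The same balance check can be sketched in plain Scala, outside BIDMach. This is only an illustration: `TagBalance`, `tagCounts`, and the sample `tokens` sequence are hypothetical stand-ins for the tokenized file, not part of xmltweet or the BIDMach API.

```scala
// Hypothetical stand-in for the tokenized XML; these are not BIDMach data structures.
object TagBalance {
  /** Returns (number of opening tokens, number of closing tokens) for one tag name. */
  def tagCounts(tokens: Seq[String], tag: String): (Int, Int) = {
    val open = s"<$tag>"
    val close = s"</$tag>"
    (tokens.count(_ == open), tokens.count(_ == close))
  }

  def main(args: Array[String]): Unit = {
    // Made-up tokens: the second closing tag was split into ";<" and "string".
    val tokens = Seq("<string>", "music", "</string>", "<string>", "rock", ";<", "string")
    val (opens, closes) = tagCounts(tokens, "string")
    println(s"<string>: $opens, </string>: $closes, balanced: ${opens == closes}")
  }
}
```

If the tokenizer emitted every closing tag as one token, the two counts would match; any gap equals the number of closing tags that were split or dropped.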

To see where this occurs, search for the tags <event> and </event>.

In most of the failing cases, the tokenizer seems to produce the two tokens ;< and string instead of a single </string> token.

find(xmlFile(eventIdx(?,0))==newdict("<string>") .* (xmlFile(eventIdx(?,1)-1) != newdict("</string>")))
val n = newdict(xmlFile(eventIdx(2,0)->eventIdx(2,1))(find(xmlFile(eventIdx(2,0)->eventIdx(2,1))>0)))
n(size(n,1)-1)
res24: String = string
n(size(n,1)-2)
res28: String = ;<
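
The search above can also be sketched in plain Scala. This is a minimal illustration under stated assumptions: `Span`, `unclosed`, and the sample data are hypothetical, and each event is assumed to be a half-open token range whose last token sits at `end - 1`, mirroring the `eventIdx(?,1)-1` lookup.

```scala
object UnclosedEvents {
  /** Half-open token range [start, end) for one event; a hypothetical type. */
  final case class Span(start: Int, end: Int)

  /** Indices of events that begin with `open` but whose last token is not `close`. */
  def unclosed(tokens: IndexedSeq[String], events: Seq[Span],
               open: String, close: String): Seq[Int] =
    events.zipWithIndex.collect {
      case (Span(s, e), i) if e > s && tokens(s) == open && tokens(e - 1) != close => i
    }

  def main(args: Array[String]): Unit = {
    // Made-up tokens, not taken from the real dictionary.
    val tokens = Vector(
      "<string>", "music", "</string>",   // event 0: closed correctly
      "<string>", "rock", ";<", "string"  // event 1: closing tag split in two
    )
    val events = Seq(Span(0, 3), Span(3, 7))
    println(unclosed(tokens, events, "<string>", "</string>"))
  }
}
```

On this made-up input only event 1 is reported, because its last two tokens are ;< and string rather than </string>.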

An example where it is parsed correctly:

find(xmlFile(eventIdx(?,0))==newdict("<string>") .* (xmlFile(eventIdx(?,1)-1) == newdict("</string>")))
val nn = newdict(xmlFile(eventIdx(7,0)->eventIdx(7,1))(find(xmlFile(eventIdx(7,0)->eventIdx(7,1))>0)))
nn(size(nn,1)-1)
res31: String = </string>
nn(size(nn,1)-2)
res32: String = music

anasrferreira commented 9 years ago

We can work around this, and @lambdaloop is looking into it. This might be of interest to @jcanny.

coryschillaci commented 9 years ago

Solved by @lambdaloop in 7b17ae500ca47faf1adb71fabc5fb44353b89e34