SLU-TMI / TextMining.jl


Counting Spaces #84

Closed GBrew252 closed 9 years ago

GBrew252 commented 9 years ago

The table of the 50 most frequent features that Mark shared from the EEBO cluster test seems to show that we're counting spaces. In the Ignatius, Godwin, Cavendish, and Rawley texts, either the most frequent or the second most frequent feature is an empty pair of double quotation marks: ("",63), ("",40), ("",859), ("",67). Does this mean spaces are being counted, or is this notation for something else? Either way, that feature is not a word.

mtabor150 commented 9 years ago

These aren't spaces; they are empty strings. A space would be " " (a space between the quotes), whereas these are "" (consecutive quotes). I'm guessing they are caused by floating punctuation. We split on whitespace and then remove exterior punctuation, so a word goes from "-word." to "word", but a token made only of punctuation goes from "--," to "". The simple fix is to set fv[""] = 0 after we make a feature vector.
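To make the mechanism concrete, here is a minimal Python sketch of the behaviour described above. It is only an illustration, not the package's actual Julia code: the function name `feature_vector`, the `drop_empty` flag, and the sample sentence are all hypothetical, and Python's `string.punctuation` stands in for whatever punctuation set the package really strips.

```python
import string
from collections import Counter


def feature_vector(text, drop_empty=True):
    """Count tokens after splitting on whitespace and stripping exterior punctuation.

    Hypothetical helper for illustration; not the TextMining.jl implementation.
    """
    tokens = text.split()  # split on runs of whitespace
    # Strip leading/trailing punctuation: "-word." -> "word", but "--," -> ""
    stripped = [t.strip(string.punctuation) for t in tokens]
    fv = Counter(stripped)
    if drop_empty:
        # Assumption: drop the empty-string feature entirely rather than
        # setting it to 0, so it cannot show up in a frequency table.
        fv.pop("", None)
    return fv


sample = 'He said -word. and then --, "word" again.'
raw = feature_vector(sample, drop_empty=False)
clean = feature_vector(sample)

print(raw[""])        # count of the empty-string feature before the fix
print("" in clean)    # whether the empty string survives the fix
```

In this sketch the floating `--,` token is the only source of the `""` feature. Removing the key, as done here, and setting `fv[""] = 0`, as suggested in the comment above, both keep the empty string out of the top of a frequency table; which one fits better depends on how the feature vector type treats zero-valued entries.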

mtabor150 commented 9 years ago

Should be fixed now (#85).