GBrew252 closed 9 years ago
These aren't spaces, they are empty strings. Spaces would be " " (space between quotes); these are "" (consecutive quotes). I'm guessing these are caused by floating punctuation. We split on whitespace and then remove exterior punctuation, so words go from "-word." to "word", but punctuation on its own goes from "--," to "". The simple fix is to set fv[""] = 0 after we make a feature vector.
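For illustration, here is a minimal sketch of that behavior in Python. The project's actual language and function names aren't shown in this thread, so `feature_vector` and `drop_empty` are hypothetical, and `string.punctuation` only covers ASCII punctuation, which is an assumption that may not hold for EEBO texts.

```python
import string
from collections import Counter


def feature_vector(text, drop_empty=False):
    """Count tokens after splitting on whitespace and stripping exterior punctuation.

    Hypothetical helper: a stand-in for however the real feature vector is built.
    """
    counts = Counter()
    for raw in text.split():
        token = raw.strip(string.punctuation)  # "-word." -> "word", "--," -> ""
        if drop_empty and token == "":
            continue  # skip tokens that were nothing but punctuation
        counts[token] += 1
    return counts


sample = "-word. --, word"

# Without any guard, the bare punctuation becomes an empty-string feature.
fv = feature_vector(sample)
had_empty_feature = fv[""] > 0

# The simple fix from the comment above: zero it out after building the vector.
fv[""] = 0

# Alternative (an assumption, not necessarily what the project did):
# never count the empty token in the first place.
fv_filtered = feature_vector(sample, drop_empty=True)
```

Zeroing the entry leaves a `""` key with count 0 in the vector, while filtering during tokenization keeps the key out entirely; which one is preferable likely depends on how the frequency table is produced downstream.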
should be fixed now #85
The table of the 50 most frequent features that Mark shared from the EEBO cluster test seems to show that we're counting spaces. Either the most or second most frequent feature in the Ignatius, Godwin, Cavendish, and Rawley texts is an empty pair of double quotation marks: ("",63), ("",40), ("",859), ("",67). Does this mean spaces are being counted, or is this notation for something else? In any case, that feature is not a word.