eliranwong closed 7 years ago
I just did two queries to see how many words I have.
count(//w)
=> 137832
count(//w[@osisId])
=> 137832
So my counts do not match yours for the trees, but they do for Nestle1904.csv:
$ wc -l Nestle1904.csv
137779 Nestle1904.csv
I don't know why, but the trees seem to have 4 more words than Nestle1904.csv. I will leave this bug open until I find and fix it.
Found several duplicate syntax trees. I think this goes back to an experiment GBI was doing with alternate interpretations. I want to add support for alternate interpretations, but want to do it a different way.
After removing the duplicate trees, the word counts match.
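For anyone who wants to check their own copy for this kind of duplication, here is a rough Python sketch. It is not the procedure used to fix the trees; it only mirrors the two XPath counts above and then lists any osisId value that occurs more than once. It assumes the words are `w` elements carrying an `osisId` attribute, as in the queries above, and that the file uses no XML namespace. The sample data and the `count_words` helper are made up for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_words(root):
    """Return (all <w> elements, <w> elements with osisId, duplicated ids)."""
    words = list(root.iter("w"))  # same idea as count(//w)
    ids = [w.get("osisId") for w in words if w.get("osisId") is not None]
    # An osisId that appears more than once hints at a duplicated tree.
    dupes = {k: n for k, n in Counter(ids).items() if n > 1}
    return len(words), len(ids), dupes


# Hypothetical miniature input: the first sentence appears twice.
SAMPLE = """<book>
  <sentence><w osisId="Matt.1.1!1">a</w><w osisId="Matt.1.1!2">b</w></sentence>
  <sentence><w osisId="Matt.1.1!1">a</w><w osisId="Matt.1.1!2">b</w></sentence>
  <sentence><w osisId="Matt.1.1!3">c</w></sentence>
</book>"""

root = ET.fromstring(SAMPLE)
total, with_id, dupes = count_words(root)
print(total, with_id, sorted(dupes))
```

For a real file, replace `ET.fromstring(SAMPLE)` with `ET.parse(path).getroot()`. If the number of distinct osisId values is lower than the `count(//w)` result, the difference is the number of words contributed by repeated trees.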
No. of words in low-fat: 137776 (by searching for osisId="[^\r<>"]*?")
No. of words in Nestle1904.csv: 137779
Why the difference?