ARTFL-Project / PhiloLogic4

GNU General Public License v3.0

Tokenizing issue in collocations with hyphen tags #412

Closed clovis closed 9 years ago

clovis commented 9 years ago

The ECCO database has many in-word tags (tags that fall inside a single word), and these break tokenization in collocations.

We could fix this by removing the element that causes the issue. In the case of ECCO, it's ```

Richard, how do you feel about tweaking the tokenizing regex we currently use for collocations so that it detects in-word tags?
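To make the regex idea concrete, here is a minimal sketch of what detecting in-word tags could look like. It is not the PhiloLogic4 collocation code: the names `IN_WORD_TAG`, `ANY_TAG`, `TOKEN`, and `tokenize` are made up for illustration, and `<x/>` stands in for whatever element the database actually uses. The assumption is that a tag with a word character directly on both sides sits inside a word and should be dropped without leaving a gap, while every other tag acts as a token boundary.

```python
import re

# Hypothetical pattern: a tag with a word character immediately before and after it
# is assumed to sit inside a word.
IN_WORD_TAG = re.compile(r"(?<=\w)<[^>]+>(?=\w)")
# Any remaining tag is treated as a token boundary.
ANY_TAG = re.compile(r"<[^>]+>")
# Simple word tokenizer, used here only for illustration.
TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Return lowercase word tokens, joining words that are split by an in-word tag."""
    joined = IN_WORD_TAG.sub("", text)   # drop in-word tags without inserting a space
    no_tags = ANY_TAG.sub(" ", joined)   # replace all other tags with a space
    return TOKEN.findall(no_tags.lower())


if __name__ == "__main__":
    print(tokenize("the philo<x/>sopher <i>said</i> so"))
```

This heuristic has known gaps: two tags in a row inside a word are not joined, and a tag that opens directly after a word with no whitespace (for example `word<i>italic</i>`) would be wrongly merged, so a real fix would probably need to name the specific element.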

clovis commented 9 years ago

The problem actually ended up being in how we truncate the right-side concordance. I commented out the truncating regex on line 138: https://github.com/ARTFL-Project/PhiloLogic4/blob/master/www/reports/collocation.py#L138

This does seem to reduce the number of collocates for any given hit, but only by about 1 or 2%, which I think is fine given that we'll be completely revamping this code soon.

clovis commented 9 years ago

Fixed in new collocation code