asu-ke-web-services / search-api

Search API for documents, data, research, people, etc
MIT License

Come up with ideas on how to handle the NER tagger tokenizing on whitespace #72

Open rraub opened 8 years ago

rraub commented 8 years ago

So "ken price" becomes two tokens, "ken" and "price", instead of one.

iajohns1 commented 8 years ago

I've been looking into it, and I think the most accurate (albeit slowest) method would be to combine all adjacent terms of the same type into all possible combinations. I think it would also be smart to give longer chains a higher relevance rating (if relevance is between 0 and 1, then relevance = length / (length + 1)).

ex) "ken price ian johnson" will add the terms "ken price", "ken price ian", "ken price ian johnson", "price ian", "price ian johnson", and "ian johnson".
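The combination idea above could be sketched roughly like this (a minimal sketch, assuming a hypothetical helper `ngram_spans` that is not part of the codebase; it enumerates every contiguous span of adjacent same-type tokens and scores it with the relevance = length / (length + 1) formula):

```python
def ngram_spans(tokens):
    """Generate every contiguous span of the given tokens (singles included),
    paired with a relevance score of length / (length + 1), so longer chains
    score higher: 1 token -> 0.5, 2 tokens -> 0.667, 4 tokens -> 0.8, etc."""
    spans = []
    n = len(tokens)
    for i in range(n):
        for j in range(i + 1, n + 1):
            length = j - i
            spans.append((" ".join(tokens[i:j]), length / (length + 1)))
    return spans

# Example: the four tokens from the comment above produce
# 10 spans in total (n * (n + 1) / 2 for n = 4), including
# "ken price" and the full chain "ken price ian johnson".
spans = ngram_spans(["ken", "price", "ian", "johnson"])
```

Note this runs in O(n^2) spans per same-type chain, which matches the "slowest" caveat above; in practice the chains of adjacent same-type NER tags are usually short, so the blowup should stay small.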

rraub commented 8 years ago

FYI the Stanford folks have a tokenizer that might be useful.

Another drawback to using n-grams (like you suggested) is having to reconstruct the results, since you're getting more than one tag per word.