Open rraub opened 8 years ago
I've been looking into it, and I think the most accurate (albeit slowest) method would be to combine all adjacent terms of the same type into all possible combinations. I think it would also be smart to give longer chains a higher relevance rating (if relevance is between 0 and 1, then relevance = length / (length + 1)).
ex) "ken", "price", "ian", "johnson" will add the terms "ken price", "ken price ian", "ken price ian johnson", "price ian", "price ian johnson", and "ian johnson".
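To make the idea concrete, here is a rough Python sketch of it. The function name `adjacent_combinations` is made up for illustration, and it assumes the input list has already been narrowed down to one run of adjacent terms of the same type. It only emits chains of two or more terms, since the single terms already exist on their own.

```python
from typing import List, Tuple


def adjacent_combinations(terms: List[str]) -> List[Tuple[str, float]]:
    """Return every run of 2+ adjacent terms with a length-based relevance.

    Assumes `terms` is a single run of adjacent terms of the same type.
    """
    results = []
    for start in range(len(terms)):
        # end is exclusive, so start + 2 gives the shortest chain (two terms)
        for end in range(start + 2, len(terms) + 1):
            length = end - start
            # relevance = length / (length + 1), so longer chains score higher
            relevance = length / (length + 1)
            results.append((" ".join(terms[start:end]), relevance))
    return results


for phrase, relevance in adjacent_combinations(["ken", "price", "ian", "johnson"]):
    print(f"{phrase}: {relevance:.2f}")
```

For n adjacent terms this produces n * (n - 1) / 2 extra terms, which is where the slowness comes from on long runs.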
FYI the Stanford folks have a tokenizer that might be useful.
Another drawback to using n-grams (like you suggested) is having to reconstruct the results, since you're getting more than one tag per word.
So "ken price" becomes two tokens, "ken" and "price", instead of one.