direct-phonology / dphon

uncover old chinese textual parallels based on sound
MIT License

score results for relevance #130

Open thatbudakguy opened 3 years ago

thatbudakguy commented 3 years ago

running against a large corpus, especially with some settings, can produce a huge volume of results. many of them are "low-quality": the matching portion consists of superficially similar elements that don't carry much semantic weight.

adjusting the match length can help, but there might be other heuristics we can use to improve relevance. one possibility is TF-IDF.
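a rough sketch of what TF-IDF scoring could look like, not dphon's actual code. it assumes each document is a plain string and each character is one token; `idf_table` and `match_score` are hypothetical names. a match made up of characters that appear in nearly every document scores low, and one containing rarer characters scores higher.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed inverse document frequency for every character in the corpus."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each character once per document
    return {ch: math.log((1 + n_docs) / (1 + n)) + 1 for ch, n in df.items()}


def match_score(match_text, idf):
    """Mean TF-IDF weight per character of a match (hypothetical relevance score)."""
    if not match_text:
        return 0.0
    tf = Counter(match_text)
    # assumption: characters never seen in the corpus are treated as maximally rare
    default = max(idf.values(), default=1.0)
    total = sum(n * idf.get(ch, default) for ch, n in tf.items())
    return total / len(match_text)


docs = ["之乎者也", "之子于歸", "之不可"]
idf = idf_table(docs)
common = match_score("之", idf)  # appears in every document
rare = match_score("歸", idf)    # appears in only one document
```

dividing by the match length keeps long matches from winning on size alone, since match length is already a separate setting.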

thatbudakguy commented 3 years ago

one possible quick n' dirty way to do this is to implement something like passim's --max-series, which for us would translate to dropping seed groups from the index if there are too many entries in the group (indicating a super common seed).
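a minimal sketch of that pruning step, assuming the index is a dict mapping each seed to a list of `(doc_id, position)` entries. that layout, `prune_seed_groups`, and `max_group_size` are all made up for illustration and may not match how dphon stores its index.

```python
def prune_seed_groups(index, max_group_size):
    """Return a copy of the index without seed groups larger than max_group_size."""
    return {
        seed: entries
        for seed, entries in index.items()
        if len(entries) <= max_group_size
    }


# hypothetical index: seed -> list of (doc_id, position)
index = {
    "之乎": [(0, 3), (1, 7), (2, 1)],  # very common seed
    "歸寧": [(1, 12)],                 # rare seed
}
pruned = prune_seed_groups(index, max_group_size=2)
```

here the three-entry group for the common seed is dropped and only the rare seed survives.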

if we do TF-IDF, we can also implement that at the seed level to prune the graph early.
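the seed-level version could look something like this: compute an IDF per seed from the number of distinct documents it occurs in, and drop seeds below a cutoff before building the graph. same assumed index layout as a dict of seed to `(doc_id, position)` entries; `prune_by_idf` and `min_idf` are hypothetical, and the cutoff value would need tuning.

```python
import math


def prune_by_idf(index, n_docs, min_idf):
    """Keep only seed groups whose inverse document frequency is at least min_idf."""
    kept = {}
    for seed, entries in index.items():
        doc_freq = len({doc_id for doc_id, _ in entries})
        if doc_freq == 0:
            continue  # an empty group carries no information
        idf = math.log(n_docs / doc_freq)
        if idf >= min_idf:
            kept[seed] = entries
    return kept


# hypothetical index over a 4-document corpus
seed_index = {
    "之乎": [(0, 3), (1, 7), (2, 1), (3, 9)],  # in every document, idf = 0
    "歸寧": [(1, 12)],                         # in one document, idf = log(4)
}
kept = prune_by_idf(seed_index, n_docs=4, min_idf=0.5)
```

unlike a fixed group-size cap, this scales with corpus size, since the cutoff depends on the share of documents containing the seed.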