Open thatbudakguy opened 3 years ago
one possible quick n' dirty way to do this is to implement something like passim's `--max-series`, which for us would translate to dropping seed groups from the index if there are too many entries in the group (indicating a super common seed).
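
a rough sketch of what dropping oversized seed groups could look like. this is not the project's actual index code: `build_seed_index`, `prune_common_seeds`, the `(doc_id, position)` location tuples and the `max_group_size` parameter are all hypothetical names made up for illustration, and the cutoff is only assumed to play a role similar to the passim option.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

# Hypothetical location type: (document id, offset of the seed in that document).
Location = Tuple[str, int]


def build_seed_index(
    seeds: Iterable[Tuple[Hashable, Location]],
) -> Dict[Hashable, List[Location]]:
    """Group seed occurrences by seed, giving one 'seed group' per distinct seed."""
    index: Dict[Hashable, List[Location]] = defaultdict(list)
    for seed, location in seeds:
        index[seed].append(location)
    return dict(index)


def prune_common_seeds(
    index: Dict[Hashable, List[Location]],
    max_group_size: int,
) -> Dict[Hashable, List[Location]]:
    """Drop every seed group with more than max_group_size entries.

    A very large group suggests a very common seed, which is unlikely to
    point at a meaningful match and is expensive to extend.
    """
    return {
        seed: locations
        for seed, locations in index.items()
        if len(locations) <= max_group_size
    }


if __name__ == "__main__":
    occurrences = [
        ("ab", ("doc1", 0)),
        ("ab", ("doc2", 3)),
        ("ab", ("doc3", 5)),
        ("cd", ("doc1", 2)),
        ("cd", ("doc2", 7)),
    ]
    index = build_seed_index(occurrences)
    # "ab" has three entries and is dropped; "cd" has two and is kept.
    print(prune_common_seeds(index, max_group_size=2))
```

the upside of a hard cutoff like this is that it runs before any seed extension, so the expensive part of the pipeline never sees the common seeds at all.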
if we do TF-IDF, we can also implement that at the seed level to prune the graph early.
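
one way the seed-level version might look, as a sketch only: compute an inverse document frequency per seed and drop seeds below a threshold. the index shape (seed mapped to a list of `(doc_id, position)` tuples), the function names and the `min_idf` parameter are assumptions for illustration, and plain `log(N / df)` is just one of several common IDF variants.

```python
import math
from typing import Dict, Hashable, List, Tuple

# Hypothetical location type: (document id, offset of the seed in that document).
Location = Tuple[str, int]


def seed_idf(
    index: Dict[Hashable, List[Location]],
    n_docs: int,
) -> Dict[Hashable, float]:
    """Compute log(n_docs / df) per seed, where df counts distinct documents."""
    idf: Dict[Hashable, float] = {}
    for seed, locations in index.items():
        df = len({doc_id for doc_id, _ in locations})
        if df == 0:
            continue  # skip empty groups rather than divide by zero
        idf[seed] = math.log(n_docs / df)
    return idf


def prune_by_idf(
    index: Dict[Hashable, List[Location]],
    n_docs: int,
    min_idf: float,
) -> Dict[Hashable, List[Location]]:
    """Keep only seeds whose IDF is at least min_idf (rarer seeds score higher)."""
    idf = seed_idf(index, n_docs)
    return {
        seed: locations
        for seed, locations in index.items()
        if seed in idf and idf[seed] >= min_idf
    }


if __name__ == "__main__":
    index = {
        "ab": [("d1", 0), ("d2", 1), ("d3", 2), ("d4", 0)],  # in every document
        "cd": [("d1", 5)],  # in one document only
    }
    # With 4 documents, "ab" has IDF log(4/4) = 0 and is dropped.
    print(prune_by_idf(index, n_docs=4, min_idf=0.5))
```

compared with a raw group-size cutoff, this counts distinct documents rather than total entries, so a seed repeated many times inside a single text is not penalized for it.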
running against a large corpus, especially with some settings, can result in a huge volume of results. many of them are "low-quality" in that the matching portion consists of superficially similar elements that don't carry much semantic weight.
adjusting the match length can help, but there might be other heuristics we can use to improve relevance. one possibility is TF-IDF.
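
to make the TF-IDF idea a bit more concrete, here is a minimal sketch that scores a match by the average IDF of its tokens, so matches made up of very common elements sink to the bottom. it assumes documents and matches are available as lists of tokens; `document_frequencies` and `match_score` are hypothetical helpers, and the smoothed `log((1 + N) / (1 + df))` weighting is one possible choice, not a settled design.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence


def document_frequencies(docs: Iterable[Sequence[str]]) -> Counter:
    """Count, for each token, how many documents contain it at least once."""
    df: Counter = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def match_score(match_tokens: Sequence[str], df: Counter, n_docs: int) -> float:
    """Average smoothed IDF of the tokens in a match; higher means rarer content."""
    if not match_tokens:
        return 0.0
    total = sum(
        math.log((1 + n_docs) / (1 + df[token]))  # Counter gives 0 for unseen tokens
        for token in match_tokens
    )
    return total / len(match_tokens)


if __name__ == "__main__":
    corpus: List[List[str]] = [
        ["a", "b", "c"],
        ["a", "b", "d"],
        ["a", "e", "f"],
    ]
    df = document_frequencies(corpus)
    n_docs = len(corpus)
    # A match made of a token found in every document scores lower
    # than a match made of tokens found in a single document.
    print(match_score(["a", "a"], df, n_docs))
    print(match_score(["c", "d"], df, n_docs))
```

a score like this could be used either to filter results below a threshold or just to sort them, which leaves the low-quality matches available but out of the way.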