Thanks for providing this convenient toolkit.
I retrieved BM25 and TF-IDF sparse vectors from a Lucene index (provided by Pyserini)
and used this project to build a sparse index for search.
I find that this index cannot beat the original Lucene search results.
(The problem seems to have little effect on tiny datasets or semantically dispersed datasets,
but as the dataset grows the shortcoming can no longer be ignored, and large datasets are exactly the situation this project is meant for.)
This is not a problem with your clustering search algorithm, but with the sparse features themselves.
And if I use SVD to reduce the dimensionality of the sparse data, it only preserves topic-level features.
So I don't understand the real use of sparse features beyond computing search scores (like BM25),
because they seem weaker than both a true lexical score (BM25) and dense semantic similarity based on BERT
sentence embeddings (like Sentence-Transformers).
Can you recommend some good reference material on constructing sparse text features that would let me use this project in
a suitable way?
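
To make the setup concrete, here is a minimal sketch of what "BM25 sparse vectors" means in this context. It is illustrative only and is not the Pyserini or Lucene code path: the function names `bm25_vectors` and `rank` are hypothetical, tokenization is plain whitespace splitting instead of a Lucene analyzer, the formula is a textbook-style BM25 that may differ in detail from what Lucene computes, and the `k1` and `b` values are assumed defaults.

```python
import math
from collections import Counter


def bm25_vectors(docs, k1=0.9, b=0.4):
    """Build one sparse {term: weight} vector per document.

    k1 and b are assumed values; the weighting is textbook-style BM25.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log(1 + (n_docs - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Length normalization term shared by all terms in this document.
        norm = k1 * (1 - b + b * len(toks) / avg_len)
        vectors.append(
            {t: idf[t] * c * (k1 + 1) / (c + norm) for t, c in tf.items()}
        )
    return vectors


def rank(query, vectors):
    """Score each document by the dot product with a binary query vector."""
    terms = set(query.lower().split())
    scores = [(i, sum(vec.get(t, 0.0) for t in terms)) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets",
    ]
    for doc_id, score in rank("dog cat", bm25_vectors(corpus)):
        print(doc_id, round(score, 4))
```

With vectors built this way, the dot product against a binary query vector is the sum of the per-term BM25 weights, so it gives the same ranking as scoring the query directly with this formula. Whether a given sparse index scores by plain dot product or by some normalized similarity is not something this sketch assumes.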