fossology / atarashi

Atarashi scans for license statements in open source software, focusing on text statistics. Designed to work stand-alone and with FOSSology.
http://fossology.github.io/atarashi
GNU General Public License v2.0

Improve TF-IDF agent by tuning matches threshold #95

Open xavierfigueroav opened 2 years ago

xavierfigueroav commented 2 years ago

Hello.

I've been playing around with some parameters of the TF-IDF agent.

I've found that if we stop using a threshold (cosine similarity >= 0.30) to filter the match results, the accuracy improves by up to 3 percentage points. However, the filter also reduces compute time, because the results are sorted at the end of the search and fewer matches means less to sort. See the code I am talking about (especially lines 126 and 133):

https://github.com/fossology/atarashi/blob/6cdd4104a278b6d993363d5989c859ab78e5e21c/atarashi/agents/tfidf.py#L124-L136
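To make the trade-off concrete, here is a minimal, self-contained sketch of the filter-then-sort pattern. It is not the code from `tfidf.py`: the helper names (`tfidf_vectors`, `cosine`, `rank_matches`), the toy license texts, and the smoothed IDF formula are all assumptions chosen for illustration. It only shows how a lower threshold lets more candidates through to the final sort.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for tokenized docs; returns (vectors, idf)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption for this sketch, not necessarily the agent's formula).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_matches(query_tokens, licenses, threshold=0.30):
    """Keep licenses whose similarity to the query is >= threshold, best first."""
    names = list(licenses)
    vectors, idf = tfidf_vectors([licenses[name] for name in names])
    query = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    matches = []
    for name, vec in zip(names, vectors):
        score = cosine(query, vec)
        if score >= threshold:  # the filter under discussion
            matches.append((name, score))
    # The sort cost grows with the number of matches that survive the filter.
    matches.sort(key=lambda m: m[1], reverse=True)
    return matches

# Hypothetical toy corpus, only for illustration.
LICENSES = {
    "MIT": "permission is hereby granted free of charge".split(),
    "GPL": "this program is free software you can redistribute".split(),
    "Apache": "licensed under the apache license version".split(),
}
QUERY = "permission is hereby granted".split()

strict = rank_matches(QUERY, LICENSES, threshold=0.30)
loose = rank_matches(QUERY, LICENSES, threshold=0.0)
```

With `threshold=0.0` every candidate reaches the sort, which is consistent with the longer run time I measured for that setting.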

Using the evaluation.py script, I've carried out some experiments:

| #  | Algorithm                                  | Time elapsed | Accuracy |
|----|--------------------------------------------|--------------|----------|
| 1  | tfidf (CosineSim) (thr=0.30)               | 30.19        | 59.0%    |
| 2  | tfidf (CosineSim) (thr=0.17)               | 35.29        | 61.0%    |
| 3  | tfidf (CosineSim) (thr=0.16, max_df=0.10)  | 27.34        | 62.0%    |
| 4  | tfidf (CosineSim) (thr=0.16)               | 36.42        | 62.0%    |
| 5  | tfidf (CosineSim) (thr=0.15)               | 38.45        | 62.0%    |
| 6  | tfidf (CosineSim) (thr=0.10)               | 39.91        | 62.0%    |
| 7  | tfidf (CosineSim) (thr=0.00)               | 61.49        | 62.0%    |
| 8  | Ngram (CosineSim)                          | -            | 57.0%    |
| 9  | Ngram (BigramCosineSim)                    | -            | 56.0%    |
| 10 | Ngram (DiceSim)                            | -            | 55.0%    |
| 11 | wordFrequencySimilarity                    | -            | 23.0%    |
| 12 | DLD                                        | -            | 17.0%    |
| 13 | tfidf (ScoreSim)                           | -            | 13.0%    |
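For readers unfamiliar with `max_df`: assuming it refers to the scikit-learn `TfidfVectorizer` parameter of that name, a float value drops every term whose document frequency (as a fraction of all documents) is strictly higher than the given value. The sketch below mimics that pruning in plain Python. The function `prune_by_max_df` and the toy documents are hypothetical and are not part of atarashi.

```python
from collections import Counter

def prune_by_max_df(docs, max_df):
    """Drop terms that appear in more than a max_df fraction of the documents."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    keep = {t for t, count in df.items() if count / n <= max_df}
    pruned = [[t for t in doc if t in keep] for doc in docs]
    return pruned, keep

# Hypothetical toy documents, only for illustration.
DOCS = [
    ["the", "software", "mit"],
    ["the", "software", "gpl"],
    ["the", "apache"],
    ["the", "bsd"],
]

pruned_docs, vocabulary = prune_by_max_df(DOCS, max_df=0.5)
```

Here "the" occurs in all four documents and is removed, while "software" (half of the documents) stays. A smaller vocabulary means shorter vectors, which may explain why row 3 is faster than row 4 at the same threshold.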

I will be opening a PR for you to reproduce the results in row 3 and merge the changes if you consider them relevant.

Important notes:

GMishx commented 2 years ago

That's a very detailed evaluation, @xavierfigueroav. Thank you for providing the info.

Maybe, if you can provide a good overview of the baseline, we can put it on our wiki and use it to compare with different solutions (as you mentioned).