Improve TF-IDF agent by tuning matches threshold

Hello.

I've been playing around with some parameters of the TF-IDF agent.

I've found that if we stop using a threshold (cosine similarity >= 0.30) to filter the match results, the accuracy improves by up to 3 percentage points. However, filtering helps reduce the compute time, since the results are sorted at the end of the search and a smaller result list is cheaper to sort. See the piece of code I am talking about (especially lines 126 and 133):

https://github.com/fossology/atarashi/blob/6cdd4104a278b6d993363d5989c859ab78e5e21c/atarashi/agents/tfidf.py#L124-L136
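To make the trade-off concrete, here is a minimal sketch of the filter-then-sort pattern described above. It is not the atarashi code: the function name `filter_and_rank` and the license scores are hypothetical, chosen only to illustrate how a candidate with a similarity between 0.16 and 0.30 is discarded by the stricter threshold but kept by the looser one.

```python
from typing import List, Tuple


def filter_and_rank(
    scores: List[Tuple[str, float]], threshold: float
) -> List[Tuple[str, float]]:
    """Keep matches whose similarity is >= threshold, best match first."""
    # Filtering first means the sort below works on fewer items.
    kept = [(name, sim) for name, sim in scores if sim >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


# Hypothetical cosine similarities between one input file and four licenses.
scores = [
    ("MIT", 0.42),
    ("Apache-2.0", 0.18),
    ("GPL-2.0", 0.31),
    ("BSD-3-Clause", 0.05),
]

strict = filter_and_rank(scores, threshold=0.30)  # drops Apache-2.0
loose = filter_and_rank(scores, threshold=0.16)   # keeps Apache-2.0
unfiltered = filter_and_rank(scores, threshold=0.00)

print(strict)
print(loose)
print(len(unfiltered))
```

With a threshold of 0.00 every candidate reaches the sort step, which matches the observation that removing the threshold costs time.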

Using the evaluation.py script, I've carried out some experiments:

| # | Algorithm | Time elapsed | Accuracy |
|---|---|---|---|
| 1 | *tfidf (CosineSim) (thr=0.30)* | *30.19* | *59.0%* |
| 2 | tfidf (CosineSim) (thr=0.17) | 35.29 | 61.0% |
| 3 | tfidf (CosineSim) (thr=0.16, max_df=0.10) | 27.34 | 62.0% |
| 4 | tfidf (CosineSim) (thr=0.16) | 36.42 | 62.0% |
| 5 | tfidf (CosineSim) (thr=0.15) | 38.45 | 62.0% |
| 6 | tfidf (CosineSim) (thr=0.10) | 39.91 | 62.0% |
| 7 | tfidf (CosineSim) (thr=0.00) | 61.49 | 62.0% |
| 8 | Ngram (CosineSim) | - | 57.0% |
| 9 | Ngram (BigramCosineSim) | - | 56.0% |
| 10 | Ngram (DiceSim) | - | 55.0% |
| 11 | wordFrequencySimilarity | - | 23.0% |
| 12 | DLD | - | 17.0% |
| 13 | tfidf (ScoreSim) | - | 13.0% |

Row 1 shows the performance (speed and accuracy) of the current configuration of the TF-IDF agent using CosineSim as the similarity measure.
Row 7 shows that we can reach an accuracy of 62.0% just by removing the threshold (cosine similarity >= 0.00). However, removing the threshold makes the agent about 2x slower (61.49 vs. 30.19), so I kept tuning it to find the highest threshold that still produces 62.0% accuracy, which is 0.16, shown in row 4.
To reduce the execution time further without losing accuracy, I tuned some parameters of the TfidfVectorizer. Setting max_df to 0.10 (default is 1.0) keeps the accuracy at 62.0% and makes the agent about 1.1x faster than the current configuration (27.34 vs. 30.19), shown in row 3.
- Why does decreasing the max_df value increase the speed? Because the vectorizer ignores all the terms that appear in more than a max_df fraction of the documents (see docs), i.e., it ignores the most frequent terms. The vocabulary shrinks, so each document vector has fewer dimensions and the cosine similarity is cheaper to compute.
- Why does decreasing the max_df value keep the accuracy high? My explanation is that the terms that appear in most licenses do not help the algorithm distinguish licenses; rare terms are the ones that make licenses different from each other, so they are enough for the algorithm to do a good job.
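The effect of max_df on the vocabulary can be sketched in a few lines of plain Python. This is an illustration only: it does not use scikit-learn, and the helper `build_vocabulary` and the four toy documents are hypothetical. It mimics the documented behaviour of a float max_df, where a term is dropped if it appears in more than `max_df * n_documents` documents.

```python
from collections import Counter
from typing import List


def build_vocabulary(docs: List[str], max_df: float = 1.0) -> List[str]:
    """Return the sorted terms whose document frequency is within max_df.

    Simplified stand-in for the max_df filter of a TF-IDF vectorizer:
    whitespace tokenisation, lowercasing, no stop words, no min_df.
    """
    # Use a set per document so each term counts once per document.
    tokenized = [set(doc.lower().split()) for doc in docs]
    doc_freq = Counter(term for terms in tokenized for term in terms)
    max_doc_count = max_df * len(docs)
    return sorted(t for t, count in doc_freq.items() if count <= max_doc_count)


# Toy "licenses": the word "is" occurs in every document.
docs = [
    "permission is hereby granted free of charge",
    "redistribution and use in source and binary forms is permitted",
    "this program is free software",
    "licensed under the apache license is required",
]

full = build_vocabulary(docs, max_df=1.0)
reduced = build_vocabulary(docs, max_df=0.5)

print(len(full), len(reduced))
print("is" in full, "is" in reduced)
```

In this toy corpus only the term shared by every document is removed at max_df=0.5, while terms specific to one or two documents survive, which is the intuition behind both bullet points above.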

I will be opening a PR for you to reproduce the results in row 3 and merge the changes if you consider them relevant.

Important notes:

- I've left out the timings for all the other algorithms because I ran those experiments in a different environment, so comparing the times wouldn't be fair.
- All the results differ from the last report I could find out there. I do not fully understand why some of them are so different; probably changes in the test files or changes in the algorithms. Anyway, 62.0% is the best result across both reports.
- My findings may help improve other agents that use thresholds, such as Ngram.
- This new state-of-atarashi performance 😅 may also raise the bar for future agent implementations, since it would be the new baseline.
