Open xavierfigueroav opened 2 years ago
That's a very detailed evaluation, @xavierfigueroav. Thank you for providing the info.
Maybe, if you can provide a good overview of the baseline, we can put it on our wiki and use it to compare with different solutions (as you mentioned).
Hello.
I've been playing around with some parameters of the TF-IDF agent.
I've found that if we stop using a threshold (`cosine similarity >= 0.30`) to filter the match results, the accuracy improves by up to 3 points. However, filtering helps reduce the compute time, since the results are sorted at the end of the search and the threshold leaves fewer of them to sort. See the piece of code I am talking about (especially lines 126 and 133): https://github.com/fossology/atarashi/blob/6cdd4104a278b6d993363d5989c859ab78e5e21c/atarashi/agents/tfidf.py#L124-L136
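To make the trade-off concrete, here is a minimal pure-Python sketch of threshold filtering followed by sorting. It is not the code in `tfidf.py`: the `rank_matches` helper, the toy license vectors, and the use of raw term counts instead of TF-IDF weights are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_matches(query, licenses, threshold=0.30):
    """Score every license, keep scores >= threshold, sort best first.

    Hypothetical helper: a lower threshold keeps more candidates,
    so there is more to sort and return.
    """
    matches = []
    for name, vector in licenses.items():
        score = cosine_similarity(query, vector)
        if score >= threshold:
            matches.append((name, score))
    matches.sort(key=lambda match: match[1], reverse=True)
    return matches


# Toy term-count vectors (made-up text, not real license data).
licenses = {
    "license-a": Counter("permission is hereby granted free of charge".split()),
    "license-b": Counter("redistribution in source and binary forms".split()),
    "license-c": Counter("permission to use copy modify and distribute".split()),
}
query = Counter("permission is granted to use and copy".split())

with_threshold = rank_matches(query, licenses, threshold=0.30)
without_threshold = rank_matches(query, licenses, threshold=0.00)
```

With the threshold at `0.30`, the weakly matching `license-b` is dropped before sorting; with `0.00`, every license is kept and sorted, which is where the extra work comes from.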
Using the `evaluation.py` script, I've carried out some experiments:

- Accuracy reaches 62.0% with no threshold (`cosine similarity >= 0.00`). However, just removing the threshold makes the agent 2x slower, so I continued tuning the threshold and kept the last value that still produces 62.0% accuracy, which is `0.16`, shown in row 4.
- Setting `max_df` to `0.10` (default is `1.0`) keeps the accuracy at 62.0%, but makes the agent 1.1x faster, shown in row 3.
- Why does a lower `max_df` value increase the speed? Because the vectorizer ignores all the terms that appear in more than a `max_df` proportion of the documents (see docs), i.e., it ignores the most frequent terms, so each document vector is shorter and the cosine similarity is cheaper to compute.
- Why does a lower `max_df` value keep the accuracy high? My explanation is that the terms that appear in most licenses do not help the algorithm distinguish licenses; rare terms are the ones that make licenses differ from each other, so they are enough for the algorithm to do a good job.

I will be opening a PR for you to reproduce the results in row 3 and merge the changes if you consider them relevant.
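The `max_df` explanation above can be sketched in a few lines of plain Python. This is only an illustration of document-frequency pruning, not the vectorizer's actual implementation: `build_vocabulary` is a hypothetical helper, and the four short documents are made up.

```python
def build_vocabulary(documents, max_df=1.0):
    """Keep terms whose document frequency, as a proportion, is <= max_df.

    Hypothetical stand-in for the pruning step described above:
    terms present in too many documents are left out of the vocabulary.
    """
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term once per document, however often it repeats.
        for term in set(doc.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, df in doc_freq.items() if df / n_docs <= max_df)


# Made-up mini corpus; "the" appears in every document.
docs = [
    "the software is provided as is",
    "the license grants permission to copy the software",
    "the work may be redistributed under the same license",
    "the author disclaims warranty",
]

full_vocab = build_vocabulary(docs, max_df=1.0)
pruned_vocab = build_vocabulary(docs, max_df=0.5)
```

Here `pruned_vocab` loses only "the", which occurs in every document and so cannot help tell the documents apart, while terms such as "software" and "license" (each in half of the documents) survive. A smaller vocabulary means fewer dimensions per document vector, which is the speed-up described above.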
Important notes: