fossology / atarashi

Atarashi scans for license statements in open source software, focusing on text statistics. Designed to work stand-alone and with FOSSology.
http://fossology.github.io/atarashi
GNU General Public License v2.0

feat(tfidf): Speed and match improvement #78

Closed GMishx closed 3 years ago

GMishx commented 3 years ago

The tfidf match is faster because the array is now fetched once before the loop instead of on every iteration. The algorithm implementation is also improved: the test data is no longer part of the training data.
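The speed-up described above is a loop-invariant hoist. The sketch below is a minimal illustration, not the project's code: `CountingMatrix`, `best_match_slow`, and `best_match_fast` are hypothetical names, and the class only imitates a sparse matrix whose `toarray()` call is expensive.

```python
import math


class CountingMatrix:
    """Hypothetical stand-in for a sparse TF-IDF matrix that counts dense conversions."""

    def __init__(self, rows, width):
        self.rows = rows          # one {column: weight} dict per license
        self.width = width
        self.conversions = 0

    def toarray(self):
        self.conversions += 1     # each call rebuilds the full dense matrix
        return [[row.get(col, 0.0) for col in range(self.width)] for row in self.rows]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match_slow(matrix, query):
    scores = []
    for i in range(len(matrix.rows)):
        dense = matrix.toarray()  # loop-invariant work repeated for every license
        scores.append(cosine(dense[i], query))
    return max(range(len(scores)), key=scores.__getitem__)


def best_match_fast(matrix, query):
    dense = matrix.toarray()      # converted once, before the loop
    scores = [cosine(row, query) for row in dense]
    return max(range(len(scores)), key=scores.__getitem__)
```

Both functions return the same index. The slow one performs one conversion per license, the fast one performs a single conversion in total, so the saving grows with the number of licenses.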

Comparisons

After decreasing the match threshold from 80% to 30%, the match accuracy improved from 58% to 60%. Also, a slight refactoring of the code reduced the evaluation time from 3238 seconds to 59.02 seconds.

Evaluator results (screenshots): Master-tfidf-cosine at the 80% threshold, Master-tfidf-cosine-new at the 30% threshold.
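To illustrate why the threshold changes the outcome, here is a minimal sketch. `pick_license` is a hypothetical helper and the scores are invented for illustration; neither is taken from the evaluator run.

```python
def pick_license(scores, threshold):
    """Return the best-scoring license name, or None when it falls below threshold."""
    if not scores:
        return None
    name, score = max(scores.items(), key=lambda item: item[1])
    return name if score >= threshold else None


# Illustrative cosine similarities, not measured values.
scores = {"MirOS": 0.45, "ISC": 0.31, "MIT": 0.12}

strict = pick_license(scores, threshold=0.80)   # no candidate clears 80%, reported as NULL
relaxed = pick_license(scores, threshold=0.30)  # the best candidate clears 30%
```

A lower threshold turns NULL results into guesses. Some of those guesses are right and some are wrong, which is why accuracy rises by less than the number of changed results.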

Following are the changes in matching results:

36c36
< DPTC => NULL
---
> DPTC => BSD-2-Clause
58c58
< JSON => NULL
---
> JSON => MIT-feh
64c64
< LGPL-3.0+ => NULL
---
> LGPL-3.0+ => LGPL-2.1+-KDE-exception
70c70
< MIT-style => NULL
---
> MIT-style => curl
74c74
< MirOS => NULL
---
> MirOS => MirOS
91c91
< WTFPL => NULL
---
> WTFPL => WTFPL
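A quick way to read the diff above is to count how many of the new results are exact. This sketch assumes the left side of `=>` is the expected license and the right side is the scanner's result, and that a match counts only when the two names are identical. `summarize` is a hypothetical helper.

```python
def summarize(changed_lines):
    """Split 'expected => predicted' lines into exact matches and mismatches."""
    exact, wrong = [], []
    for line in changed_lines:
        expected, predicted = (part.strip() for part in line.split("=>"))
        (exact if expected == predicted else wrong).append(expected)
    return exact, wrong


# The '>' side of the diff above, copied verbatim.
new_results = [
    "DPTC => BSD-2-Clause",
    "JSON => MIT-feh",
    "LGPL-3.0+ => LGPL-2.1+-KDE-exception",
    "MIT-style => curl",
    "MirOS => MirOS",
    "WTFPL => WTFPL",
]

exact, wrong = summarize(new_results)
print(exact)  # ['MirOS', 'WTFPL']
print(wrong)  # ['DPTC', 'JSON', 'LGPL-3.0+', 'MIT-style']
```

Under that assumption, two of the six formerly NULL results become exact matches and four become mismatches.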

After removing the test data from training data, the accuracy further increased from 60% to 62%.

Evaluator result (screenshot): New-tfidf-cosine
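One plausible reading of this change, stated here as an assumption and not confirmed by the description, is that the term weights are now learned from the license texts alone and the scanned text is only transformed with them. The sketch below uses hypothetical names (`fit_idf`, `transform`) and toy texts, with the smoothed IDF formula ln((1 + n) / (1 + df)) + 1.

```python
import math
from collections import Counter


def fit_idf(training_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(training_docs)
    doc_freq = Counter(term for doc in training_docs for term in set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def transform(doc, idf):
    """Weight a document with already-learned IDF values; unseen terms are dropped."""
    counts = Counter(doc.split())
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


# Toy inputs for illustration only.
licenses = [
    "permission is hereby granted free of charge",
    "redistribution and use in source and binary forms",
]
scanned = "permission granted to use this software zzz"

idf = fit_idf(licenses)           # vocabulary and weights come from the licenses alone
vector = transform(scanned, idf)  # the scanned text never influences idf
```

Keeping the scanned text out of the fitting step means its own words cannot shift the document frequencies it is later scored against.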

Following are the changes in results:

63c63
< LGPL-3.0 => NULL
---
> LGPL-3.0 => LGPL-2.1+-KDE-exception
80c80
< OpenSSL => NULL
---
> OpenSSL => openvpn-openssl-exception
82c82
< PHP-3.0 => NULL
---
> PHP-3.0 => PHP-3.0
87c87
< SGI-B-2.0 => NULL
---
> SGI-B-2.0 => SGI-B-2.0