fossology / atarashi

Atarashi scans for license statements in open source software, focusing on text statistics. Designed to work stand-alone and with FOSSology.
http://fossology.github.io/atarashi
GNU General Public License v2.0

feat(doc2vec) : Semantic Text Similarity Algorithm with dataset & training code #58

Status: Closed (hastagAB closed this 5 years ago)

hastagAB commented 5 years ago

Description

New open source license scanning algorithm: Semantic Text Similarity finds the similarity between documents according to their semantics. The Gensim implementation of Doc2Vec converts a whole document (unlike word2vec, which works on individual words) into a vector associated with its label. The Doc2Vec model is trained using the filename as the label and the license text as the document. The current training dataset is the txt format of license-list-data provided by SPDX.

Files

Test

amanjain97 commented 5 years ago

Please fix the Travis CI build. The code looks good, nice work. I think the accuracy of the model is not good, but we can improve it later. @hastagAB

hastagAB commented 5 years ago

I have a rebasing issue in the branch and am unable to fix it. Closing this.