fossology / atarashi

Atarashi scans for license statements in open source software, focusing on text statistics. Designed to work stand-alone and with FOSSology.
http://fossology.github.io/atarashi
GNU General Public License v2.0

feat(atarashi): Pre process license once #14

GMishx closed 6 years ago

GMishx commented 6 years ago

Improvements

  1. Create a CSV with preprocessed data from the original licenseList.csv, so the redundant preprocessing task runs only once.
  2. Use the NLTK library to tokenize, filter stop words, and stem (improves similarity matching).
  3. Install Python modules for the current user (avoids installing as root).
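
The tokenize, filter, and stem steps in item 2 can be sketched as below. This is a minimal pure-Python stand-in so that it runs without NLTK or its corpora; it is not the code from this PR. `STOP_WORDS`, `SUFFIXES`, `naive_stem`, and `preprocess` are hypothetical names, and the suffix stripper is far cruder than a real stemmer. With NLTK installed, `nltk.word_tokenize`, `nltk.corpus.stopwords.words("english")`, and `nltk.stem.PorterStemmer().stem` would take the place of the three stand-ins.

```python
import re

# Tiny illustrative stop-word set (assumption); NLTK ships a much larger list.
STOP_WORDS = {"a", "an", "and", "be", "for", "in", "is", "of", "or", "the", "this", "to"}

# Suffixes removed by the naive stemmer, checked in this order (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip at most one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on letter runs, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(naive_stem(t) for t in tokens if t not in STOP_WORDS)


if __name__ == "__main__":
    print(preprocess("Permission is hereby granted to the licensees"))
```

Reducing both the license text and the scanned text to the same stemmed tokens is what lets word variants such as "granted" and "grant" count as a match during similarity scoring.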

How to test

  1. Create a new preprocessed license list from the original list using python LicensePreprocessor.py <LicenseList.csv> <processedList.csv>
  2. Use this processed list in other modules.
  3. Pariksha still needs the original license list.
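
For orientation, the preprocess-once step could look roughly like the sketch below. It is not the actual LicensePreprocessor.py: the `text` and `processed_text` column names, the `normalize` placeholder, and the functions `preprocess_csv` and `main` are all assumptions made for illustration. Only the two-argument command-line shape comes from step 1.

```python
import csv


def normalize(text):
    """Placeholder transform (assumption): lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def preprocess_csv(src_path, dst_path, text_column="text", transform=normalize):
    """Copy the source CSV to dst_path, adding a processed_text column.

    Each row is transformed once here, so later runs can read the
    processed list instead of repeating the work. Returns the row count.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or []) + ["processed_text"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        count = 0
        for row in reader:
            row["processed_text"] = transform(row[text_column])
            writer.writerow(row)
            count += 1
    return count


def main(argv):
    """Mirror the two-argument command line from step 1; returns an exit code."""
    if len(argv) != 3:
        print("usage: python <script> <LicenseList.csv> <processedList.csv>")
        return 1
    rows = preprocess_csv(argv[1], argv[2])
    print(f"wrote {rows} preprocessed rows to {argv[2]}")
    return 0
```

To run it as a script, call `sys.exit(main(sys.argv))` after importing `sys`. Keeping the original columns next to the processed one means a tool that still needs the raw text can read it from the same row.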

amanjain97 commented 6 years ago

The new progress bar animation looks good and works as expected.