```
git clone https://github.com/alexeygrigorev/namespacediscovery-pipeline.git
cd namespacediscovery-pipeline/src
python pipeline.py
```
Modify `luigi.cfg` to set the configuration parameters. You need to change at least the following:

- `[MlpResultsReadTask]/mlp_results`: path to the output of mlp
- `[MlpResultsReadTask]/categories_processed`: path to the category information
- `[DEFAULT]/intermediate_result_dir`: path to the directory where pre-calculated results will be stored

Other parameters (`[DEFAULT]` section):
- `isv_type`: the identifier vector space model; can be `nodef`, `weak` or `strong`
- `vectorizer_dim_red`: the type of dimensionality reduction; can be `none`, `svd`, `nmf` or `random`
- `clustering_algorithm`: the clustering algorithm; currently only `kmeans` is implemented
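Putting the parameters above together, a `luigi.cfg` might look like the sketch below. The key names follow the list above; all paths are placeholders that you must replace with your own:

```ini
[DEFAULT]
; placeholder paths: adjust to your environment
intermediate_result_dir=/path/to/intermediate
isv_type=nodef
vectorizer_dim_red=svd
clustering_algorithm=kmeans

[MlpResultsReadTask]
mlp_results=/path/to/mlp/output
categories_processed=/path/to/categories
```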
is implemented for PyData stack libraries such as numpy, scipy, scikit-learn and nltk it's best to use anaconda installer
Not all dependencies come pre-installed with anaconda, use pip
to install them:
```
pip install python-Levenshtein
pip install fuzzywuzzy
pip install luigi
pip install rdflib
```
We also need to download some data for nltk: the list of stopwords and the model for tokenization. Run the following in the Python console to install them:
```python
import nltk
nltk.download('stopwords')
nltk.download('punkt')
```
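To illustrate what these two resources are for, here is a hand-rolled sketch of stopword removal after tokenization. The tiny hardcoded stopword list and the whitespace tokenizer are stand-ins for nltk's downloaded `stopwords` corpus and `punkt` sentence/word tokenizer, not the pipeline's actual code:

```python
# Tiny stand-in for nltk's stopwords corpus (illustrative only).
STOPWORDS = {"the", "a", "of", "is", "in"}

def tokenize(text):
    # Crude whitespace tokenizer; nltk's punkt model handles
    # punctuation and sentence boundaries properly.
    return text.lower().split()

def remove_stopwords(tokens):
    # Drop tokens that carry little meaning for clustering.
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("The rank of a matrix is the dimension of its column space")
print(remove_stopwords(tokens))
# -> ['rank', 'matrix', 'dimension', 'its', 'column', 'space']
```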
See SETUP.md for an example of how to set up the environment.
We use the following datasets as input:

Classification schemes: the classification scheme datasets are already available in the `data` directory.