eea / eea.corpus

Machine Learning and Natural Language Processing of the EEA Corpus via spaCy, Textacy, pyLDAvis and other useful NLP algorithms.
GNU General Public License v3.0

Split functionality of eea.corpus into multiple scripts #1

Open tiberiuichim opened 7 years ago

tiberiuichim commented 7 years ago

I see several problems that we want to handle:

  1. Corpus preparation (take text, transform it, save it).
  2. Product generation ("generate products"). For example, generate an LDA topic visualisation and save its HTML+JS payload to a folder, generate the TMVA topic browsers, etc.
  3. Browsing and using the generated products.
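
To make the boundary between points 1 and 2 concrete, here is a minimal sketch using only the standard library. The function names `prepare_corpus` and `generate_product`, the JSON-lines file layout, and the word-count "product" are all hypothetical stand-ins: the real preparation would go through spaCy/Textacy and the real product would be something like the pyLDAvis output.

```python
import json
from collections import Counter
from html import escape
from pathlib import Path


def prepare_corpus(raw_texts, corpus_path):
    """Step 1: take text, transform it, save it (one JSON document per line)."""
    corpus_path = Path(corpus_path)
    with corpus_path.open("w", encoding="utf-8") as out:
        for text in raw_texts:
            # Placeholder transform: lowercase and split on whitespace.
            tokens = text.lower().split()
            out.write(json.dumps({"tokens": tokens}) + "\n")
    return corpus_path


def generate_product(corpus_path, product_dir, top_n=10):
    """Step 2: read the saved corpus and write a static HTML product."""
    counts = Counter()
    with Path(corpus_path).open(encoding="utf-8") as src:
        for line in src:
            counts.update(json.loads(line)["tokens"])
    # Stand-in product: a list of the most frequent tokens.
    rows = "".join(
        f"<li>{escape(word)}: {n}</li>" for word, n in counts.most_common(top_n)
    )
    product_dir = Path(product_dir)
    product_dir.mkdir(parents=True, exist_ok=True)
    index = product_dir / "index.html"
    index.write_text(f"<ul>{rows}</ul>", encoding="utf-8")
    return index
```

The only contract between the two steps in this sketch is the saved corpus file, so each step could live in its own script.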

We can do these as command-line scripts. For point 3, a simple HTTP directory index listing is enough, provided the generated products are all static files.
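
For point 3, something along these lines would probably do. It is a rough sketch built on Python's standard `http.server` module (Python 3.7+); the `make_server` and `main` names, the `products_dir` argument and the default port are assumptions for illustration, not anything that exists in eea.corpus today.

```python
import argparse
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory, host="127.0.0.1", port=8000):
    """Return an HTTP server that lists and serves static files under `directory`."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(directory))
    return ThreadingHTTPServer((host, port), handler)


def main(argv=None):
    """Hypothetical command-line entry point for browsing generated products."""
    parser = argparse.ArgumentParser(description="Browse generated products.")
    parser.add_argument("products_dir", help="folder holding the generated static files")
    parser.add_argument("--port", type=int, default=8000)
    args = parser.parse_args(argv)

    server = make_server(args.products_dir, port=args.port)
    print(f"Serving {args.products_dir} on http://127.0.0.1:{server.server_address[1]}/")
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass
    finally:
        server.server_close()
```

Calling `main(["products"])` from a script would serve the `products` folder with a directory index. For a quick look, `python -m http.server --directory products` gives the same listing without any code.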

demarant commented 7 years ago

@tiberiuichim Yes, we can refactor as we go. I generally agree with the split. One thing to be aware of: 1) preparation is not generic; it depends heavily on 2) the product output. I don't think we are building a framework here; we are merely using existing frameworks and techniques to interpret the EEA corpus as input data and build smart products on top of it.