dvmorozov / arxiv

ArxivExpress - arxiv.org client for Android and iOS, ArxivNavigator - interactive arxiv.org metadata visualization. I would appreciate any way of contributing: GitHub issue, email or pull request.
https://dvmorozov.github.io/arxiv/

Topic mining #109

Closed dvmorozov closed 1 year ago

dvmorozov commented 1 year ago

Solution

  1. Implement a script that collects the dictionary. Represent each document as a bag-of-words. Save the dictionary to a file. :heavy_check_mark:
  2. Implement an iterator class over the files in a directory. :heavy_check_mark:
  3. Implement the model and feed it through the iterator class. :heavy_check_mark:
  4. Save the corpus to a file (each text converted to a single line) for processing with META. :heavy_check_mark:
  5. Remove Greek letters from the list of special characters. :question:
  6. Output the topics as JSON. :heavy_check_mark:
  7. Add lemmatization. Add a reference to the main page. :heavy_check_mark:
  8. Make the encoding used for reading and writing files a script parameter. :heavy_check_mark:
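
Steps 1, 2, 4, 6 and 8 can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the script from the repository: the names `DirectoryCorpus`, `tokenize`, `build_dictionary`, `to_bag_of_words`, `save_single_line_corpus` and `save_json` are hypothetical, as are the `*.txt` file pattern and the tokenization rule. The model from step 3 is left out; with gensim, `corpora.Dictionary` and its `doc2bow` method would take the place of the hand-written dictionary and bag-of-words helpers.

```python
import json
import re
from collections import Counter
from pathlib import Path

# Hypothetical tokenization rule: runs of Unicode letters only
# (digits, underscores and punctuation are dropped).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


class DirectoryCorpus:
    """Iterate over the text files of a directory, one document at a time."""

    def __init__(self, directory, encoding="utf-8", pattern="*.txt"):
        self.directory = Path(directory)
        self.encoding = encoding  # the encoding is a parameter (step 8)
        self.pattern = pattern

    def __iter__(self):
        # Sorted order keeps document positions stable between passes.
        for path in sorted(self.directory.glob(self.pattern)):
            yield path.read_text(encoding=self.encoding)


def build_dictionary(corpus):
    """Map every distinct token to an integer id (step 1)."""
    vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(vocabulary)}


def to_bag_of_words(text, dictionary):
    """Represent a document as sorted (token_id, count) pairs."""
    counts = Counter(tok for tok in tokenize(text) if tok in dictionary)
    return sorted((dictionary[tok], n) for tok, n in counts.items())


def save_single_line_corpus(corpus, path, encoding="utf-8"):
    """Write every document as exactly one line (step 4)."""
    with open(path, "w", encoding=encoding) as out:
        for doc in corpus:
            # Collapse all whitespace, including newlines, to single spaces.
            out.write(" ".join(doc.split()) + "\n")


def save_json(obj, path, encoding="utf-8"):
    """Save the dictionary (or the topics, step 6) as JSON."""
    with open(path, "w", encoding=encoding) as out:
        json.dump(obj, out, ensure_ascii=False, indent=2)
```

Because `DirectoryCorpus` re-reads the files on every pass, the corpus never has to fit in memory: one pass can build the dictionary and a second can stream bag-of-words vectors to the model.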

Related

  1. 111.

  2. 112.

  3. 113.

References

https://www.qblocks.cloud/blog/best-nlp-libraries-python

https://pypi.org/project/gensim/ :heavy_check_mark:

https://radimrehurek.com/gensim/

https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html (model)

https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html (corpus iteration)

https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#pre-process-and-vectorize-the-documents (lemmatization)

https://www.nltk.org/ :heavy_check_mark: (install with `pip install --user -U nltk`, then recreate the virtual environment so that it inherits the installed packages)

https://github.com/clips/pattern

https://www.machinelearningplus.com/nlp/gensim-tutorial/

https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#stanfordcorenlplemmatization

https://stackoverflow.com/questions/8884188/how-to-read-and-write-ini-file-with-python3
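
This link may relate to step 8: one way to make the file encoding a script parameter is to keep it in an INI file. A minimal sketch using the standard `configparser` module follows. The function name `read_encoding`, the section name `files` and the key `encoding` are hypothetical and not taken from the repository.

```python
import configparser


def read_encoding(config_path, default="utf-8"):
    """Return the file encoding configured in an INI file.

    Falls back to `default` when the file, the section or the key is missing.
    """
    parser = configparser.ConfigParser()
    # read() silently skips files that do not exist.
    parser.read(config_path, encoding="utf-8")
    # Hypothetical layout: a [files] section with an "encoding" key.
    return parser.get("files", "encoding", fallback=default)
```

The returned value can then be passed as the `encoding` argument of `open()` wherever the script reads or writes files.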