Topic mining - Githubissues

dvmorozov / arxiv

ArxivExpress - arxiv.org client for Android and iOS, ArxivNavigator - interactive arxiv.org metadata visualization. I would appreciate any way of contributing: GitHub issue, email or pull request.

https://dvmorozov.github.io/arxiv/



Topic mining #109

Closed dvmorozov closed 1 year ago

dvmorozov commented 1 year ago

Solution

Implement a script that collects the dictionary. Represent each document as a "bag-of-words". Save the dictionary to a file. :heavy_check_mark:
Implement an iterator class over the files in a directory. :heavy_check_mark:
Implement the model and feed it through the iterator class. :heavy_check_mark:
Save the corpus to a file (each text converted into a single line) for processing with META. :heavy_check_mark:
Remove Greek letters from the list of special characters. :question:
Output the topics as JSON. :heavy_check_mark:
Add lemmatization. Add a reference to the main page. :heavy_check_mark:
Make the encoding used for reading and writing files a script parameter. :heavy_check_mark:
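The first steps of the list above could be sketched roughly as follows. This is not the code from the repository: it uses only the Python standard library instead of gensim so that it runs without dependencies, and every name in it (`DirectoryCorpus`, `build_dictionary`, `to_bow`, `write_line_corpus`, the `*.txt` file pattern, the file names) is hypothetical. The tokenizer keeps any Unicode letter, which is one possible reading of the Greek-letter item; the model and lemmatization steps are left out.

```python
import json
import re
import tempfile
from collections import Counter
from pathlib import Path

# Assumption: a token is a run of Unicode letters, so Greek letters are kept.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


class DirectoryCorpus:
    """Iterate over the *.txt files of a directory, one token list per file."""

    def __init__(self, directory, encoding="utf-8"):
        self.directory = Path(directory)
        self.encoding = encoding  # would come from a script parameter

    def __iter__(self):
        for path in sorted(self.directory.glob("*.txt")):
            yield tokenize(path.read_text(encoding=self.encoding))


def build_dictionary(corpus):
    """Map every distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in corpus:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def write_line_corpus(corpus, out_path, encoding="utf-8"):
    """Write the corpus with every document collapsed into a single line."""
    with open(out_path, "w", encoding=encoding) as out:
        for tokens in corpus:
            out.write(" ".join(tokens) + "\n")


# Small demonstration on two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    texts = Path(tmp, "texts")
    texts.mkdir()
    (texts / "a.txt").write_text("Dark matter\nhalo models", encoding="utf-8")
    (texts / "b.txt").write_text("Matter power spectrum α", encoding="utf-8")

    corpus = DirectoryCorpus(texts, encoding="utf-8")
    dictionary = build_dictionary(corpus)
    bows = [to_bow(tokens, dictionary) for tokens in corpus]

    # Save the dictionary as JSON and the corpus as one line per document.
    dict_path = Path(tmp, "dictionary.json")
    dict_path.write_text(json.dumps(dictionary, ensure_ascii=False), encoding="utf-8")
    write_line_corpus(corpus, Path(tmp, "corpus.txt"), encoding="utf-8")

    saved_dictionary = json.loads(dict_path.read_text(encoding="utf-8"))
    lines = Path(tmp, "corpus.txt").read_text(encoding="utf-8").splitlines()
```

Because `DirectoryCorpus` re-reads the files on every pass, it can be iterated several times (once for the dictionary, once for the vectors) without holding all the texts in memory, which is the same streaming idea the gensim corpus tutorial linked under References describes.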

Related

#111
#112
#113

References

https://www.qblocks.cloud/blog/best-nlp-libraries-python

https://pypi.org/project/gensim/ :heavy_check_mark:

https://radimrehurek.com/gensim/

https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html (model)

https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html (corpus iteration)

https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#pre-process-and-vectorize-the-documents (lemmatization)

https://www.nltk.org/ :heavy_check_mark: (pip install --user -U nltk; afterwards, recreate the virtual environment so that it inherits the installed packages)
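One way to recreate an environment that inherits packages is the standard-library `venv` module with `system_site_packages=True`. The sketch below assumes that this is what "inheriting packages" refers to; the directory name `env` is hypothetical, and a temporary directory is used only to keep the example self-contained.

```python
import tempfile
import venv
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp, "env")  # hypothetical location of the environment
    # clear=True wipes an existing environment before recreating it;
    # with_pip=False only keeps this demo quick, a real setup would usually want pip.
    venv.create(env_dir, system_site_packages=True, with_pip=False, clear=True)
    # The setting is recorded in the environment's pyvenv.cfg file.
    config = (env_dir / "pyvenv.cfg").read_text(encoding="utf-8")
```

The command-line equivalent is `python -m venv --system-site-packages <dir>`.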

https://github.com/clips/pattern

https://www.machinelearningplus.com/nlp/gensim-tutorial/

https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#stanfordcorenlplemmatization

https://stackoverflow.com/questions/8884188/how-to-read-and-write-ini-file-with-python3