[Project] Possible NLP applications for repository

MESH-Research / knowledge-commons-works

The next-generation research repository for the Knowledge Commons (formerly Humanities Commons)

https://hcommons.org

MIT License


[Project] Possible NLP applications for repository #220

Open monotasker opened 9 months ago

monotasker commented 9 months ago

get text from uploaded files (pdf, docx, pptx)
- research available python libraries for text extraction
prepare text
- clean
- separate languages
- lemmatize/stem
- POS tagging
- embeddings
- named entity recognition
kinds of analysis
- word collocations
- topic modelling (unsupervised? supervised?)
- network analysis
- sentiment analysis
how can subject headings on deposits be used as a kind of classification?
- FAST subject headings (LOC headings)
- has anyone developed tools for LOC headings?
how can other metadata be used for analysis?
how can we do auto summarization? free LLMs?
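As a starting point for the "prepare text" step, here is a minimal sketch of the cleaning part only, using nothing beyond the standard library. The function names `clean_text` and `tokenize` are hypothetical, and the rules (Unicode normalization, rejoining words split across line breaks, collapsing whitespace) are assumptions about what "clean" might involve for text extracted from PDFs. Lemmatizing, POS tagging, and NER would need a dedicated NLP library and are not shown.

```python
import re
import unicodedata


def clean_text(raw):
    """Normalize raw extracted text into a single tidy string (illustrative rules only)."""
    # NFKC folds compatibility characters, e.g. the "fi" ligature becomes "fi".
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break ("model-\nling" -> "modelling").
    # Caveat: this also merges genuine compounds that happen to break at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse every run of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())
```

For example, `tokenize(clean_text(raw))` gives a flat token list that later steps such as collocation counting could consume.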

cassandralem-msu commented 8 months ago

Another possible thing to do: create network graph to improve recommendation systems
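One way this could look, as a rough sketch rather than a proposal for the actual implementation: link deposits that share subject headings and recommend the most strongly linked neighbours. The data shape (a dict mapping deposit ids to sets of subject headings) and the names `build_deposit_graph` and `recommend` are hypothetical, and shared subjects are just one possible edge definition.

```python
from collections import defaultdict
from itertools import combinations


def build_deposit_graph(deposit_subjects):
    """Build a weighted graph: edge weight = number of subject headings two deposits share."""
    by_subject = defaultdict(set)
    for deposit, subjects in deposit_subjects.items():
        for subject in subjects:
            by_subject[subject].add(deposit)

    graph = defaultdict(lambda: defaultdict(int))
    for deposits in by_subject.values():
        # Every pair of deposits under the same heading gets one more unit of weight.
        for a, b in combinations(sorted(deposits), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def recommend(graph, deposit, k=3):
    """Return up to k neighbouring deposits, strongest link first (ties broken by id)."""
    neighbours = graph.get(deposit, {})
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:k]]
```

A graph library would make richer measures (centrality, communities) easier, but the plain-dict version keeps the idea visible.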

cassandralem-msu commented 8 months ago

- Python library for .pdf text extraction: pypdf
- Article describing 3 different Python libraries for .docx text extraction
- Python library for .pptx text extraction: python-pptx
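A minimal sketch of how these could be wired together, dispatching on file extension. pypdf and python-pptx are the libraries named above; python-docx is an assumption on my part, since the .docx article is not named here, and the function names and the `EXTRACTORS` table are hypothetical. The third-party imports sit inside each function so that a missing library only affects its own file type.

```python
from pathlib import Path


def extract_pdf(path):
    from pypdf import PdfReader  # third-party: pypdf

    reader = PdfReader(path)
    # extract_text() may yield nothing for image-only pages; those would need OCR.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_docx(path):
    from docx import Document  # third-party: python-docx (assumed choice)

    return "\n".join(paragraph.text for paragraph in Document(path).paragraphs)


def extract_pptx(path):
    from pptx import Presentation  # third-party: python-pptx

    chunks = []
    for slide in Presentation(path).slides:
        for shape in slide.shapes:
            # Only shapes with a text frame (titles, text boxes) carry text.
            if shape.has_text_frame:
                chunks.append(shape.text_frame.text)
    return "\n".join(chunks)


EXTRACTORS = {".pdf": extract_pdf, ".docx": extract_docx, ".pptx": extract_pptx}


def extract_text(path):
    """Pick an extractor by file extension; raise ValueError for unsupported types."""
    suffix = Path(path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
    return extractor(str(path))
```

Note that this only reads paragraph text from .docx and text frames from .pptx; tables, speaker notes, and embedded images are not covered.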

koutiany commented 8 months ago

I used this one before: https://pypi.org/project/pytesseract/

Amazing support for multiple languages: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
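For reference, a small sketch of multilingual OCR with pytesseract. It assumes the Tesseract binary and the relevant language data files are installed, and that Pillow is available for opening images. The helper names `tesseract_lang` and `ocr_image` are hypothetical; `image_to_string` and its `lang` argument are part of pytesseract.

```python
def tesseract_lang(codes):
    """Join Tesseract language codes into the "eng+fra" form the lang argument accepts."""
    return "+".join(codes)


def ocr_image(image_path, codes=("eng",)):
    """Run OCR on one image file using the given language codes."""
    import pytesseract  # third-party; needs the Tesseract binary on the system
    from PIL import Image  # third-party: Pillow

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=tesseract_lang(codes))
```

Scanned PDFs would first need their pages rendered to images before a call like `ocr_image("page1.png", codes=("eng", "fra"))` applies; that step is not shown.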

koutiany commented 8 months ago

embeddings.pdf

This is the document about embeddings. I mentioned collocations and embeddings as things to look at once we have clean corpora. Collocations focus on the frequent co-occurrence of words and their specific combinations, while word embeddings aim to represent words in a continuous semantic space to capture their meanings and relationships.
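To make the collocation side concrete, here is a small sketch that scores adjacent word pairs by pointwise mutual information (PMI), using only the standard library. PMI is one common collocation measure and is my choice for illustration, not something taken from the attached document. The function name `bigram_pmi` and the `min_count` threshold are hypothetical. Embeddings are not shown, since they need a trained model.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by PMI, keeping pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_pairs = max(n_tokens - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unstable PMI estimates
        p_pair = count / n_pairs
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        # PMI > 0 means the pair co-occurs more often than chance would predict.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

On a real corpus the threshold matters a lot: PMI tends to overrate very rare pairs, which is why low-count bigrams are filtered out here.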

koutiany commented 8 months ago

https://huggingface.co/docs/transformers/index This is where I get some of my knowledge from. :) Wish I could spend more time here.

koutiany commented 8 months ago

Python library summary for text extraction: see here