MESH-Research / knowledge-commons-works

The next-generation research repository for the Knowledge Commons (formerly Humanities Commons)
https://hcommons.org
MIT License

[Project] Possible NLP applications for repository #220

Open monotasker opened 9 months ago

monotasker commented 9 months ago
cassandralem-msu commented 8 months ago

Another possible thing to do: create a network graph to improve the recommendation system.
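One way to read the network-graph idea, as a minimal sketch only: link two deposits whenever they share subject tags, then recommend a deposit's most strongly linked neighbours. The record IDs, the tags, and the function names `build_graph` and `recommend` are all hypothetical, and plain dictionaries stand in for a real graph library.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(records):
    """Return {record_id: {neighbour_id: number_of_shared_tags}}."""
    # Index records by tag so only records sharing a tag get compared.
    by_tag = defaultdict(set)
    for rec_id, tags in records.items():
        for tag in tags:
            by_tag[tag].add(rec_id)

    graph = defaultdict(lambda: defaultdict(int))
    for ids in by_tag.values():
        for a, b in combinations(sorted(ids), 2):
            # Undirected edge; the weight counts shared tags.
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def recommend(graph, rec_id, k=3):
    """Return up to k neighbours, strongest edge first (ties broken by ID)."""
    neighbours = graph.get(rec_id, {})
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [neighbour for neighbour, _ in ranked[:k]]


# Hypothetical sample data, not taken from the repository.
records = {
    "rec-1": {"nlp", "ocr"},
    "rec-2": {"nlp", "ocr", "embeddings"},
    "rec-3": {"embeddings"},
    "rec-4": {"history"},
}
graph = build_graph(records)
print(recommend(graph, "rec-2"))
```

Shared tags are just one possible edge definition; co-authorship, citations, or text similarity could be swapped in without changing `recommend`.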

cassandralem-msu commented 8 months ago

Python library for .pdf text extraction: pypdf

Article describing 3 different Python libraries for .docx text extraction

Python library for .pptx text extraction: python-pptx
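A rough sketch of how these could sit behind one entry point, dispatching on file extension. pypdf and python-pptx are the libraries named above; python-docx is my assumption for .docx, since the article covers three candidates. The function names (`extract_text`, `pick_extractor`, and so on) are hypothetical. Third-party imports are kept inside each function so that only the library for the file type actually encountered needs to be installed.

```python
from pathlib import Path


def extract_pdf(path):
    from pypdf import PdfReader  # third-party: pypdf

    reader = PdfReader(path)
    # extract_text() may yield nothing for image-only pages; OCR would be needed there.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_docx(path):
    from docx import Document  # third-party: python-docx (assumed choice)

    return "\n".join(paragraph.text for paragraph in Document(path).paragraphs)


def extract_pptx(path):
    from pptx import Presentation  # third-party: python-pptx

    chunks = []
    for slide in Presentation(path).slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                chunks.append(shape.text_frame.text)
    return "\n".join(chunks)


EXTRACTORS = {".pdf": extract_pdf, ".docx": extract_docx, ".pptx": extract_pptx}


def pick_extractor(path):
    """Choose an extractor from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None


def extract_text(path):
    return pick_extractor(path)(str(path))
```

Dispatching on extension is the simplest option; uploaded files with wrong or missing extensions would need content sniffing instead.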

koutiany commented 8 months ago

I have used this one before: https://pypi.org/project/pytesseract/

It has excellent multilanguage support: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
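For reference, a minimal sketch of multilanguage OCR with pytesseract. It assumes pytesseract and Pillow are installed, along with the Tesseract binary and the traineddata files for each requested language. The helper names `build_lang` and `ocr_image` are hypothetical; `image_to_string` and its `lang` argument come from pytesseract.

```python
def build_lang(codes):
    """Join Tesseract language codes with '+', the separator Tesseract expects."""
    return "+".join(codes)


def ocr_image(image_path, codes=("eng",)):
    # Third-party imports kept local so this module loads without them.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=build_lang(codes))


# Example call (needs a real image file and the eng + fra language data):
# text = ocr_image("scan.png", codes=("eng", "fra"))
```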

koutiany commented 8 months ago

embeddings.pdf

This document is about embeddings. I mentioned collocations and embeddings as options once we have clean corpora. Collocations capture words that frequently co-occur in specific combinations, while word embeddings represent words in a continuous semantic space to capture their meanings and relationships.
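To make the contrast concrete, here is a small sketch using only the standard library. Collocation strength is scored with pointwise mutual information (PMI) over adjacent word pairs, which is one common measure among several, and embedding similarity is shown as cosine similarity between vectors. The sample sentence and the two-dimensional vectors are made up for illustration; real embeddings would come from a trained model.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Score each adjacent word pair by PMI: log2(P(w1, w2) / (P(w1) * P(w2)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1, p_w2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Collocations: counted directly from the text.
tokens = "open access repository supports open access publishing and open data".split()
pmi = bigram_pmi(tokens)
print(pmi[("open", "access")])  # positive: the pair co-occurs more often than chance

# Embeddings: hand-made toy vectors standing in for a trained model's output.
toy_vectors = {
    "repository": [0.9, 0.1],
    "archive": [0.8, 0.2],
    "banana": [0.1, 0.9],
}
print(cosine(toy_vectors["repository"], toy_vectors["archive"]))
print(cosine(toy_vectors["repository"], toy_vectors["banana"]))
```

PMI on a corpus this small is noisy and favours rare pairs, so a minimum frequency cutoff is usually applied on real data.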

koutiany commented 8 months ago

https://huggingface.co/docs/transformers/index This is where I get some of my knowledge from. :) Wish I could spend more time here.

koutiany commented 8 months ago

Python library summary for text extraction: see here