Open monotasker opened 9 months ago
Another possible thing to do: create a network graph to improve recommendation systems
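To make the network graph idea concrete, here is a minimal sketch using only the standard library. It assumes we have lists of documents that were used together (the `sessions` data and the `doc_*` names are hypothetical), builds a weighted co-occurrence graph, and recommends the strongest neighbours of a given document. This is one possible approach, not a settled design.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(sessions):
    """Build a weighted undirected graph.

    The weight of edge (a, b) is the number of sessions containing both items.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for items in sessions:
        # set() so a repeated item in one session is counted once
        for a, b in combinations(sorted(set(items)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def recommend(graph, item, k=3):
    """Return up to k neighbours of `item`, strongest edge first."""
    neighbours = graph.get(item, {})
    # Sort by descending weight, then by name so ties are deterministic
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical data: each inner list is one user's reading session
sessions = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_a", "doc_b"],
    ["doc_b", "doc_d"],
]
graph = build_graph(sessions)
print(recommend(graph, "doc_a"))
```

A dedicated graph library would add things like centrality measures or community detection, but a plain dictionary is enough to try the idea out.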
Python library for .pdf text extraction: pypdf
Article describing 3 different Python libraries for .docx text extraction
Python library for .pptx text extraction: python-pptx
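A rough sketch of how the .pdf and .pptx extractors could sit behind one function, assuming `pypdf` and `python-pptx` are installed. The function names and the `EXTRACTORS` table are made up for illustration. .docx is left out because the choice of library from the article has not been made yet.

```python
from pathlib import Path


def extract_pdf(path):
    """Extract text from every page of a PDF with pypdf."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    # Guard against pages that yield no text (e.g. scanned images)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_pptx(path):
    """Extract text from every text-bearing shape with python-pptx."""
    from pptx import Presentation  # third-party: pip install python-pptx

    presentation = Presentation(path)
    chunks = []
    for slide in presentation.slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                chunks.append(shape.text_frame.text)
    return "\n".join(chunks)


# Hypothetical dispatch table keyed by lowercase file extension
EXTRACTORS = {".pdf": extract_pdf, ".pptx": extract_pptx}


def extract_text(path):
    """Pick an extractor from the file extension and return the text."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"no extractor registered for {suffix!r}")
    return EXTRACTORS[suffix](str(path))
```

The imports are done inside each function so that a missing library only matters when that file type is actually processed.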
I have used this one before: https://pypi.org/project/pytesseract/
It has excellent multi-language support: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
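For reference, a small sketch of multi-language OCR with pytesseract. It assumes Pillow, pytesseract, and the Tesseract binary are installed, along with the trained data file for each language code. The helper names are hypothetical.

```python
def tesseract_lang(codes):
    """Join Tesseract language codes with '+', e.g. ['eng', 'deu'] -> 'eng+deu'."""
    return "+".join(codes)


def ocr_image(path, codes=("eng",)):
    """Run OCR on an image file using the given language codes."""
    from PIL import Image  # third-party: pip install pillow
    import pytesseract  # third-party: pip install pytesseract

    return pytesseract.image_to_string(Image.open(path), lang=tesseract_lang(codes))
```

Usage would look like `ocr_image("scan.png", codes=("eng", "deu"))`, where `scan.png` stands in for any scanned page.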
The document about embeddings. I mentioned collocations and embeddings for when we have clean corpora. Collocations capture words that frequently co-occur in specific combinations, while word embeddings represent words in a continuous semantic space to capture their meanings and relationships.
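A toy illustration of the difference, using only the standard library and a made-up four-sentence corpus. The first half scores adjacent word pairs with pointwise mutual information (one common collocation measure). The second half compares words through count-based context vectors, which are only a simple stand-in for learned embeddings such as those produced by transformer models.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already cleaned and lowercased
corpus = [
    "strong tea with milk",
    "strong coffee with milk",
    "strong tea and strong coffee",
    "weak tea with sugar",
]
sentences = [line.split() for line in corpus]

# --- Collocations: which adjacent pairs occur more often than chance? ---
unigrams = Counter(word for s in sentences for word in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
n_unigrams = sum(unigrams.values())
n_bigrams = sum(bigrams.values())


def pmi(w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2)."""
    p_pair = bigrams[(w1, w2)] / n_bigrams
    if p_pair == 0:
        return float("-inf")  # pair never observed
    p_w1 = unigrams[w1] / n_unigrams
    p_w2 = unigrams[w2] / n_unigrams
    return math.log2(p_pair / (p_w1 * p_w2))


# --- Vector view: words are similar if they share contexts ---
def context_vector(word):
    """Count the other words appearing in the same sentences as `word`."""
    vector = Counter()
    for s in sentences:
        if word in s:
            vector.update(w for w in s if w != word)
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print("PMI(with, milk):", pmi("with", "milk"))
print("sim(tea, coffee):", cosine(context_vector("tea"), context_vector("coffee")))
print("sim(tea, sugar):", cosine(context_vector("tea"), context_vector("sugar")))
```

On this corpus "tea" and "coffee" never appear next to each other, so they are not a collocation, yet their context vectors are close because they appear in similar sentences. That is the relationship the embedding view is meant to capture.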
https://huggingface.co/docs/transformers/index This is where I get some of my knowledge from. :) Wish I could spend more time here.