catilgan closed this issue 2 years ago
Language detection: Tika seems to be Java stuff. Since we started with Python (the way to go for AI folks), I suggest having a look at the Python options: https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language
I started with pycld2 as it seems to be the fastest (Polyglot is built on it) and the simplest to start with. However, the module hasn't been updated in 2 years and installation was a bit tricky (I documented it in the readme.md).
There is also the pycld3 library, but installation looks much more complex, the package is bigger, and it's slower.
I didn't look into other options.
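For reference, here's a minimal sketch of what the pycld2 call looks like. The sample sentence and the `"un"` fallback are my own illustration, and the import is guarded since installation can be tricky (see the readme.md notes):

```python
# Minimal language-detection sketch with pycld2.
try:
    import pycld2 as cld2
except ImportError:  # pycld2 can be tricky to install (see readme.md)
    cld2 = None

def detect_language(text):
    """Return the ISO code of the most likely language, or "un" if unknown."""
    if cld2 is None:
        return "un"
    is_reliable, bytes_found, details = cld2.detect(text)
    # details is a tuple of (language_name, language_code, percent, score)
    name, code, percent, score = details[0]
    return code if is_reliable else "un"

print(detect_language("Der schnelle braune Fuchs springt über den faulen Hund."))
```

For a German sentence like the one above, a reliable detection should return `de`.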
Yes, I would stay in the Python ecosystem for language detection as well. pycld2 indeed seems to be outdated. PyCLD3 would be a better choice for the future.
During my quick research, I bumped into these libraries/frameworks. In particular, spaCy and TextBlob seem promising...
But at this stage, you can just pick one library of your preference and implement a quick prototype. We can improve it or select a different one at later stages of the project, once we gain more insight into what we really need.
Hey guys, great progress! Your choices are understandable. Thanks for the documentation!
Seems like we're on a good path regarding text extraction.
What I wanted to add: is it possible to implement a REST API service, so we could call the microservice from the no-code test platform?
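Just to make the idea concrete, here's a rough sketch of such a wrapper using only the Python standard library. The `/extract` endpoint name and the JSON payload shape are assumptions on my part, not part of the current service:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ExtractHandler(BaseHTTPRequestHandler):
    """Hypothetical REST wrapper: POST /extract with a JSON body {"text": "..."}."""

    def do_POST(self):
        if self.path != "/extract":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        text = payload.get("text", "")
        # ...here the real service would run extraction / language detection...
        body = json.dumps({"characters": len(text)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep request logging quiet

def run(port=8000):
    """Serve the hypothetical endpoint; port is just a placeholder."""
    HTTPServer(("", port), ExtractHandler).serve_forever()
```

In practice we'd probably use Flask or FastAPI instead of raw `http.server`, but the shape of the interface would be the same.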
And: the extracted text definitely needs to be cleaned up. I think it could be done with spaCy, among many others. I suppose how big an impact the pre-processing has might depend on which NLP model we end up using, but I suspect stopword removal, punctuation removal, stemming and lemmatization should always be performed. Inspired by this: https://stackoverflow.com/questions/45605946/how-to-do-text-pre-processing-using-spacy
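A rough sketch of that cleanup step (stopword and punctuation removal plus lemmatization), assuming spaCy with the small English model; note spaCy only lemmatizes, it doesn't stem, and the model choice here is just an assumption:

```python
# Cleanup sketch with spaCy (assumes `pip install spacy` and
# `python -m spacy download en_core_web_sm`).
try:
    import spacy
    nlp = spacy.load("en_core_web_sm")
except (ImportError, OSError):  # spaCy or the model may not be available
    nlp = None

def preprocess(text):
    """Drop stopwords, punctuation and whitespace; return lowercased lemmas."""
    if nlp is None:
        return text.lower().split()  # crude fallback, no real cleanup
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if not (tok.is_stop or tok.is_punct or tok.is_space)]
```

For example, `preprocess("The cats are running!")` would reduce to the lemmas of the content words only.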
At any rate, I think this issue should be looked into before we deal with the question of how to process tables in a PDF.
Talk soon!
catilgan wrote on Mon, Oct 4, 2021, 21:51:
- TextBlob
- Spacy
- FastText
- Polyglot
- Langdetect
- Langid
- PyCLD3
- Guesslang
And just to clarify: language detection will probably be necessary, but text cleanup is even more fundamental. It seems to me, therefore, that this should be the issue to tackle next.
Hi @dev-ng
Are there any updates concerning this issue? I have deployed a version of the text extraction tool to the Hidora cloud, which can be used.
The extracted texts and the uploaded pdfs are stored in NFS data storage.
Are there any updates from you and @JohannesHool?
Cheers, Cahit
Service "extract" is done. See PR 9.
I had a look at Tika; it would of course be cool if we could detect the language at the same time... https://tika.apache.org/2.1.0/detection.html