bizres / core-team

businessresponsibility.ch is a project to strengthen transparency and democratic control over the human rights performance of Swiss companies.
https://www.businessresponsibility.ch/
MIT License

Implement a PDF text extraction solution with Python #25

Closed catilgan closed 2 years ago

dudic commented 3 years ago

I had a look at Tika. It would of course be cool if we could detect the language at the same time: https://tika.apache.org/2.1.0/detection.html

dev-ng commented 3 years ago

Language detection: Tika seems to be Java-based. Since we started with Python (the way to go for AI folks), I suggest having a look at the Python options: https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language

dev-ng commented 3 years ago

I started with pycld2, as it seems to be the fastest (Polyglot is built on it) and the simplest to start with. However, the module has not been updated for 2 years and installation was a bit tricky (I documented it in the readme.md).

There is also the pycld3 library, but its installation looks much more complex. It is also bigger and slower.

I didn't look into other options.
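
To illustrate what a language-detection step returns, here is a minimal standard-library sketch. It is not pycld2 or pycld3: it swaps in a naive stopword-counting heuristic, and the `STOPWORDS` sets and the `guess_language` name are assumptions made up for this example. Trained detectors will be far more reliable on real report text.

```python
import re
from collections import Counter

# Tiny illustrative stopword sets (assumption: real detectors such as pycld2
# use trained models, not hand-picked word lists like these).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "for", "with", "are"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "für", "von", "wir"},
    "fr": {"le", "la", "les", "et", "est", "des", "une", "pour", "dans", "nous"},
    "it": {"il", "di", "che", "è", "per", "una", "sono", "con", "non", "gli"},
}


def guess_language(text: str) -> str:
    """Return the code of the language whose stopwords occur most often.

    Returns "un" (unknown) when no stopword from any set is found.
    """
    # Alphabetic tokens only; digits and underscores are excluded.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    scores = {
        lang: sum(counts[word] for word in stopwords)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "un"
```

A call such as `guess_language(extracted_text)` could then decide which language-specific cleanup or model to apply to a document.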

catilgan commented 3 years ago

Yes, I would stay in the Python ecosystem for language detection as well. pycld2 indeed seems to be outdated. PyCLD3 would be a better choice for the future.

During my quick research, I bumped into these libraries / frameworks. In particular, Spacy and TextBlob seem to be promising...

But at this stage, you can just pick one library of your preference and implement a quick prototype. We can improve it or select a different one at later stages of the project, when we gain more insight into what we really need.

dudic commented 3 years ago

Hey guys, great progress! Your choices are understandable. Thanks for the documentation!

Seems like we're on a good path regarding text extraction.

What I wanted to add: is it possible to implement a REST API service so we could call the microservice from the no-code test platform?
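
As a rough idea of what such a REST endpoint could look like, here is a standard-library sketch. The `/extract` path, the JSON response shape, and the names `handle_extract`, `ExtractHandler` and `serve` are all assumptions for illustration; they do not describe the service that was eventually built, and the extractor is a stub where a real PDF library would be called.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def handle_extract(body: bytes, extractor: Callable[[bytes], str]) -> Tuple[int, Dict[str, str]]:
    """Map a raw request body to an (HTTP status, JSON payload) pair."""
    # PDF files start with the "%PDF-" header.
    if not body.startswith(b"%PDF-"):
        return 400, {"error": "body is not a PDF"}
    return 200, {"text": extractor(body)}


class ExtractHandler(BaseHTTPRequestHandler):
    # Hypothetical stub; replace with a call into a real PDF text extractor.
    extractor = staticmethod(lambda pdf_bytes: "")

    def do_POST(self):
        if self.path != "/extract":
            status, payload = 404, {"error": "unknown path"}
        else:
            length = int(self.headers.get("Content-Length", 0))
            status, payload = handle_extract(self.rfile.read(length), self.extractor)
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    """Run the sketch locally until interrupted."""
    HTTPServer(("127.0.0.1", port), ExtractHandler).serve_forever()
```

Keeping the request logic in a plain function (`handle_extract`) means it can be tested without starting a server; calling `serve()` exposes it so that an external platform can POST a PDF and receive JSON back.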

And: extracted text definitely needs to be cleaned up. I think it could be done with spaCy, among many others. How big an impact the pre-processing of the data has probably depends in some cases on the NLP model we are using, but I suspect stopword removal, punctuation removal, stemming and lemmatization should always be performed. Inspired by this: https://stackoverflow.com/questions/45605946/how-to-do-text-pre-processing-using-spacy
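
A minimal sketch of the cleanup steps named above, using only the standard library. It covers lowercasing, punctuation removal and stopword removal, plus rejoining words hyphenated across line breaks (a common artifact of PDF extraction). It does not stem or lemmatize; that would need a library such as spaCy. The `STOP_WORDS` list and the `clean_text` name are assumptions for this example.

```python
import re

# Tiny illustrative stopword list (assumption; NLP libraries ship full lists).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for", "on", "with"}


def clean_text(raw: str) -> list:
    """Return lowercase alphabetic tokens with stopwords removed."""
    # Rejoin words split by a hyphen at a line break: "respon-\nsibility".
    text = re.sub(r"-\s*\n\s*", "", raw)
    # Keep alphabetic tokens only; this drops punctuation and numbers.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]
```

Note that the hyphen rule also merges genuine compounds that happen to break at a line end, so it is a trade-off rather than a safe default.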

At any rate, I think this issue should be looked into before we deal with the question of how to process tables in a PDF.

Talk soon!

catilgan @.***> wrote on Mon, 4 Oct 2021, 21:51:

Yes, I would stay in the Python ecosystem for language detection as well. pycld2 indeed seems to be outdated. PyCLD3 would be a better choice for the future.

During my quick research, I bumped into these libraries / frameworks. In particular, Spacy and TextBlob seem to be promising...

  • TextBlob
  • Spacy
  • FastText
  • Polyglot
  • Langdetect
  • Langid
  • PyCLD3
  • Guesslang

But at this stage, you can just pick one library of your preference and implement a quick prototype. We can improve it or select a different one at later stages of the project, when we gain more insight into what we really need.


dudic commented 3 years ago

And just to clarify: language detection will probably be necessary, but text cleanup is even more fundamental. It therefore seems to me that this should be the issue to tackle next.

catilgan commented 2 years ago

Hi @dev-ng

Are there any updates concerning this issue? I have deployed a version of the text extraction tool to the Hidora cloud, which can be used.

The extracted texts and the uploaded pdfs are stored in NFS data storage.

https://extractor.bizres.ch/docs/index.html

Are there any updates from you and @JohannesHool?

Cheers, Cahit

dev-ng commented 2 years ago

Service "extract" is done. See PR 9.