allenai / pawls

Software that makes labeling PDFs easy.
https://pawls.apps.allenai.org
Apache License 2.0

Split tokens #110

Open victor-ab opened 3 years ago

victor-ab commented 3 years ago

Sometimes we need to select specific parts of the text inside a token. I wanted to select only the amount in this case.

image

Adding the ability to split a token somehow would help a lot. PyMuPDF might help here as a preprocessor, as it can extract text at the character level.

DeNeutoy commented 3 years ago

Hello @victor-ab,

Yes, this is quite possible. All you would need to do is add another preprocessor that uses PyMuPDF:

https://github.com/allenai/pawls/tree/master/cli/pawls/preprocessors

The interface is not super complicated, and we would welcome a PR if you feel like contributing!
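
As a rough sketch of what such a preprocessor could look like: PyMuPDF's `page.get_text("rawdict")` returns blocks, lines, spans, and per-character entries, each with a `c` string and a `bbox`. The snippet below flattens that structure into one token per character. The function names (`chars_to_tokens`, `process_pymupdf`) are hypothetical, and the output shape (`text`, `x`, `y`, `width`, `height` per token, plus a `page` entry with `width`, `height`, `index`) is an assumption modeled on the existing preprocessors, not a confirmed interface.

```python
from typing import Any, Dict, List


def chars_to_tokens(raw_page: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a PyMuPDF "rawdict" page into one token per character."""
    tokens: List[Dict[str, Any]] = []
    for block in raw_page.get("blocks", []):
        # Image blocks have no "lines" key, so they contribute nothing.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                for char in span.get("chars", []):
                    if char["c"].isspace():
                        continue  # skip whitespace characters
                    x0, y0, x1, y1 = char["bbox"]
                    tokens.append(
                        {
                            "text": char["c"],
                            "x": x0,
                            "y": y0,
                            "width": x1 - x0,
                            "height": y1 - y0,
                        }
                    )
    return tokens


def process_pymupdf(pdf_path: str) -> List[Dict[str, Any]]:
    """Hypothetical entry point; requires `pip install pymupdf`."""
    import fitz  # PyMuPDF, imported lazily so chars_to_tokens works without it

    pages = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            pages.append(
                {
                    # Assumed page metadata layout, mirroring other preprocessors.
                    "page": {
                        "width": page.rect.width,
                        "height": page.rect.height,
                        "index": index,
                    },
                    "tokens": chars_to_tokens(page.get_text("rawdict")),
                }
            )
    return pages
```

Emitting one token per character makes every annotation more fine-grained, so in practice this would probably be offered as an optional preprocessor rather than a replacement for word-level tokens.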

victor-ab commented 3 years ago

@DeNeutoy the idea was to have that functionality as part of the UI, not only by splitting all of the JSON's tokens into single characters. That gives me another idea: implement a UI for editing the extracted text to fix any kind of OCR mistake.

But yes, adding PyMuPDF as a preprocessor is a step towards that. I can try to contribute that, but I can't help with the UI.
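
To illustrate the kind of split a UI action could perform on a single token, here is a minimal sketch. It assumes the same token shape as above (`text`, `x`, `y`, `width`, `height`) and, since a word-level token carries no per-character boxes, it approximates the split position by treating every character as equally wide. `split_token` is a hypothetical helper, not part of PAWLS.

```python
from typing import Dict, Tuple, Union

Token = Dict[str, Union[str, float]]


def split_token(token: Token, offset: int) -> Tuple[Token, Token]:
    """Split a token into two at a character offset (hypothetical helper)."""
    text = str(token["text"])
    if not 0 < offset < len(text):
        raise ValueError("offset must fall strictly inside the token text")

    # Assumption: all characters have equal width. This is only an
    # approximation for proportional fonts.
    total_width = float(token["width"])
    left_width = total_width * offset / len(text)

    left = {**token, "text": text[:offset], "width": left_width}
    right = {
        **token,
        "text": text[offset:],
        "x": float(token["x"]) + left_width,
        "width": total_width - left_width,
    }
    return left, right
```

With character-level boxes from a preprocessor like PyMuPDF, the proportional estimate could be replaced by the real character boundaries.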