Open victor-ab opened 3 years ago
Hello @victor-ab,
Yes, this would be quite possible - all you would need to do would be to add another preprocessor which uses PyMuPDF:
https://github.com/allenai/pawls/tree/master/cli/pawls/preprocessors
The interface is not super complicated, we would welcome a PR if you feel like contributing!
@DeNeutoy the idea was to have that functionality as part of the UI, not only by splitting every token in the JSON into single characters. That gives me another idea: implement a UI for editing the extracted text, so that OCR mistakes can be fixed.
But yes, adding PyMuPDF as a preprocessor is a step towards that. I can try to contribute that part, but I can't help with the UI.
Sometimes we need to select specific parts of the text inside a token. I wanted to select only the amount in this case.
Adding the ability to split it somehow would help a lot. PyMuPDF might help here as a preprocessor, as it can extract the text at the character level.
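For illustration, here is a rough sketch of what character-level extraction could look like. It flattens the dictionary returned by PyMuPDF's `page.get_text("rawdict")` into one token per character. The function name `chars_to_tokens` is hypothetical, and the output keys (`text`, `x`, `y`, `width`, `height`) are an assumption about the token shape the existing preprocessors emit, so treat this as a starting point rather than a working PAWLS preprocessor.

```python
from typing import Any, Dict, List


def chars_to_tokens(raw_page: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a PyMuPDF "rawdict" page into one token per character.

    The output keys (text, x, y, width, height) are an assumed token
    shape, not a confirmed PAWLS schema.
    """
    tokens: List[Dict[str, Any]] = []
    for block in raw_page.get("blocks", []):
        # Image blocks have no "lines" key, so they contribute nothing.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                for char in span.get("chars", []):
                    if char["c"].isspace():
                        continue  # skip whitespace-only characters
                    x0, y0, x1, y1 = char["bbox"]
                    tokens.append(
                        {
                            "text": char["c"],
                            "x": x0,
                            "y": y0,
                            "width": x1 - x0,
                            "height": y1 - y0,
                        }
                    )
    return tokens


# Possible usage, assuming PyMuPDF is installed and "example.pdf" exists:
#
#   import fitz  # PyMuPDF
#   with fitz.open("example.pdf") as doc:
#       tokens = chars_to_tokens(doc[0].get_text("rawdict"))
```

With one token per character, selecting only the amount inside a longer token becomes a matter of selecting a run of adjacent character tokens. The trade-off is that every page yields far more tokens, which is presumably why doing the split in the UI was suggested as well.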