Open seveibar opened 4 years ago
I've worked with PDFs in the past and, imo, the best approach is to convert everything to JPG using a lib like pdf2image or something similar. That gives you control over the DPI used to create the images, which should not be overlooked when it comes time to do inference. Most of the time, if your dataset is PDFs, you will probably run inference on PDFs too, so you need a PDF converter in your pipeline at some point anyway, and it's not hard to do.
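For illustration, a minimal sketch of that conversion step, assuming pdf2image (and the poppler binaries it wraps) is installed. The helper names `pdf_to_jpgs` and `expected_pixel_size`, the 200 DPI default, and the output naming scheme are my own placeholders, not anything this project defines:

```python
from pathlib import Path

POINTS_PER_INCH = 72  # PDF page sizes are expressed in points (1/72 inch)


def expected_pixel_size(width_pt, height_pt, dpi):
    """Approximate raster size of a page rendered at the given DPI."""
    return (round(width_pt * dpi / POINTS_PER_INCH),
            round(height_pt * dpi / POINTS_PER_INCH))


def pdf_to_jpgs(pdf_path, out_dir, dpi=200):
    """Render every page of a PDF to a JPG file and return the file paths."""
    # Imported lazily so the size helper above works without pdf2image.
    from pdf2image import convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    # convert_from_path returns one PIL image per page.
    for i, page in enumerate(convert_from_path(str(pdf_path), dpi=dpi), start=1):
        target = out_dir / f"{Path(pdf_path).stem}_p{i:04d}.jpg"
        page.save(target, "JPEG")
        paths.append(target)
    return paths
```

Using the same `dpi` value for training data and for inference keeps the pixel scale of the pages consistent, which is the main reason to fix it explicitly instead of relying on a default.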
If we want to build something that is not well supported by other annotation tools or libs, something like PDF-to-text with a 2D mapping between the raw text and its position in the original PDF would be insane. That mapping could then be used in NLP or vision tasks to locate the zones that hold the information you need, or to run OCR only on targeted zones.
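To make the idea of a text-to-position mapping concrete, here is a rough sketch of one possible data structure. Everything in it (the `TextMap` class, its fields, the `(page, x0, y0, x1, y1)` box layout) is hypothetical and only meant to show the shape of the problem, not a proposed design for this tool:

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class TextMap:
    """Raw text plus, for each word, the page box it came from."""
    text: str = ""
    starts: list = field(default_factory=list)  # start offset of each word
    ends: list = field(default_factory=list)    # end offset (exclusive)
    boxes: list = field(default_factory=list)   # (page, x0, y0, x1, y1)

    def add_word(self, word, page, bbox):
        # Words are joined with a single space in the raw text.
        if self.text:
            self.text += " "
        self.starts.append(len(self.text))
        self.text += word
        self.ends.append(len(self.text))
        self.boxes.append((page, *bbox))

    def box_at(self, offset):
        """Return the box of the word covering a character offset, or None."""
        i = bisect_right(self.starts, offset) - 1
        if i < 0 or offset >= self.ends[i]:
            return None  # offset falls on a separator or outside the text
        return self.boxes[i]


tm = TextMap()
tm.add_word("Invoice", 0, (72, 700, 120, 712))
tm.add_word("1234", 0, (125, 700, 150, 712))
print(tm.text)       # Invoice 1234
print(tm.box_at(8))  # (0, 125, 700, 150, 712)
```

With something like this, a span predicted by an NER model on `text` can be turned back into page regions by looking up the boxes for its start and end offsets.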
This is not easy to do. In Python I use pdfminer and PyPDF2 to extract text. pdfminer can return the coordinates of each letter or word, while PyPDF2 can't. Simple PDF decryption, such as a password, is supported, but there is no support for decryption that requires an online request (things like FileOpen).
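As a rough sketch of the coordinate extraction, assuming the pdfminer.six package: `extract_pages` yields layout pages whose text lines contain `LTChar` objects with a `bbox` in PDF points, measured from the bottom-left corner. The helper names `iter_char_boxes` and `to_pixel_box` are my own, and the conversion assumes an unrotated page whose origin is at (0, 0):

```python
POINTS_PER_INCH = 72


def to_pixel_box(bbox, page_height_pt, dpi):
    """Map a PDF box (points, bottom-left origin) to image pixels (top-left origin)."""
    x0, y0, x1, y1 = bbox
    scale = dpi / POINTS_PER_INCH
    # Flip the y axis: the top edge of the box is y1 in PDF space.
    return (round(x0 * scale), round((page_height_pt - y1) * scale),
            round(x1 * scale), round((page_height_pt - y0) * scale))


def iter_char_boxes(pdf_path, password=""):
    """Yield (page_index, page_height, character, bbox) for each character."""
    # Imported lazily so to_pixel_box works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_index, page in enumerate(extract_pages(pdf_path, password=password)):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, lines, images
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        yield page_index, page.height, obj.get_text(), obj.bbox
```

Passing each `bbox` through `to_pixel_box` with the DPI used for rendering should line the characters up with the rasterized page, though cropped or rotated pages would need extra handling that this sketch leaves out.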
Agreed that combining the NER/NLP text tasks with PDFs would be an amazing feature.
As far as PDF viewing goes, there are two solutions I think will be pretty good
Approach (2) would work on web and would provide a nicer end user experience. Approach (1) is a bit easier. You could also implement both of these because some may prefer (1) for building their model anyway.
Support PDFs in Image Segmentation and Image Classification. Please thumbs up if you want it.