UniversalDataTool / universal-data-tool

Collaborate & label any type of data, images, text, or documents, in an easy web interface or desktop app.
https://universaldatatool.com
MIT License

Feature: PDF Image Annotation #49

Open seveibar opened 4 years ago

seveibar commented 4 years ago

Support PDFs in Image Segmentation and Image Classification. Please thumbs up if you want it.

Ownmarc commented 4 years ago

I've worked with PDFs in the past and, imo, the best approach is to convert everything to JPG using a lib like pdf2image or something similar. That gives control over the DPI used for image creation, which should not be overlooked when it comes time to do inference. Most of the time, if your dataset is PDFs, you will probably do inference on PDFs too, so you need to integrate a PDF converter into your pipeline at some point anyway, and it's not hard to do.
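A minimal sketch of that conversion step, assuming the third-party `pdf2image` library (which in turn needs poppler installed). The `page_pixels` helper just illustrates why DPI matters: PDF geometry is specified in points, so the rendered pixel size scales directly with the DPI you pick.

```python
def pdf_to_images(pdf_path, dpi=300):
    """Render each page of a PDF to a PIL image at the given DPI."""
    # Third-party import kept local so the rest of the module works without it.
    from pdf2image import convert_from_path
    return convert_from_path(pdf_path, dpi=dpi)

def page_pixels(width_pt, height_pt, dpi):
    """PDF pages are measured in points (1 pt = 1/72 inch).

    The rendered image size in pixels scales with DPI, so the DPI used
    at annotation time should match the DPI used at inference time.
    """
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)
```

For example, a US Letter page (612 x 792 pt) renders to 2550 x 3300 px at 300 DPI but only 612 x 792 px at 72 DPI, which is why annotations made at one DPI won't line up with images rendered at another.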

If we want to make something that is not well supported by other annotation tools or libs, something like PDF2text with a 2D mapping between the raw text and the original PDF would be insane. This could then be used for NLP tasks, or for vision tasks to find the right zones to extract the information needed or to run OCR on targeted zones.

This is not easy to do. In Python I use pdfminer and PyPDF2 to extract text; pdfminer can return the coords of each letter/word, while PyPDF2 can't. Simple PDF decryption (e.g. password-protected files) is supported, but there is no support for online decryption requests (which require things like FileOpen).
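A rough sketch of that text-to-coordinates mapping, assuming the third-party `pdfminer.six` package: its layout analysis exposes per-character bounding boxes via `LTChar.bbox`, which can then be merged into word- or zone-level boxes for targeted OCR. The `union_bbox` helper is a hypothetical name added here for illustration.

```python
def char_boxes(pdf_path):
    """Return (page_number, character, bbox) tuples for every character.

    Bboxes are (x0, y0, x1, y1) in PDF points, with the origin at the
    bottom-left of the page.
    """
    # Third-party imports kept local so the helpers below work without them.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTChar

    mapping = []
    for page_no, layout in enumerate(extract_pages(pdf_path)):
        for element in layout:
            if isinstance(element, LTTextContainer):
                for line in element:
                    for obj in line:
                        if isinstance(obj, LTChar):
                            mapping.append((page_no, obj.get_text(), obj.bbox))
    return mapping

def union_bbox(boxes):
    """Merge several (x0, y0, x1, y1) boxes into one enclosing box,
    e.g. to turn a run of character boxes into a word box."""
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return (min(xs0), min(ys0), max(xs1), max(ys1))
```

With a mapping like this, an NER label on a span of raw text can be projected back onto the page as a rectangle, which is exactly the 2D link between text tasks and vision tasks described above.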

seveibar commented 4 years ago

Agreed that combining the NER/NLP text tasks with PDFs would be an amazing feature.

As far as PDF viewing goes, there are two solutions I think will work pretty well:

  1. Desktop Application transform that converts PDF to Image. Then use image segmentation normally.
  2. Support PDFs in the underlying react-image-annotate library by using an in-browser PDF renderer.

Approach (2) would work on the web and would provide a nicer end-user experience. Approach (1) is a bit easier. You could also implement both, since some may prefer (1) for building their model anyway.