Open pizthewiz opened 10 years ago
Hey Jean-Pierre,
You're right. Is there anything specific you want to do? The code is not 100% complete, but I've used it for a few tasks. Basically it's a very low-level PDF library, so you have to understand something about the PDF format itself and work directly on the objects / streams within the PDF. I am not planning on building anything higher level (like a PDF viewer).
You're right though, I think I have a few small utilities lying around, I can try to turn them into examples.
In my particular case, I was hoping to extract the text out of a PDF that has tabular content. I'm thinking if I read the PDF spec, the PDFReader code might make more sense, so I might start there.
The spec is quite a hard read. PDF has had a lot of changes over the years, so for me it took some time to understand all of the different features / mechanisms.
The thing about text in PDF is that it basically comes down to a PostScript-like language which is just a big list of draw commands. If you want to extract text, you need to run / process these commands and try to work out where the text is drawn visually on the page. This is why pretty much all PDF readers have trouble with selecting text.
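To make that concrete, here is a rough sketch in Python (not omgpdf code, and not how omgpdf works) of what "processing the draw commands" means. It assumes you already have a decoded page content stream as a string, and it only understands four operators from the PDF spec: `BT`, `Td`, `Tj` and `ET`. Everything else (fonts, encodings, `Tm`, `TJ`, glyph widths, the graphics state) is ignored, so real files will need a lot more. The function names and the sample stream are made up for illustration.

```python
import re

# Tokens: literal strings "(...)", names "/F1", numbers, operator keywords.
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)|/[^\s()/\[\]<>]*|[-+]?\d*\.?\d+|[A-Za-z'\"*]+"
)

def extract_text_positions(content):
    """Walk a simplified content stream and return (x, y, text) tuples."""
    operands = []   # operands come before their operator (postfix order)
    x = y = 0.0     # start of the current text line
    placed = []
    for tok in TOKEN.findall(content):
        if tok.startswith("("):
            operands.append(tok[1:-1])      # string operand, escapes not decoded
        elif tok.startswith("/"):
            operands.append(tok)            # name operand, e.g. a font
        elif tok[0] in "+-.0123456789":
            operands.append(float(tok))
        else:
            # Anything else is treated as an operator.
            if tok == "BT":                 # begin text object: reset position
                x = y = 0.0
            elif tok == "Td" and len(operands) >= 2:
                x += operands[-2]           # move relative to the line start
                y += operands[-1]
            elif tok == "Tj" and operands:
                placed.append((x, y, operands[-1]))
            operands = []                   # every operator consumes its operands
    return placed

def group_rows(placed):
    """Group text by y coordinate (top of page first), sorted left to right."""
    rows = {}
    for x, y, text in placed:
        rows.setdefault(y, []).append((x, text))
    return [[t for _, t in sorted(cells)]
            for _, cells in sorted(rows.items(), reverse=True)]

# Hypothetical stream for a two-column, two-row table.
sample = ("BT /F1 12 Tf 72 700 Td (Name) Tj 200 0 Td (Qty) Tj "
          "-200 -14 Td (Widget) Tj 200 0 Td (3) Tj ET")
table = group_rows(extract_text_positions(sample))
```

For the sample above, `table` ends up as two rows of two cells each. With real documents the y values rarely line up exactly, so grouping into rows usually needs a tolerance rather than exact matching.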
Wow, pulling text out of a PDF sounds like quite an adventure, and tightly coupled to the PDF file structure as well. Would omgpdf be a good match for this excursion?
omgpdf at the moment doesn't help with processing the drawing commands. I started omgps to work on this, but the drawing system in PDF is also different from PS...
I would say that, depending on how you want to do it, a lot of the work is already done, but it's not to the point where it's a few clicks away.
In giving the project a quick look, I'm not entirely sure if the reader and writer are ready for use. If so, a couple of simple examples would be a great help to understand the current capabilities and usage.