DS4SD / docling-parse

Simple package to extract text with coordinates from programmatic PDFs
MIT License
24 stars, 7 forks

support for tagged pdfs? <StructTreeNode> #54

Open mllife opened 1 day ago

mllife commented 1 day ago

I have been working with PDFs for some time, but I recently came across tagged PDFs and read that they have a data structure, StructTreeNode. Could you add support for it, i.e. low-level handling of this structure? My knowledge of tagged PDFs is limited, so I have a couple of questions:

1. Is it possible to dump it into an XML-like structure, so that it is easy for me to build a parser on top of it to extract tables and other important tagged structures?
2. Can I get bounding boxes for these structures from the StructTreeRoot itself, so that I can link them back to the PDF page, as we can with PDF parsers?

Goal: convert PDFs to simple text or a JSON structure while using the information from the tagging.

My introduction to tagged PDFs was this: https://accessible-pdf.info/en/basics/general/overview-of-the-pdf-tags/#table-elements
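To make the first question concrete, here is a minimal sketch of what an XML dump of a structure tree could look like. It does not use docling-parse or any PDF library. It assumes the structure tree has already been read into plain Python dicts that keep the PDF key names (`/S` for the structure type, `/K` for the kids, `/MCID` for marked-content IDs), and that `/Pg` has already been resolved to a page number. The function name `struct_elem_to_xml` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def struct_elem_to_xml(node, parent=None):
    """Convert one structure element (a plain dict) into an XML element.

    Assumed input shape (an illustration, not a library API):
      "/S"  -> structure type such as "/Table", "/TR", "/TD", "/P"
      "/K"  -> an int MCID, a dict, or a list mixing both
      "/Pg" -> page number (already resolved from the page reference)
    """
    # Use the structure type, without its leading slash, as the XML tag.
    tag = str(node.get("/S", "/Unknown")).lstrip("/")
    elem = ET.Element(tag) if parent is None else ET.SubElement(parent, tag)
    if "/Pg" in node:
        elem.set("page", str(node["/Pg"]))

    kids = node.get("/K", [])
    if not isinstance(kids, list):
        kids = [kids]  # a single kid may be stored without an array

    mcids = []
    for kid in kids:
        if isinstance(kid, int):
            # A bare integer is a marked-content ID on this element's page.
            mcids.append(kid)
        elif isinstance(kid, dict) and kid.get("/Type") == "/MCR":
            # A marked-content reference dictionary wraps the MCID.
            mcids.append(kid["/MCID"])
        elif isinstance(kid, dict) and "/S" in kid:
            # A nested structure element: recurse.
            struct_elem_to_xml(kid, elem)
        # Other kid kinds (e.g. object references) are skipped in this sketch.
    if mcids:
        elem.set("mcids", " ".join(str(m) for m in mcids))
    return elem


if __name__ == "__main__":
    # Hypothetical one-row table with two cells on page 1.
    sample = {
        "/S": "/Table",
        "/K": [{
            "/S": "/TR",
            "/K": [
                {"/S": "/TD", "/Pg": 1, "/K": 0},
                {"/S": "/TD", "/Pg": 1, "/K": [1, {"/Type": "/MCR", "/MCID": 2}]},
            ],
        }],
    }
    print(ET.tostring(struct_elem_to_xml(sample), encoding="unicode"))
```

The `mcids` attribute only records which marked-content sequences belong to an element. The text and coordinates still have to be looked up in the page content stream by a parser.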

PeterStaar-IBM commented 1 day ago

@mllife Yes, we could add this as extra information. However, docling generally identifies these tags via visual models.

mllife commented 1 day ago

Yes, the ML model works, but some PDFs have built-in tags that come directly from the vendors/distributors and are always accurate, and there is no way to use them programmatically (as far as I know). If you add this feature, this will be the only library that does it. "pdfalyzer" (https://github.com/michelcrypt4d4mus/pdfalyzer) is one tool for analysing PDFs, but it does not allow reading from the structure and dumping it into a usable format, e.g. tables to CSV or bounding boxes taken from the tags themselves. docling-parse is a unique project because you are building it from scratch, so I have some hope.
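On getting bounding boxes from the tags themselves: as far as I understand, the structure tree mostly points at page content through marked-content IDs (MCIDs) rather than storing coordinates, so the boxes would have to come from the parser that reads the content stream. The sketch below assumes such a parser has already produced a lookup from `(page, mcid)` to a box `(x0, y0, x1, y1)`, and that structure elements are plain dicts keeping the PDF key names with `/Pg` resolved to a page number. Both the lookup and the function names are hypothetical, not part of docling-parse.

```python
def collect_mcids(node):
    """Yield (page, mcid) pairs for a structure element and its descendants."""
    page = node.get("/Pg")  # page of this element's own content items
    kids = node.get("/K", [])
    if not isinstance(kids, list):
        kids = [kids]
    for kid in kids:
        if isinstance(kid, int):
            # Bare MCID: it lives on this element's page.
            yield (page, kid)
        elif isinstance(kid, dict) and kid.get("/Type") == "/MCR":
            # A marked-content reference may override the page.
            yield (kid.get("/Pg", page), kid["/MCID"])
        elif isinstance(kid, dict) and "/S" in kid:
            # Nested structure element: recurse.
            yield from collect_mcids(kid)


def element_bboxes(node, mcid_boxes):
    """Return {page: (x0, y0, x1, y1)}, the union of all boxes per page.

    mcid_boxes is the assumed parser output: {(page, mcid): (x0, y0, x1, y1)}.
    MCIDs missing from the lookup are ignored.
    """
    result = {}
    for page, mcid in collect_mcids(node):
        box = mcid_boxes.get((page, mcid))
        if box is None:
            continue
        if page in result:
            old = result[page]
            result[page] = (
                min(old[0], box[0]), min(old[1], box[1]),
                max(old[2], box[2]), max(old[3], box[3]),
            )
        else:
            result[page] = tuple(box)
    return result


if __name__ == "__main__":
    # Hypothetical table row with two cells on page 1.
    table = {
        "/S": "/Table",
        "/K": [{
            "/S": "/TR",
            "/K": [
                {"/S": "/TD", "/Pg": 1, "/K": 0},
                {"/S": "/TD", "/Pg": 1, "/K": 1},
            ],
        }],
    }
    boxes = {(1, 0): (10, 10, 50, 20), (1, 1): (60, 10, 100, 25)}
    print(element_bboxes(table, boxes))
```

Returning one box per page keeps the sketch usable for elements, such as long tables, that continue across a page break.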

PeterStaar-IBM commented 1 day ago

I agree, it is a good idea to add. I know there are also "annotations" and metadata we could use in principle. I don't consider it the highest priority, but it would definitely be nice to have in the medium term (by end of year).

mllife commented 1 day ago

I can definitely say I will be the first one to test it and give you feedback on it. Thanks a ton in advance.

PeterStaar-IBM commented 1 day ago

@mllife Could you then please provide some example files where you know the tags are present? I would not know where to look for them.

mllife commented 21 hours ago

Sorry, I can't share any of these files, but there is a way to create them using the "accessibility features" in Foxit PDF Pro (a free trial is available): https://www.foxit.com/pdf-editor/advanced-editing/ (https://www.youtube.com/watch?v=Oub-mmPXASk). Table tagging is done automatically. I don't think it is always 100% correct, but it works on most PDFs, which should be sufficient to create some examples to test. Also, if you export any styled Word 2013+ file to PDF, it should be tagged automatically. Let me know if this is helpful.