Open mllife opened 1 week ago
@mllife Yes, we could add this as extra info. However, the tags are generally identified by Docling via visual models.
Yes, that ML model works, but sometimes the PDF has built-in tags.
I agree, it is a good idea to add. I know there are also "annotations" and metadata we could use in principle. I don't consider it the highest priority, but it would definitely be nice to have in the medium term (by end of year).
I can definitely say I will be the first one to test it and give you feedback on it. Thanks a ton in advance.
@mllife Could you please provide some examples where you know the tags are present? I would not know where to look for them.
Sorry, I can't share any of these files, but there is a way to create them using the "accessibility features" in Foxit PDF Pro (a trial is available for free): https://www.foxit.com/pdf-editor/advanced-editing/ (https://www.youtube.com/watch?v=Oub-mmPXASk). Table tagging is done automatically. I don't think it is always 100% correct, but it works on most PDFs, which should be sufficient to create some examples to test. Also, if you export any styled Word 2013+ file to PDF, it should be tagged automatically. Let me know if this is helpful.
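To confirm that a file exported this way really carries tags, one rough check is to look at the document catalog: a tagged PDF is expected to have a `/MarkInfo` dictionary with `/Marked` set to true, plus a `/StructTreeRoot` entry. The sketch below assumes the catalog has already been read into a plain mapping by whatever PDF library you use; `is_tagged` and the sample dictionaries are hypothetical names for illustration, not part of Docling.

```python
from typing import Any, Mapping


def is_tagged(catalog: Mapping[str, Any]) -> bool:
    """Return True if a document catalog looks like that of a tagged PDF.

    `catalog` is assumed to be a dict-like view of the PDF catalog
    (the /Root object), with keys written the way they appear in the file.
    """
    # /MarkInfo may be missing entirely; treat that as "not marked".
    mark_info = catalog.get("/MarkInfo") or {}
    marked = bool(mark_info.get("/Marked", False))
    # The structure tree itself hangs off /StructTreeRoot.
    has_tree = "/StructTreeRoot" in catalog
    return marked and has_tree


# Hypothetical catalogs, standing in for what a PDF library would return.
tagged_catalog = {"/MarkInfo": {"/Marked": True}, "/StructTreeRoot": {}}
untagged_catalog = {"/Pages": {}}
```

A check like this could also help when scanning a larger set of documents to find the ones that actually use tagging.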
Hello @PeterStaar-IBM, any update on this?
Yes, I looked into it, but I have found very few documents that use this; hence, it is not a priority right now.
I have been working with PDFs for some time, but I recently came across tagged PDFs and read that they have a data structure, StructTreeNode. I would like to know if you can add support for it, i.e. low-level handling in code for this case. My knowledge of tagged PDFs is limited, so I have a couple of questions:
Is it possible to dump it into an XML-like structure, so that it is easy for me to build a parser on top of it to extract tables and other important tagged structures? Can I get bounding boxes for these structures from the structTreeRoot itself, so that I can link them back to the source PDF page, as we can with PDF parsers? Goal: to convert PDFs to simple text or a JSON structure while utilizing the information from tagging. My intro to tagged PDFs was this: https://accessible-pdf.info/en/basics/general/overview-of-the-pdf-tags/#table-elements
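To make the XML question concrete, here is a minimal sketch of what such a dump could look like. It does not read a PDF: it assumes the structure tree has already been decoded into nested dictionaries, where `S` is the structure type, `K` holds the children (child elements or text), and `page` / `bbox` are optional fields. That dictionary layout and the `to_xml` helper are assumptions for illustration only, not an existing Docling API. As far as I understand, a bounding box is only stored on some elements (for example tables and figures); for other elements it would have to be derived from the marked content on the page.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict


def to_xml(node: Dict[str, Any]) -> ET.Element:
    """Convert a (hypothetical) decoded structure-tree node into an XML element."""
    # Standard structure types such as Table, TR, TD, P are valid XML names;
    # custom types might need sanitizing first.
    elem = ET.Element(node["S"])
    if "page" in node:
        elem.set("page", str(node["page"]))
    if "bbox" in node:
        # Store the four coordinates as a space-separated attribute.
        elem.set("bbox", " ".join(f"{v:g}" for v in node["bbox"]))
    for kid in node.get("K", []):
        if isinstance(kid, dict):
            elem.append(to_xml(kid))
        else:
            # Simplification: all text leaves are collected into elem.text.
            elem.text = (elem.text or "") + str(kid)
    return elem


# Hypothetical decoded tree for a one-row table.
tree = {
    "S": "Table",
    "page": 1,
    "bbox": [72, 500, 540, 700],
    "K": [
        {"S": "TR", "K": [
            {"S": "TH", "K": ["Name"]},
            {"S": "TD", "K": ["Value"]},
        ]},
    ],
}

xml_text = ET.tostring(to_xml(tree), encoding="unicode")
```

Once the tree is in this form, tables can be pulled out with ordinary XML tooling (for example `root.iter("Table")`), and the `page` / `bbox` attributes give the link back to the source page where they are available.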