Open punjabdhaputar opened 6 months ago
Hi @punjabdhaputar - could you describe the use case you have in mind for this feature? And do I understand correctly that your proposed solution would output a new PDF rather than a list of Element objects?
Hello @MthwRobinson!
Actually I am thinking about another optional argument to the "partition" function like the following:
from unstructured.partition.auto import partition
elements = partition("my_pdf.pdf", path_for_ocr_pdf="ocr_pdf.pdf")
Where the partition function would write out a new PDF with the hidden text OCR layer to "ocr_pdf.pdf".
The use-case I have is to be able to view the PDF with the text layer and be able to highlight specific text (e.g. a small phrase, subset of the previous chunks generated).
Thanks @punjabdhaputar! Definitely see the use case there. Writing to PDF is outside the scope of what we'd like to do within the partition functions themselves. If you wanted to contribute an elements_to_pdf function, similar to elements_to_json, though, we'd be happy to consider that, as long as it doesn't introduce new dependencies.
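For anyone picking this up, here is a rough, standard-library-only sketch of what the "hidden text layer" part could look like. It is not unstructured's API. TextBox and boxes_to_pdf are hypothetical names invented for illustration, and the coordinates are assumed to already be in PDF points with the origin at the bottom-left. It relies on PDF text rendering mode 3 (the "3 Tr" operator), which draws text invisibly while keeping it selectable and searchable. It writes the text onto a blank single page only. Overlaying the layer on the original scanned PDF would need a PDF-merging step that is not shown here.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical stand-in for an OCR'd element: text plus a position in PDF points."""
    text: str
    x: float
    y: float
    size: float = 10.0


def _escape(text: str) -> str:
    # Backslash and parentheses are special inside a PDF literal string.
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def hidden_text_stream(boxes) -> bytes:
    """Build a page content stream that places every box as invisible text."""
    parts = ["BT", "3 Tr"]  # text render mode 3: neither fill nor stroke (invisible)
    for box in boxes:
        parts.append(f"/F1 {box.size:g} Tf")                # select font and size
        parts.append(f"1 0 0 1 {box.x:g} {box.y:g} Tm")     # absolute text position
        parts.append(f"({_escape(box.text)}) Tj")           # show the string
    parts.append("ET")
    # Assumption: Latin-1 text only; other characters are replaced with "?".
    return "\n".join(parts).encode("latin-1", errors="replace")


def boxes_to_pdf(boxes, width=612, height=792) -> bytes:
    """Serialize a minimal one-page PDF whose only content is the hidden text."""
    stream = hidden_text_stream(boxes)
    page = (
        f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {width} {height}] "
        "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>"
    ).encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        page,
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, needed by the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # mandatory free entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


if __name__ == "__main__":
    with open("hidden_text.pdf", "wb") as fh:
        fh.write(boxes_to_pdf([TextBox("Hello OCR layer", 72, 720)]))
```

Opening the resulting file in a viewer should show a blank page on which the text can still be selected and found with search. A real elements_to_pdf would also have to map each element's coordinates into the page's coordinate system and combine the layer with the original page image.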
Is your feature request related to a problem? Please describe.
When I OCR a PDF, I would like to be able to open the PDF and see the OCRed text as a hidden layer.
Describe the solution you'd like
I would like to have an option to output a new PDF file after the "partition" method that will be the original + a hidden text layer of the OCR text.
Additional context
Slack Thread: https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1715109355171469