backwoodsman7 closed 4 months ago
At present only images are grabbed.
Is that because the text is not doable, or too difficult, or is there the possibility of adding the capability at some point?
Old archived books are scanned as images and saved into a PDF collection; they are not saved in text format. Adobe Acrobat Pro and other PDF tools can extract text from the images using OCR (optical character recognition). It's not perfect yet. Soon, visual AI tools like ChatGPT-4o+, Gemini 1.5, and Llava-Llama 3 will hopefully do this really well, so you can convert any scanned page into separate images and searchable text. For now, let's enjoy that we can at least download books in image form.
Most of the books on archive.org do have the text embedded, and the text can be searched when reading them online. And most of the PDFs that can be downloaded also have embedded searchable text. I was just wondering about the possibility of grabbing the text along with the images. This add-on is a great tool even without that capability, but it would be a nice option to have.
(you can ignore this; I hadn't meant to close the thread, and it looks like to re-open it I need to make a new comment.)
Good idea. I will consider it in the next version.
Thank you for a very useful add-on.
Would it be possible to grab the text and add it to the PDF, so it's text-searchable?
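For anyone who wants to experiment in the meantime, here is a rough sketch of one possible approach, not how the add-on works. It assumes the item's text is exposed as a separate plain-text file in the public file listing returned by `https://archive.org/metadata/<identifier>`, and that such files carry the format label `DjVuTXT`; both the label set and the helper names (`text_file_urls`, `fetch_metadata`) are my own assumptions for illustration. Lending-only books may not offer such a file at all.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: format label(s) used for plain-text derivatives of a scan.
TEXT_FORMATS = {"DjVuTXT"}


def text_file_urls(identifier, metadata):
    """Return download URLs for files in `metadata` that look like plain text.

    `metadata` is the parsed JSON of an item's metadata, expected to hold a
    "files" list of dicts with "name" and "format" keys.
    """
    urls = []
    for entry in metadata.get("files", []):
        if entry.get("format") in TEXT_FORMATS:
            # Quote the file name so spaces and similar characters are URL-safe.
            urls.append(
                f"https://archive.org/download/{identifier}/{quote(entry['name'])}"
            )
    return urls


def fetch_metadata(identifier):
    """Fetch and parse the item's metadata (requires network access)."""
    with urlopen(f"https://archive.org/metadata/{identifier}") as resp:
        return json.load(resp)
```

Usage would be something like `text_file_urls(item_id, fetch_metadata(item_id))`, where `item_id` is the identifier from the book's URL. An empty result means no matching text file was listed, in which case OCR on the grabbed images would be the fallback.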