elementdavv / internet_archive_downloader

A Chrome/Firefox extension that downloads books from Internet Archive (archive.org) and HathiTrust Digital Library (hathitrust.org)
GNU Affero General Public License v3.0

Text? #37

Closed by backwoodsman7 4 months ago

backwoodsman7 commented 5 months ago

Thank you for a very useful add-on.

Would it be possible to grab the text and add it to the PDF, so it's text-searchable?

elementdavv commented 5 months ago

At present only images are grabbed.

backwoodsman7 commented 5 months ago

Is that because the text is not doable, or too difficult, or is there the possibility of adding the capability at some point?

kvasirasia commented 5 months ago

Archived old books are scanned as images and saved into a PDF collection; they are not saved in text format. Adobe Acrobat Pro and other PDF tools can extract text from images using OCR (optical character recognition). It's not perfect... yet. Soon, visual AI tools like ChatGPT-4o+, Gemini 1.5, and Llava-Llama 3 will hopefully do this really well, so you can convert any scanned page into separate images and searchable text. Let's enjoy that we can at least download books in image form for now.

backwoodsman7 commented 5 months ago

Most of the books on archive.org do have the text embedded, and the text can be searched when reading them online. And most of the PDFs that can be downloaded also have embedded searchable text. I was just wondering about the possibility of grabbing the text along with the images. This add-on is a great tool even without that capability, but it would be a nice option to have.

backwoodsman7 commented 5 months ago

(you can ignore this; I hadn't meant to close the thread, and it looks like to re-open it I need to make a new comment.)

elementdavv commented 4 months ago

> Most of the books on archive.org do have the text embedded, and the text can be searched when reading them online. And most of the PDFs that can be downloaded also have embedded searchable text. I was just wondering about the possibility of grabbing the text along with the images. This add-on is a great tool even without that capability, but it would be a nice option to have.

Good idea. I will consider it in the next version.