MiniGlome / Archive.org-Downloader

Python3 script to download archive.org books in PDF format

Possible solution for: "Borrow Unavailable"/"Book available to patrons with print disabilities" #99

Open mrelg opened 9 months ago

mrelg commented 9 months ago

Possible solution to issue #65 and issue #88 (will require some coding).

Description: Many books have been set to be only "previewable" and "available to patrons with print disabilities", with the following consequences:

The "Archive.org-Downloader" script tries to borrow the book, fails, and falls back to "This book doesn't need to be borrowed". It then downloads the few available/previewable pages, plus a batch of images that are actually redirections to "https://archive.org/bookreader/static/preview-unavailable.png".

example book: https://archive.org/details/electricnetworks0000unse_l8w2

  1. example of an unavailable image from page 300 (leaf 326): https://ia902509.us.archive.org/BookReader/BookReaderPreview.php?id=electricnetworks0000unse_l8w2&subPrefix=electricnetworks0000unse_l8w2&itemPath=/22/items/electricnetworks0000unse_l8w2&server=ia902509.us.archive.org&page=leaf326&fail=preview&&scale=1&rotate=0

  2. example of redirection: https://archive.org/bookreader/static/preview-unavailable.png

  3. forcing a call to the direct link for leaf 326 (n325): https://archive.org/details/electricnetworks0000unse_l8w2/page/n325/mode/1up

  4. redirects to page 300: https://archive.org/details/electricnetworks0000unse_l8w2/page/300/mode/1up

  5. after that, the image link (leaf 326) temporarily stops redirecting: https://ia902509.us.archive.org/BookReader/BookReaderPreview.php?id=electricnetworks0000unse_l8w2&subPrefix=electricnetworks0000unse_l8w2&itemPath=/22/items/electricnetworks0000unse_l8w2&server=ia902509.us.archive.org&page=leaf326&fail=preview&&scale=1&rotate=0
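Since the failure shows up as a redirect to the placeholder image, a downloader could at least tell real page images from placeholders by looking at the final URL of each request. The sketch below is only an illustration: `is_preview_placeholder` and `PLACEHOLDER_PATH` are hypothetical names, not part of the script, and the check assumes the placeholder path stays the one quoted in this issue.

```python
from urllib.parse import urlsplit

# Path of the placeholder image quoted in this issue (assumed to be stable).
PLACEHOLDER_PATH = "/bookreader/static/preview-unavailable.png"


def is_preview_placeholder(final_url: str) -> bool:
    """Return True if a request ended at the 'preview unavailable' image.

    `final_url` is the URL after redirects were followed.
    """
    parts = urlsplit(final_url)
    host = parts.hostname or ""  # hostname is None for strings without a host
    on_archive = host == "archive.org" or host.endswith(".archive.org")
    return on_archive and parts.path == PLACEHOLDER_PATH


def count_real_pages(final_urls) -> int:
    """Count how many fetched URLs are real page images, not placeholders."""
    return sum(1 for url in final_urls if not is_preview_placeholder(url))
```

If the images are fetched with `requests`, the final URL after redirects is available as `response.url`, so a page could be skipped or reported instead of being saved as a placeholder.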

I'm not sure whether brute-forcing many page calls leads to denial of access, but I think it has to be done one page at a time: after opening a few direct page links, the first of them starts redirecting to preview-unavailable.png again.

mrelg commented 9 months ago

Sadly, this works only until archive.org detects abuse from requests for too many pages too quickly, so a decent back-off timer is required.
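For reference, a back-off timer is usually an exponentially growing wait with a cap and some random jitter. The sketch below is generic and makes no claim about what request rate archive.org tolerates: `backoff_delays` and all of its default values are assumptions chosen for illustration.

```python
import random


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6, rng=random.random):
    """Yield wait times in seconds: exponential growth, capped, with jitter.

    All defaults are illustrative guesses, not values taken from the script.
    """
    for attempt in range(attempts):
        # The upper bound grows as base * factor**attempt, but never above cap.
        ceiling = min(cap, base * factor ** attempt)
        # "Full jitter": wait a random fraction of the current ceiling.
        yield rng() * ceiling


# Usage sketch (commented out so the block has no side effects):
# import time
# for delay in backoff_delays():
#     time.sleep(delay)
#     ...retry the request here, and stop once it succeeds...
```

Passing `rng` as a parameter keeps the function testable; with the default `random.random` each wait falls somewhere between 0 and the current ceiling.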

HaaiSo commented 1 month ago

It seems impossible to reproduce reliably.