balta2ar / manuscript-dl

Collection of scripts to download digitized manuscripts from various online libraries

Download too slow. #2

Closed by syafiqhadzir 5 years ago

syafiqhadzir commented 5 years ago

I need at least 2 days to download a manuscript (400 pages) at resolution 13.

Any idea how to improve this code? My internet speed is 100 Mbps.

balta2ar commented 5 years ago

I'd prefer to keep the code as simple as possible, so I'd rather not add any parallelization to it. Instead, you can use the --pages option, which lets you specify which pages to download. You could create a bash script that runs multiple instances of this tool, each with its own page range, e.g.:

#!/bin/bash
python3 bl.uk.py add_ms_24686 --resolution 12 --pages 1:49 &
python3 bl.uk.py add_ms_24686 --resolution 12 --pages 50:99 &
# ... and so on...

& means that the command will run in the background. The script will finish instantly, but the background jobs will keep working and printing logs to the console.
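Writing the page ranges by hand gets tedious for a 400-page manuscript, so here is a minimal Python sketch that generates the command lines. It is not part of this repository. The helper names `page_ranges` and `build_commands` are made up for illustration, the chunk size of 50 is arbitrary, and it assumes that --pages treats both ends of a range as inclusive, which is what the 1:49 / 50:99 example suggests.

```python
def page_ranges(total_pages, chunk_size):
    """Split pages 1..total_pages into consecutive (first, last) chunks."""
    ranges = []
    first = 1
    while first <= total_pages:
        last = min(first + chunk_size - 1, total_pages)
        ranges.append((first, last))
        first = last + 1
    return ranges


def build_commands(manuscript, resolution, total_pages, chunk_size, background=False):
    """Return one downloader command line per page range.

    Assumption: --pages accepts FIRST:LAST with both ends inclusive.
    With background=True each line gets a trailing " &" for use in a bash script.
    """
    suffix = " &" if background else ""
    return [
        f"python3 bl.uk.py {manuscript} --resolution {resolution} "
        f"--pages {first}:{last}{suffix}"
        for first, last in page_ranges(total_pages, chunk_size)
    ]


if __name__ == "__main__":
    # 400 pages is the figure from the question; 50 pages per job is arbitrary.
    for command in build_commands("add_ms_24686", 12, 400, 50, background=True):
        print(command)
```

The printed lines can be pasted under the #!/bin/bash line of a script like the one above. Passing background=False gives the same lines without the trailing &.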

Another approach (which I would take if I were downloading in parallel) is to use the parallel tool (https://www.gnu.org/software/parallel/). First, create a similar text file with bash commands, e.g.:

python3 bl.uk.py add_ms_24686 --resolution 12 --pages 1:49
python3 bl.uk.py add_ms_24686 --resolution 12 --pages 50:99
... and so on ...

Notice that this time the file does not contain a shebang (#!/bin/bash) on the first line, and there are no & symbols at the ends of the lines. Name the file whatever you want, e.g. commands.txt. Now run parallel (you may need to install this command separately, depending on your OS/distribution):

parallel -j10 --bar bash -c "{}" :::: <(cat commands.txt)

This will run 10 instances of the downloader in parallel showing a nice-looking progress bar.
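If installing parallel is not an option, roughly the same effect can be had with a small wrapper in Python. The following is only a sketch under a few assumptions: `read_commands`, `run_command` and `run_all` are hypothetical helper names, each line of commands.txt is a plain command with no pipes or redirection (the lines are split with shlex and run without a shell), and there is no progress bar.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_commands(path):
    """Return the non-empty lines of a commands file, skipping '#' comments."""
    with open(path, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle]
    return [line for line in lines if line and not line.startswith("#")]


def run_command(command):
    """Run one command line without a shell and return its exit code."""
    return subprocess.run(shlex.split(command)).returncode


def run_all(commands, jobs=10):
    """Run the commands with at most `jobs` at a time.

    Returns the exit codes in the same order as the input commands.
    Threads are enough here because each worker just waits on a subprocess.
    """
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_command, commands))
```

Calling run_all(read_commands("commands.txt"), jobs=10) would keep up to 10 downloaders running at once, and any non-zero value in the returned list points at a page range that is worth rerunning.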

Lastly, when all the pages have finished downloading, you will need to run the script once more without the --pages option to recombine all the pages. The script will still send occasional network requests, but this time it's gonna be much faster, since most of the data will be reread from disk.

Hope that helps.

syafiqhadzir commented 5 years ago

Thanks.

amkaak commented 2 years ago

Hello, please help me to download from http://www.bl.uk/