AlexanderMelde / dl_for_heise

Simple shell script to batch download heise magazine PDFs. You need to have access to the archive via a paid subscription to use this.

Speedup? #5

Open gjaekel opened 2 years ago

gjaekel commented 2 years ago

Dear Alexander, first of all, a big thank you to you and the other contributors mentioned in the README.

To speed up archiving a whole year, I wonder whether it would be possible to trigger the generation of (all or some of) the documents in a first step, and then start the download of all of them in a second step.

This might avoid busy-waiting for the generation time of every single document.
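As a very rough sketch of the idea (the `--trigger-only` flag is purely hypothetical and does not exist in download.sh):

```sh
# Hypothetical two-pass flow; --trigger-only is NOT an existing option of download.sh.
./download.sh --trigger-only   # pass 1: only request PDF generation for every issue
sleep 1800                     # give the backend time to render the documents
./download.sh                  # pass 2: download the now-prepared PDFs
```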

I'll start a proof of concept (POC) on this now.

gjaekel commented 2 years ago

By temporarily inserting a `continue 2` before L120 ... https://github.com/AlexanderMelde/dl_for_heise/blob/2e592419e5c0c9635c7e554428670f914559c92e/download.sh#L117-L121 ... I made a POC version that only triggers the backend to prepare the documents.
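Roughly, the temporary change looks like this (paraphrased sketch, not the actual lines of download.sh; the variable names and the curl call are assumptions):

```sh
#!/bin/bash
# Paraphrased sketch of the POC change, NOT a verbatim copy of download.sh.
# The point is only that the request starts the server-side PDF rendering and
# "continue 2" then skips the wait-and-download part, jumping straight to the
# next iteration of the outer loop over issues.
max_tries_per_download=1
issues=(01 02 03 04 05)

for issue in "${issues[@]}"; do                        # outer loop over issues
    for try in $(seq "$max_tries_per_download"); do    # inner retry loop
        curl -s -o /dev/null "https://example.invalid/issue/$issue.pdf"  # trigger rendering (placeholder URL)
        continue 2                                     # POC: skip downloading, next issue
        # ... original wait-and-download logic would run here ...
    done
done
```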

I ran this to prepare five issues and started the unmodified version immediately afterwards, but that was too fast: it seems I hit a DoS protection because of too many requests at a time, and I got an HTTP 500. After about half an hour I tried again, and that run downloaded the five issues one by one without any delay.

gjaekel commented 2 years ago

Another approach: I just set `max_tries_per_download=1` (L9) and ran the script multiple times.

  1. The 1st run triggers the preparation; this takes about a minute for a whole year of issues.
  2. The 2nd run was able to get most of the issues without any waiting time. It failed at issues 22 and 26, maybe because this run started just a little too early.
  3. The 3rd run was able to get the remaining issues.

Maybe a good concept would be to simply "pull out" the download retries from the innermost loop to an outermost one, as sketched below.
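One way to do that without touching the inner logic is a small wrapper that re-runs the whole script a few times (a sketch, assuming `max_tries_per_download=1` is set in download.sh and that the script skips issues that already exist locally, which the runs above suggest it does):

```sh
#!/bin/bash
# Outer retry loop around the whole script: the first pass mostly triggers the
# server-side PDF generation, later passes pick up whatever is ready by then.
max_outer_tries=3
for run in $(seq "$max_outer_tries"); do
    echo "=== pass $run of $max_outer_tries ==="
    ./download.sh
    # Give the backend time to finish rendering before the next pass
    # (also helps to avoid the rate limiting / HTTP 500 mentioned above).
    [ "$run" -lt "$max_outer_tries" ] && sleep 300
done
```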

AlexanderMelde commented 2 years ago

Hi Guido, you're welcome! Thank you for documenting your experiments.

Indeed, I followed a very similar approach and just ran the script twice: a first run with no repetitions or wait times to merely "trigger" the server-side PDF generation, and a second run to actually download the files. That worked well and was especially fast for most PDFs, but it wasn't really reliable (e.g. due to the DDoS protection).

For this script, I decided to keep it "safe, but slow", with a high number of repetitions and long wait times, to ensure everything gets downloaded, e.g. overnight. If you want a quicker run, at the cost of manually monitoring the progress, feel free to adapt the parameters (as we both did :) ). Maybe we could introduce some kind of config files, or example sets of parameters to include in the README file 😄
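One possible shape for such presets, as a sketch (the file names and the `wait_between_tries` variable are hypothetical, and the values are illustrative rather than the script's actual defaults; only `max_tries_per_download` is taken from the discussion above):

```sh
# presets/safe.conf -- "safe, but slow": high retry count, runs unattended, e.g. overnight
max_tries_per_download=20
wait_between_tries=60

# presets/fast.conf -- two-pass style: trigger once, re-run the script shortly afterwards
max_tries_per_download=1
wait_between_tries=0
```

download.sh could then load one of them near the top via something like `source "presets/${1:-safe}.conf"`.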