Checksums are very slow

roffikk commented 4 years ago

In version 1.5.0 jwb-index script is very slow synchronizing with jw.org. These are steps I presume your script is taking (I didn't look into code, just observing its behaviour): 1) Downloading json files from all categories. This is fast. 2) Checking it against files on disk. This is the problem. Before update it was really fast, maybe not instant, but now it takes 10-15 minutes or more. 3) Downloading new or updated files. There's no lag here. 4) Creating links. It's OK.

How can I help track down this issue? Is there a log file?

I managed to register on GitHub and write this comment before it moved to 3rd phase.

allejok96 commented 4 years ago

Haha okay that is slow. Thanks for registering and bringing my attention to this.

By default the script checks the md5 checksum of files. This is probably what's going on. You should be able to turn it off with --no-checksum.

I don't know why it has been fast before tho. Checksumming has been enabled by default since forever. It might have been off in 2017 but that would be a bug (just for reference https://github.com/allejok96/jw-scripts/commit/97e9c767417c6ff7e3932d0aca9e9afcc970c404#diff-d1acb20e5fa951d2afe925c2e3706ef8)

I think it checks all existing files, every time ... As far as I can remember it has done this for a long time too. If it wasn't broken... :) Or your catalogue has grown big.

I'll consider to change the default, since md5 is cpu and disk heavy. A simple file size check is enough most of the time.

If it's not that, we have a problem.

PS nope there is no log file, I'm afraid. And Python debugging is a mess without a good IDE.

allejok96 commented 4 years ago

I just saw --no-checksum isn't working... :facepalm:

allejok96 commented 4 years ago

I'll push this fix later, until then: in arguments.py, please change line 135 from

add_predefined('--no-checksum', action='store_false', dest='checksum',

to

add_predefined('--no-checksum', action='store_false', dest='checksums',

roffikk commented 4 years ago

Ok, I did some calculations and the time between 10 and 15 minutes seems reasonable for checking md5 sums, if it's checking every file. I have currently over 170 GB of movies (1666 files). When I first used your tool, I downloaded about 100 GB of data. So, it didn't grow big that much. I'm not sure what version I used before, but quite sure it didn't take that long.

The patch works fine, the script is close to instant after downloading indexes.

I suppose you deleted one comment, but it clarified why the script was deleting fully downloaded file, sometimes several times in a row. Somehow the checksum had to be wrong. Yesterday I put the file in folder manually and it accepted it (even without turning off checksums).

I've got an idea. Maybe it's sufficient to check md5 sum only for new and updated files only after downloading them? When the file is already on the disk, there's not much it could happen to it - so we can assume that file is intact next time, if the ~~date~~ size wasn't changed.

allejok96 commented 4 years ago

You're absolutely right. The checking of all files dates back to the days I was trying to create a fully automated, headless Rasperry Pi playing downloading and playing videos infinitely, and I had to deal with the risk of SD card corruption and buggy video players...

And since you say sometimes the sums are even incorrect, I think it's better to have it default to only checking size. It's pretty uncommon for a download to get corrupted. Web browsers don't check md5s, and I doubt even JW Library does, since it creates such a CPU load...

For a moment I thought you had the reverse the order of point 2. and 3. in your first message, but when I saw you were right I deleted my last message. But yeah, there's a lack of good info about the procedure of the script... Even I had it mixed up.

I just realized another thing... The checksums are only checked for the files included in the request. This fact is not very obvious either. That means if you run --category LatestVideos it will only check those 10-20 files. But if you run --category VideoOnDemand, it will basically check everything.

So, here's the plan:

default to no checksum
add option --checksum=new for downloaded files only (simply --checksum will default to this)
add option --checksum=all for all indexed files (can run withot --download just for checking)
files with bad checksum (and or size?) won't be deleted, only display warning (slightly broken is better than nothing?)
add option --overwrite to re-download bad files. Don't delete if we won't download.
partially downloaded files that gets completed and fails checks (md5 optional) are always deleted and a fresh download started
add more verbosity to the output so the order of operation becomes more clear (can be controlled with -q)

What do you think?

roffikk commented 4 years ago

Oh, that's great, but it seems like a lot of work for you. I like particularly the "Don't delete if we won't download", cause now files first get deleted, and later re-downloaded. Or not.

So, I am for it, but maybe you can simplify this? Don't work too hard;) There are more important things in life.

allejok96 commented 4 years ago

Thanks for your consideration and feedback. Don't worry, I'm on vacation :) But I can't go anywhere, because Corona, you know. It may sound like much work but I think it can be achieved with only a few changes to the download code.

allejok96 commented 4 years ago

Default behavior is now changed, the old one can be activated with --checksum and --fix-broken.

You were right, it easier said than done, but that was because I discovered so many other bugs :) Well, now they are fixed I hope. Good thing people aren't utilizing all the features.

allejok96 / jw-scripts

Checksums are very slow #29