Preload the download archive for better performance

jbruchon commented 4 years ago

My YTDL archive is quite large:

$ wc -l ytdl_archive.txt
76923 ytdl_archive.txt

I have noticed that performance for "is the video in the archive?" checking becomes slow quickly, even if only 100 more video lines were added to an already-sorted archive, but a quick command-line sort of the archive results in a massive performance boost:

$ sort < ytdl_archive.txt > xxx && mv xxx ytdl_archive.txt

I would suggest that for any batch (channel, playlist, etc.) download that also uses the archive and adds at least one line to the archive, sort the archive at the end of the batch to maintain performance for future instances that use that archive. It may also be good to add an option to disable this post-batch archive sorting for anyone who also uses the archive as a record of the original order of downloads.

blackjack4494 commented 4 years ago

Sure shouldn't be a problem to add some sorting support. However I would rather not have this on by default. Some custom command like --auto-sort-archive or so should trigger that.
I don't have the time to implement this myself. Maybe someone else is willing to if you aren't able to implement it yourself which is totally fine.
Will let this open for now and maybe close it in a week or so but appending the Future tag to it.

jbruchon commented 4 years ago

Here's a better question: why is the performance sensitive to the sort order of the archive text file in the first place?

blackjack4494 commented 4 years ago

No clue. Probably some naive algorithm.

Actually found the lines
https://github.com/blackjack4494/youtube-dlc/blob/7ac0ba50ce2e60f9105f94f1c66e7d8fa9b6b692/youtube_dlc/YoutubeDL.py#L2136-L2153

This will take quite some time especially the more videos are added. Should be fine for maybe a few thousand videos. But starting with 10k+ it's going to be hell. Sorting will not really help. It is just luck in your case that the video got found earlier.
What really is needed would be some advanced searching and sorting algorithm. For bigger archives indexing would be quite benificial as well.
Maybe this can be skipped and some database could be used that handles such things (but speaking of really heavy users here).
I wonder if we can find some programmers. Doesn't even have to be in python directly (can be ported later if it's not some sketchy language or special javascript).

jbruchon commented 4 years ago

OK, I just looked at the code and I know what's going on now. There are two aspects to this issue:

The archive file is re-opened and re-scanned for every single individual video, which is massively inefficient but allows multiple simultaneous invocations of the program to share an archive (edit: share a constantly updated archive; sharing an archive as described in the next list requires no special support anyway), and
New files are appended to the end of the archive file, so scanning the archive for more recent downloads requires consumption of the entire archive before finding the desired string

There is not any sort of locking going on with the archive, and my experience trying to run two instances over the same playlist has been ugly, so I'd propose the following:

Preload the archive at program invocation
Append to the archive as usual, but maintain an internal tree of video keys instead of scanning the file every time, and add to the tree when appending a newly downloaded video to the archive
Officially consider running multiple instances over the same channel/playlist/etc. at the same time to be unsupported and at the user's own risk

This solves the problem, makes archive checks insanely faster, and no sorting of the archive file itself is necessary.

I'll work on this and make a PR if you think that sounds OK. Edit: you ninja'd me! I found it too ;-)

jbruchon commented 4 years ago

I'm working on this now. I have it working nicely so far, except it turns out that by sorting the archive file, I created the worst-case scenario for a binary search tree. It's bad because it loads so slowly with a sorted list that it looks like it has hung, but it's good because ~~I can fix the problem using a weight-balanced tree~~ I found another simpler solution.

jbruchon commented 4 years ago

PR #125 generated to fix this. Tested against a new non-existent archive and archives of line count 1, 2, and >70K, and against the 70K data in both randomized and sorted orders. Everything seems to work great. Please feel free to squash and merge.

blackjack4494 / youtube-dlc

Preload the download archive for better performance #122