Closed jbruchon closed 4 years ago
Sure shouldn't be a problem to add some sorting support. However I would rather not have this on by default. Some custom command like --auto-sort-archive
or so should trigger that.
I don't have the time to implement this myself. Maybe someone else is willing to if you aren't able to implement it yourself which is totally fine.
Will let this open for now and maybe close it in a week or so but appending the Future tag to it.
Here's a better question: why is the performance sensitive to the sort order of the archive text file in the first place?
No clue. Probably some naive algorithm.
Actually found the lines
https://github.com/blackjack4494/youtube-dlc/blob/7ac0ba50ce2e60f9105f94f1c66e7d8fa9b6b692/youtube_dlc/YoutubeDL.py#L2136-L2153
This will take quite some time especially the more videos are added. Should be fine for maybe a few thousand videos. But starting with 10k+ it's going to be hell.
Sorting will not really help. It is just luck in your case that the video got found earlier.
What really is needed would be some advanced searching and sorting algorithm. For bigger archives indexing would be quite benificial as well.
Maybe this can be skipped and some database could be used that handles such things (but speaking of really heavy users here).
I wonder if we can find some programmers. Doesn't even have to be in python directly (can be ported later if it's not some sketchy language or special javascript).
OK, I just looked at the code and I know what's going on now. There are two aspects to this issue:
There is not any sort of locking going on with the archive, and my experience trying to run two instances over the same playlist has been ugly, so I'd propose the following:
This solves the problem, makes archive checks insanely faster, and no sorting of the archive file itself is necessary.
I'll work on this and make a PR if you think that sounds OK. Edit: you ninja'd me! I found it too ;-)
I'm working on this now. I have it working nicely so far, except it turns out that by sorting the archive file, I created the worst-case scenario for a binary search tree. It's bad because it loads so slowly with a sorted list that it looks like it has hung, but it's good because I can fix the problem using a weight-balanced tree I found another simpler solution.
PR #125 generated to fix this. Tested against a new non-existent archive and archives of line count 1, 2, and >70K, and against the 70K data in both randomized and sorted orders. Everything seems to work great. Please feel free to squash and merge.
My YTDL archive is quite large:
I have noticed that performance for "is the video in the archive?" checking becomes slow quickly, even if only 100 more video lines were added to an already-sorted archive, but a quick command-line sort of the archive results in a massive performance boost:
I would suggest that for any batch (channel, playlist, etc.) download that also uses the archive and adds at least one line to the archive, sort the archive at the end of the batch to maintain performance for future instances that use that archive. It may also be good to add an option to disable this post-batch archive sorting for anyone who also uses the archive as a record of the original order of downloads.