Jules-WinnfieldX / CyberDropDownloader

Bulk Gallery Downloader for Cyberdrop.me and Other Sites
GNU General Public License v3.0

[FEATURE] Bin URLs to prioritize faster links #786

Closed bewbylover closed 8 months ago

bewbylover commented 8 months ago

Is your feature request related to a problem? Please describe. Not really, but it does seem that the scrapes can get stuck behind slower downloaders like pixeldrain. I frequently see pixeldrain collections need to complete before other downloads are processed. I currently have 7k files in the download queue, and downloads tend to burst once a set has completed, then pause again while another collection completes.

Describe the solution you'd like Could we bin all the links so that they are downloaded at increased speed? Scraping the URL.txt list should be first priority, but each link there could also be binned by its starting URL (or similar) and then processed with whatever delay and connection count is appropriate. Once those are processed, handle the download links in the same way: group collections ahead of direct image/video links and process them first, then process the item links. Item links could also be binned by their first URL so that nothing blocks the faster downloads. Possibly a priority config file for the download sites could keep the slower sites towards the end. Sharing the URL bin counts might also make a good status display for the download queue, and for the scrape queue as well.
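A rough sketch of the kind of binning I mean, purely illustrative (none of these names come from CDL):

from collections import defaultdict
from urllib.parse import urlparse

def bin_urls(urls):
    # group queued URLs into bins keyed by host, e.g. "pixeldrain.com"
    bins = defaultdict(list)
    for url in urls:
        bins[urlparse(url).netloc].append(url)
    return bins

urls = [
    "https://pixeldrain.com/u/abc123",
    "https://cyberdrop.me/a/example",
    "https://pixeldrain.com/u/def456",
]
for host, links in bin_urls(urls).items():
    print(host, len(links))  # bin counts could double as a queue status display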

Additional context N/A

Do you allow any development help? Would you be interested in any?

Jules-WinnfieldX commented 8 months ago

I frequently see pixeldrain collections need to complete before other downloads are processed.

This would really only be the case if your simultaneous downloads setting is the same number as your simultaneous downloads per domain.

If they aren't, then it's essentially a round robin: any domain can take one of the slots up to the global max, as long as it hasn't reached its per-domain max.
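Roughly how those two settings interact, as an illustrative asyncio sketch (not the actual CDL code):

import asyncio
from collections import defaultdict

MAX_TOTAL = 100       # max_simultaneous_downloads
MAX_PER_DOMAIN = 6    # max_simultaneous_downloads_per_domain

async def download(url, domain, global_slots, domain_slots):
    # a transfer only runs while it holds both a global slot and a per-domain slot,
    # so a slow domain can occupy at most MAX_PER_DOMAIN of the global slots
    async with global_slots, domain_slots[domain]:
        await asyncio.sleep(1)  # stand-in for the actual transfer
        print("done", url)

async def main():
    global_slots = asyncio.Semaphore(MAX_TOTAL)
    domain_slots = defaultdict(lambda: asyncio.Semaphore(MAX_PER_DOMAIN))
    jobs = [("https://pixeldrain.com/u/x", "pixeldrain.com"),
            ("https://cyberdrop.me/a/y", "cyberdrop.me")]
    await asyncio.gather(*(download(u, d, global_slots, domain_slots) for u, d in jobs))

asyncio.run(main())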

Could we bin all the links so that they are downloaded at increased speed?

Not really, as the tasks to scrape and subsequently the tasks to download are not linked at all. Priority is given to whatever links arrived first (essentially). There is no ordering or anything else happening in the background; it's simply first come, first served.

Do you allow any development help? Would you be interested in any?

I'm not against it, though this has always been predominantly a personal project for me to learn with. People are welcome to make pull requests, but that doesn't necessarily mean I will merge them.

bewbylover commented 8 months ago

The download numbers aren't the same. I think this is where that is defined, in "global_settings.yaml":

Rate_Limiting_Options:
  connection_timeout: 15
  download_attempts: 10
  download_delay: 0.5
  max_simultaneous_downloads: 100
  max_simultaneous_downloads_per_domain: 6
  rate_limit: 50
  read_timeout: 300

From PD I am seeing only 2 simultaneous downloads. I should have up to 100, and when I start I do get more than 2 at once, but I have never seen 100 attempted at once even though I have 7k downloads queued.

That's good news on the scraping/downloading split. If I am right about the blocking, binning the URLs could help. I can play with the release, see what I can understand about the "blocked" download queue, and share any of that insight.

That's fine. I'll share what I uncover, and if you want to incorporate it or see a proposed code change, I can try that. Would the best way to share that be to open a new feature request?

Thanks again for all you do.

Jules-WinnfieldX commented 8 months ago

From PD I am seeing only 2 simultaneous downloads

PD has an internal CDL limit of 2.

I likely won't be adding binning regardless. It'd require a substantial amount of rewriting and would fundamentally change how things work at the moment. It also adds a lot of complexity, and there is no way to know which bin will be fastest.

You can play around with it as much as you want though, and any change would come in through a pull request.

bewbylover commented 8 months ago

That explains why it is always 2, but I'm still not sure why the other downloads don't trigger.

You don't need to know which bin is fastest if it rotates through them; it just ensures there isn't starvation. My guess is that starvation is what is happening now for some reason, but I don't have enough insight into what the download queue looks like.
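For what it's worth, the rotation I have in mind is only something like this (illustrative, not a proposal for actual code):

from collections import deque

def round_robin(bins):
    # take one URL from each non-empty bin in turn so no bin is starved
    queues = {host: deque(urls) for host, urls in bins.items()}
    while any(queues.values()):
        for q in queues.values():
            if q:
                yield q.popleft()

bins = {"pixeldrain.com": ["pd1", "pd2", "pd3"], "cyberdrop.me": ["cd1"]}
print(list(round_robin(bins)))  # ['pd1', 'cd1', 'pd2', 'pd3']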

I'll play a little and try to understand your internals and see if there is anything that can be done without the bins. Thanks for the info.

Jules-WinnfieldX commented 8 months ago

To shed some light: what you are describing in the original post isn't really possible.

All the scrapers run independently, as well as all the downloaders. There is a singular scraper and a singular downloader for each website. They all run concurrently (actually asynchronous tasks, but same same, but different).

The only thing that'll get caught up behind pixeldrain links is more pixeldrain links.
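In very rough terms the shape is something like this (a simplified illustration, not the real code):

import asyncio

SITES = ["pixeldrain", "cyberdrop"]  # stand-ins for the supported sites

async def scrape(site):
    await asyncio.sleep(0.1)                      # stand-in for scraping that site
    return [f"{site}/file{i}" for i in range(3)]

async def handle_site(site):
    # each site gets its own scrape/download flow, so a slow site only delays its own links
    for link in await scrape(site):
        await asyncio.sleep(0.1)                  # stand-in for the download
        print("downloaded", link)

async def main():
    await asyncio.gather(*(handle_site(site) for site in SITES))

asyncio.run(main())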

Jules-WinnfieldX commented 8 months ago

To expand, there also isn't a queue. It is quite literally just first come, first served. With most sites, the scraping pauses while the latest download happens; with others, all the download tasks are created up front and wait for an open spot to begin processing.

bewbylover commented 8 months ago

Makes sense.

https://github.com/Jules-WinnfieldX/CyberDropDownloader/issues/785 is part of the reason I am thinking the way I am (that, and my not having read through the code). That file is from the second post on the first page of a 4-page thread that has been on my URL list for a month or more, so I should have downloaded it a while ago. I have preserved the DB since V4 so I don't redownload things, but for some reason it just broke free and attempted to download today. So for some reason the task wasn't moving into the open spot. At some point, no matter how long I run, the only things I have downloading are 2 PD. I am not sure why PD is able to take open spots when the other download tasks aren't.