KJHJason / Cultured-Downloader-CLI

Command-line version of the original Cultured Downloader Python program
GNU General Public License v3.0

Concurrent download #45

Closed: palapapa closed this issue 1 year ago

palapapa commented 1 year ago

Could you add support for downloading multiple files concurrently to speed up the download?

KJHJason commented 1 year ago

But there are already concurrent downloads, though?

palapapa commented 1 year ago

But I see it only downloads one post at a time. Does it concurrently download the files in a single post and a single post only?

KJHJason commented 1 year ago

For Fantia, previous versions of this program did download multiple posts at once. However, as reported in issue https://github.com/KJHJason/Cultured-Downloader/issues/41, the signed Fantia URL(s) expire within a short period of time.

Hence, to fix this, I changed the logic such that it downloads one post at a time but the files within the post are downloaded concurrently.

palapapa commented 1 year ago

Thanks. I was downloading from Fantia.

palapapa commented 1 year ago

Sorry if this is a stupid question. I haven't worked with AWS before, and after a quick Google I found that a signed URL is supposed to allow a user to access some file for a limited time. I don't understand how downloading from multiple posts at the same time causes the URL to expire, but downloading multiple files within a single post doesn't.

KJHJason commented 1 year ago

Yes, Fantia uses AWS S3 signed URLs with a short expiry duration of several minutes (10-15 minutes, I believe), so anyone who has the link can view the content only until it expires.

To answer your question, my old download logic obtained all the signed URLs from all the posts before the concurrent downloads started. Hence, there was a possible edge case: if the user's connection was slow, the last few signed URLs might have expired by the time their downloads began, resulting in a 403 response. With the current download logic, only the signed URLs within a single post are obtained before that post's concurrent downloads start, so they are unlikely to expire unless the post has a very large number of files, which is rare.
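A minimal sketch of the current logic described above, not the program's actual code. `Post`, `fetchSignedURLs`, and `downloadFile` are hypothetical placeholders (no real Fantia API call or HTTP request is made); the point is only that each post's signed URLs are requested right before that post's files are downloaded.

```go
package main

import (
	"fmt"
	"sync"
)

// Post is a hypothetical stand-in for a Fantia post; only the ID matters here.
type Post struct {
	ID string
}

// fetchSignedURLs is a hypothetical placeholder for the API call that returns
// fresh signed URLs for one post. Here it fabricates two URLs per post.
func fetchSignedURLs(p Post) []string {
	return []string{
		"https://example.invalid/" + p.ID + "/1.jpg?sig=placeholder",
		"https://example.invalid/" + p.ID + "/2.jpg?sig=placeholder",
	}
}

// downloadFile is a hypothetical placeholder for the real HTTP download.
func downloadFile(url string) error {
	return nil
}

// downloadPosts processes posts one at a time. The signed URLs of a post are
// requested right before its files are downloaded, so they have little time to
// expire. Files within the post are downloaded concurrently.
// It returns the number of files downloaded successfully.
func downloadPosts(posts []Post) int {
	var mu sync.Mutex
	done := 0
	for _, p := range posts {
		urls := fetchSignedURLs(p) // fresh URLs for this post only
		var wg sync.WaitGroup
		for _, u := range urls {
			wg.Add(1)
			go func(u string) {
				defer wg.Done()
				if err := downloadFile(u); err == nil {
					mu.Lock()
					done++
					mu.Unlock()
				}
			}(u)
		}
		wg.Wait() // finish this post before requesting the next post's URLs
	}
	return done
}

func main() {
	posts := []Post{{ID: "100"}, {ID: "101"}, {ID: "102"}}
	fmt.Println("files downloaded:", downloadPosts(posts))
}
```

The old logic would correspond to calling `fetchSignedURLs` for every post up front and only then starting the downloads, which is where the URLs at the end of the queue could go stale.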

That said, it is possible to download posts concurrently by having one goroutine per post obtain the JSON from Fantia's API, process the JSON response, and then download the files within that post concurrently. The CPU and RAM usage should not be too high with nested goroutine pools since goroutines are lightweight.
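A rough sketch of what such nested goroutine pools could look like, under assumptions: `fetchPostFiles` and `download` are hypothetical placeholders for the API request and the file download, and the two limits are illustrative parameters rather than existing options of this program.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fetchPostFiles is a hypothetical placeholder for requesting a post's JSON
// and extracting its signed file URLs. Here it fabricates three URLs.
func fetchPostFiles(postID int) []string {
	return []string{
		fmt.Sprintf("https://example.invalid/%d/a.jpg", postID),
		fmt.Sprintf("https://example.invalid/%d/b.jpg", postID),
		fmt.Sprintf("https://example.invalid/%d/c.jpg", postID),
	}
}

// download is a hypothetical placeholder for the real HTTP download.
func download(url string) error {
	return nil
}

// downloadConcurrently runs an outer pool over posts and, inside each post,
// an inner pool over that post's files. Buffered channels act as semaphores
// that cap how many goroutines run at each level. Both limits must be >= 1.
// It returns the number of files downloaded successfully.
func downloadConcurrently(postIDs []int, maxPosts, maxFilesPerPost int) int64 {
	var downloaded int64
	postSlots := make(chan struct{}, maxPosts)
	var postWG sync.WaitGroup

	for _, id := range postIDs {
		postWG.Add(1)
		postSlots <- struct{}{} // wait for a free post slot
		go func(id int) {
			defer postWG.Done()
			defer func() { <-postSlots }()

			// URLs are fetched inside the post's goroutine, right before use.
			files := fetchPostFiles(id)
			fileSlots := make(chan struct{}, maxFilesPerPost)
			var fileWG sync.WaitGroup
			for _, f := range files {
				fileWG.Add(1)
				fileSlots <- struct{}{} // wait for a free file slot
				go func(f string) {
					defer fileWG.Done()
					defer func() { <-fileSlots }()
					if download(f) == nil {
						atomic.AddInt64(&downloaded, 1)
					}
				}(f)
			}
			fileWG.Wait()
		}(id)
	}
	postWG.Wait()
	return atomic.LoadInt64(&downloaded)
}

func main() {
	n := downloadConcurrently([]int{1, 2, 3, 4}, 2, 2)
	fmt.Println("files downloaded:", n)
}
```

With these limits, at most `maxPosts * maxFilesPerPost` downloads are in flight at once, which is the number a rate limiter or bot protection would see.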

However, the trade-off is that the progress indicator would become less verbose, and the logic would become more complex due to the occasional reCAPTCHA checks. Also, this might not be future-proof if Fantia decides to add some sort of bot protection, such as Cloudflare as a reverse proxy in front of their CDN, since this download behaviour would likely be flagged.

Nonetheless, I might still consider adding an option to download Fantia posts concurrently if there are more requests for it. How many posts are you downloading that you would need concurrency across Fantia posts, though? I believe that the download speed for Fantia is already relatively fast compared to Pixiv Fanbox and Kemono.