Cannot complete download of large photosets - checking existing files requires connection which can fail + slow

eggplantedd commented 2 years ago

Worth opening a separate post for.

This is not about the flickr-api being slow or even that it has errors- just the way the program handles this in my attempt to run it unattended.

The program will occasionally hit HTTP Server Error 500 which forces it to close. Thankfully, upon restart it will check to see if a file has already been downloaded. The issue is how this is handled.

It seems a working connection to the Flickr website is required to check files. I'm guessing it starts the download/connection process with the latest photoset, and then only check if it exists locally.

This leaves the program liable to fail at the checking stage due to API errors, making it very hard to reach the resume download stage with large photosets, as well as being slow.

For example - I am on 17,402 images. On average I'm going to guess it can check 1.6 photos a second, which means 3 hours of no API issues before it can resume downloading.

Using https://github.com/chebum/Supervisor to automatically restart the program, I have left it running all day just to see it stuck checking files.

eggplantedd commented 2 years ago

So the solution- make the checking process for existing files an 'offline' job.

Make the program create a list of the folders and files before starting download (+ file size*) and check them off as it works down them.

When the program is restarted after a 'fatal' API error, you can whizz through this list and resume right where you left off.

This isn't just about knowing where to resume- the whole file checking process is just shifted offline. You could even redownload the list at program restart to validate the previous list. Even if this took a while I would be happy to work this way.

I know you could move files around in the meantime, but the same could be said about the current check process, so no change.

*I don't know if the flickr-api lets you check file size without downloading,

beaufour commented 2 years ago

I've never focused too much on the speed of download here, or downloading massive sets, so it's not surprising that it's not working super well :) Currently the logic is that it gets the list of all photo sets and then starts from one end and downloads each set. The issues are: 1) The Flickr API is quite slow and 2) we need to do an API call for each photo in the set to get the metadata.

I'm not fully understanding your logic change here. To build the list of files and folders, we'd have to call the Flickr API in the first place. And to know if a photo is downloaded we need it's metadata. At least, that's the current logic. Let me have a look at the API returns and see if we can optimize something here.

eggplantedd commented 2 years ago

Yeaah I thought a couple of hours after, getting a list of files would rely on calling the API in the first place.

So I would think the next best thing is to use a headless web browser to just scrape the folder/photo names and photo links. So just relying on the website being up and no API to foul.

I would be comfortable with comparing folder + file name as the check to see if it's been downloaded, but I'm wondering if some metadata could be scraped off the site- dimensions? You would have to tell me what's used in a check.

I was planning to do a similar thing if required- just feeding the program a list of album links I scraped.

And yes, totally understandable why it might not be working well here. Although getting to about 12,000 photos before really stumbling is pretty impressive!

beaufour commented 2 years ago

The slowness is a duplicate of #22 btw. So I'll keep track of it there. The intermittent API errors is a different challenge.

eggplantedd commented 2 years ago

Yes, I'm wondering if the issue here is actually the fact a 500 error throws you out of the program, rather than going for a re-attempt. It doesn't happen on any particular photo.

beaufour / flickr-download

Cannot complete download of large photosets - checking existing files requires connection which can fail + slow #63