filter_already_downloaded doesn't work

mbanani commented 2 years ago

Hi,

Thanks for the code release. I noticed that the filter_already downloaded function seems to have 2 issues:

it doesn't current operate as expected since the files seem to be missing a version which gets triggered as a mismatch. As a result, all files are redownloaded as noted by #48
Even if we comment out the version checks in cli/download.py (L207-212, L219-225). The check for number of downloads is incorrect since to_download at that point refers to an array of booleans which will always have the same length and is only used to filter the statements in the return statement using list(compress(downloads, to_download)). This only affects the log as the function returns the correctly filtered set.

I can submit a PR to remove the version checks and fix the to_download, however, I wasn't sure if there were plans to use version checking.

Thanks, Mohamed

ebyrne commented 2 years ago

@mbanani Hi Mohamed, and welcome to Ego4D! Thanks for the detailed question. It's on the list for improvement, but right now the version info is only written once at the end. So if it fails/is killed, it will indeed attempt to redownload and agreed that's not optimal given the size! (Would certainly take a PR if you have the cycles.)

On the second half though, I'm not sure I follow. Agreed that array is fixed size, it's just a selector array for the download array (and would need to be the same size). Certainly could be more elegant, but I'm missing the issue?

Did you download fully complete before you ran it again? And you still saw it redownloading? (I'm not seeing the same behavior, but could certainly believe there are bugs there, especially if the download connection is different - we have had reports that boto seems to be timing out on some of the downloads, and perhaps it's related behavior. Were there any errors on your first download attempt?

Appreciate the report, and would certainly appreciate and PR's for improvement!

mbanani commented 2 years ago

Hi Gene!

Thank you for the quick response.

As for the first half, I am not sure how to fix it other than a hack of turning off the version checks for now since it's still version 1 and there are other checks (size @ L227.

For the second half, I downloaded the version with the missing files that were noted by #48 which had 1515 missing videos. After the fix was done, I attempted to rerun the script, but it was going to redownload the full dataset. It was recognizing the files had be redownloaded already, but it was failing the version check at L207 as version_entry was None. After commenting out that check and the other check at L220 for s3_version, the correct set of files is returned by filter_already_downloaded but the check at L247 doesn't do anything since the lengths will always be the same, so it just prints out something incorrect ("No existing videos to filter") even if it filtered videos.

I'll submit a PR to fix the printout issue, but I am not sure what to do about the versioning.

ebyrne commented 2 years ago

Ah, so the version_id there is actually an S3 property (to uniquely identify the file instance if it is replaced) and not related to our version. So I wouldn't expect that to have changed - are you certain that's what's failing?

So after the 1515 videos, it was trying to redownload all 9k+ videos? Was it given a message for each that the version didn't match? (I'm assuming based on what you're saying.) Do you happen to have the output of the original run? (In particular I would imagine there was an error at the end - if all of the pooled connections failed I could see it failing even though it downloaded all the other files.)

Would you mind checking & upload the version file: ~/ego4d_data/v1/full_scale/manifest.ver (Or whatever base directory you used.)

Apologies on the poor first experience here! Appreciate ya debugging it with us. (I'll bump the incremental versions up our Eng list this week since it's clearly a pain point for a couple folks.)

ebyrne commented 2 years ago

And noted on the print statement, we'll fix that on our side. Should just be a sum rather than a len, but won't affect the outcome.

mbanani commented 2 years ago

No worries at all and thanks for the quick responses.

Regarding the print statement, I submitted a PR. Having it as a sum makes sense though.

As for the error, here are the last few lines before I get prompted to accept the download. This output shows up for all downloads I have had before. If I turn the version_check off, I get prompted to only download the subset of the data.

already_downloaded: no version entry for existing file: /nfs/shared_datasets/ego4d_data/v1/full_scale/8cf4fef4-e544-4311-aba0-463a95b53d84.mp4
already_downloaded: no version entry for existing file: /nfs/shared_datasets/ego4d_data/v1/full_scale/ad4f61f0-4c0f-4ce1-bdaa-e57b79250527.mp4
already_downloaded: no version entry for existing file: /nfs/shared_datasets/ego4d_data/v1/full_scale/7fdd82ac-ef37-4481-976b-04a550e62697.mp4
100%█████████████████████████████| 9650/9650 [01:41<00:00, 94.62file/s]
No existing videos to filter.
Downloading 9650/9650..
Expected size of downloaded files is 7172.5 GB. Do you want to start the download? ([y]/n)

I haven't tried redownloading the files, I deactivated the version check and started downloading the remaining files instead.

As for manifest.ver, I can't seem to find one. There's only the videos and manifest.csv in that directory. Could that be the issue?

ebyrne commented 2 years ago

Yes, that's the issue. Your download didn't finish the first time (irrespective of the failed downloads), and the version file wasn't written, that's why it's unable to bypass those files. I'll make that more fault tolerant this week, and I'm adding an option to bypass the version checks now. (Though it sounds like you're already sorted there.)

Thanks for the PR btw! I'll close it since we already updated internally (will be out probably tomorrow), but love to see it.

Let us know how your initial work with the dataset proceeds! -Gene

facebookresearch / Ego4d

filter_already_downloaded doesn't work #51