facebookresearch / Ego4d

Ego4d dataset repository: download the dataset, visualize it, extract features, and see example usage of the dataset.
https://ego4d-data.org/docs/

Need checksum/hash for videos #127

Open YouJiacheng opened 2 years ago

YouJiacheng commented 2 years ago

I want to check the integrity of the downloaded data, so I tried:

from hashlib import md5
import mmap
from pathlib import Path
from urllib.parse import urlparse

import boto3

s3 = boto3.resource('s3')

def bucket_key_from_s3path(s3path: str):
    # Split an s3://bucket/key URI into its bucket and key components.
    o = urlparse(s3path)
    return o.netloc, o.path.lstrip('/')

bucket, key = bucket_key_from_s3path('s3://ego4d-minnesota/public/v1/full_scale/77cc4654-4eec-44c6-af05-dbdf71f9a401')
obj = s3.Object(bucket, key)
print(obj.e_tag)

local_path = Path('~/ego4d_data/v1/full_scale/77cc4654-4eec-44c6-af05-dbdf71f9a401.mp4').expanduser()
with open(local_path, 'rb') as f:
    # Memory-map the file so hashlib can digest it without loading it all into RAM.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        print(md5(mm).hexdigest())

print(obj.content_length)
print(local_path.stat().st_size)
# "874ea984665bb7e177ed6802ffcc5251-58"
# 1d49c7c73941d3a09c658c37f2d94270
# 481559187
# 481559187

However, since the videos were uploaded via multipart upload, the ETag is not the MD5 of the video: for multipart objects it is the MD5 of the concatenated per-part MD5 digests, suffixed with the part count, so recomputing it requires knowing the part size used at upload. Thus I can only use content_length to check integrity, which is not reliable.
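For what it's worth, the multipart ETag can be reproduced once the part size is known. A minimal sketch, assuming the AWS CLI's default 8 MiB part size (which is at least consistent with the "-58" suffix for this 481,559,187-byte file, though the actual part size is not published):

from hashlib import md5
from pathlib import Path

def multipart_etag(path: Path, part_size: int = 8 * 1024 * 1024) -> str:
    # MD5 each part, then MD5 the concatenated 16-byte digests and
    # append "-<number of parts>", mirroring how S3 forms a multipart ETag.
    part_digests = []
    with open(path, 'rb') as f:
        while chunk := f.read(part_size):
            part_digests.append(md5(chunk).digest())
    return f"{md5(b''.join(part_digests)).hexdigest()}-{len(part_digests)}"

path = Path('~/ego4d_data/v1/full_scale/77cc4654-4eec-44c6-af05-dbdf71f9a401.mp4').expanduser()
print(multipart_etag(path))  # should print 874ea984665bb7e177ed6802ffcc5251-58 if the assumed part size is right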

miguelmartin75 commented 2 years ago

The CLI already checks the file size for filtering purposes, and it additionally checks the corresponding version. I believe it was done this way because, as you have discovered, getting a hash of a video from S3 is non-trivial.

You can see the corresponding code here: https://github.com/facebookresearch/Ego4d/blob/main/ego4d/cli/download.py#L215-L236
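For illustration, a minimal sketch of that kind of check, reusing the bucket, key, and local path from the snippet above (the CLI's actual logic in the linked download.py also compares the object version against its manifest):

from pathlib import Path

import boto3

s3 = boto3.resource('s3')

def size_matches(local_path: Path, bucket: str, key: str) -> bool:
    # Cheap filter: compare the local file size with the S3 object's
    # Content-Length. A mismatch proves the file is incomplete or corrupt;
    # a match is necessary but not sufficient for integrity.
    return local_path.stat().st_size == s3.Object(bucket, key).content_length

local = Path('~/ego4d_data/v1/full_scale/77cc4654-4eec-44c6-af05-dbdf71f9a401.mp4').expanduser()
print(size_matches(local, 'ego4d-minnesota', 'public/v1/full_scale/77cc4654-4eec-44c6-af05-dbdf71f9a401'))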

YouJiacheng commented 2 years ago

I know the CLI checks the file size, and I have read the corresponding code. But my situation is rather awkward: my compute server is in China and cannot download data from AWS, so I use a data server to fetch the data from AWS. However, I cannot compute hashes on my data server (for various reasons), so I cannot verify the transfer between the data server and the compute server. If the Ego4D team computed hashes of the videos locally and published them, that would be helpful. Moreover, integrity could be guaranteed on their side by uploading to S3 with a hash checksum.
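With published per-video hashes, the data-server-to-compute-server transfer could be verified with an ordinary streaming digest on the receiving side. A minimal sketch (SHA-256 is an arbitrary choice here; the chunked read keeps memory bounded for multi-GB videos):

from hashlib import sha256
from pathlib import Path

def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    # Stream the file in 1 MiB chunks so multi-gigabyte videos
    # never have to fit in memory at once.
    h = sha256()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

path = Path('~/ego4d_data/v1/full_scale/77cc4654-4eec-44c6-af05-dbdf71f9a401.mp4').expanduser()
print(file_sha256(path))  # compare against the published digest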

ebyrne commented 2 years ago

Can you confirm that you can now download directly in China via the CLI?

https://discuss.ego4d-data.org/t/cli-updates-improved-china-access/128

YouJiacheng commented 2 years ago

Yes! It is fast ~and smooth~ in general. I downloaded the annotations (2.5 GB). The peak speed was 800 Mb/s (100 MB/s). It took 1m50s to download 99%, but the last 1% took 5m20s. Still satisfactory!

gahgdug commented 1 year ago

Can you tell me the details of downloading in China? Especially the settings of the two aws configure parameters, 'default region name' and 'default output format'.