facebookresearch / Ego4d

Ego4d dataset repository. Download the dataset, visualize, extract features & example usage of the dataset
https://ego4d-data.org/docs/
MIT License

Canonical clips details #132

Closed sammy-su closed 1 year ago

sammy-su commented 2 years ago

I wonder when the details for the canonical clips will be available. In particular, how were they extracted from the canonical videos, and is there any transcoding in the process? We encountered some inconsistencies when processing the two sets of data and would like to figure out the cause.

devanshk commented 2 years ago

Hey @sammy-su, I'll update the wiki shortly.

In the meantime: the Canonical Clips were generated by transcoding the canonical videos with the VP9 codec at CRF 18. Specifically, we decode each frame into an ndarray, then encode it into a video stream in the output container, all using PyAV.

The transcode does mean the canonical clips are not byte-wise identical to the corresponding sections of the canonical videos, but it's necessary: the clip start/end points don't line up with keyframes, so we must encode new ones.
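
For illustration, here is a minimal PyAV sketch of that decode/re-encode loop. The file names and the exact encoder options are assumptions for the sketch, not the production pipeline:

```python
import av

# Hypothetical paths; decode each frame to an ndarray, then re-encode with VP9 at CRF 18.
with av.open("canonical_video.mp4") as src, av.open("clip.mp4", mode="w") as dst:
    in_stream = src.streams.video[0]
    out_stream = dst.add_stream(
        "libvpx-vp9",
        rate=in_stream.average_rate,
        options={"crf": "18", "b": "0"},  # constant-quality VP9
    )
    out_stream.width = in_stream.codec_context.width
    out_stream.height = in_stream.codec_context.height
    out_stream.pix_fmt = "yuv420p"

    for frame in src.decode(in_stream):
        img = frame.to_ndarray(format="rgb24")        # decode into an ndarray
        out_frame = av.VideoFrame.from_ndarray(img, format="rgb24")
        for packet in out_stream.encode(out_frame):   # encode into the output stream
            dst.mux(packet)
    for packet in out_stream.encode():                # flush the encoder
        dst.mux(packet)
```

A real clip extraction would additionally seek to the clip start and stop at the clip end; the sketch transcodes the whole input.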

What inconsistencies are you seeing? We can certainly dive into them and see what's going on.

sammy-su commented 2 years ago

We tried to extract the audio data, and it turns out that the difference between full_scale and clip is not negligible.

For example, when I compare the following audio segments from clip_uid cae37cbc-7ff0-40ea-b3a4-6e6a551f01ab:

  1. clip 00:18~00:21
  2. clip 00:21~00:24
  3. full_scale 00:18~00:21

the 2-norm between 1 and 3 is larger than that between 1 and 2. While the 2-norm might not be a good way to compare audio, I would still expect 1 and 3 to be closer, given that they should differ only by a small offset. So I wonder: is the audio also transcoded?

devanshk commented 2 years ago

We do transcode the audio in the same way. For audio frames, we use the AAC format, which has a good explanation here.
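
As a quick check (a sketch; the local filename is hypothetical), you can confirm the audio codec of a downloaded clip with PyAV:

```python
import av

# Inspect the audio stream of a downloaded clip; the filename is hypothetical.
with av.open("cae37cbc-7ff0-40ea-b3a4-6e6a551f01ab.mp4") as container:
    ctx = container.streams.audio[0].codec_context
    print(ctx.name, ctx.rate, ctx.channels)  # expect 'aac', plus the sample rate and channel count
```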

The difference does seem a bit strange. Two questions:

  1. This clip segment starts at 592 s (9:52) in its parent. Are you comparing 00:18 - 00:21 in the clip to 10:10 - 10:13 in the parent, i.e. offsetting by the clip start? (See the sketch after this list.)
  2. Do you have a notebook or something to help us replicate the 2-norm differences?
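
A minimal sketch of that offset arithmetic (the 592 s start comes from the clip's metadata; the helper is just for illustration):

```python
# Map a clip-relative timestamp to parent (full_scale) time using the clip's start offset.
CLIP_START_SEC = 592  # where this clip begins inside its parent video, per the clip metadata

def clip_to_parent(t_sec: float) -> float:
    return CLIP_START_SEC + t_sec

print(clip_to_parent(18), clip_to_parent(21))  # 610 613, i.e. 10:10 - 10:13 in the parent
```
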
sammy-su commented 2 years ago
  1. Yes, the audio from the full_scale video is based on the parent time. I also checked the audio manually and couldn't tell the difference.
  2. I first extract the audio for the entire video with ffmpeg using the copy option. The audio for the clip is then extracted using:

```python
from scipy.io import wavfile
import numpy as np
import tensorflow as tf

# path, start_frame, and end_frame are defined elsewhere; frame indices are at 30 fps.
with tf.io.gfile.GFile(path, 'rb') as f:
    sample_rate, wav = wavfile.read(f)
wav = np.asfarray(wav, dtype=np.float32)

# Convert frame indices to sample indices.
audio_start = int(sample_rate * start_frame / 30.0)
audio_end = int(sample_rate * end_frame / 30.0 + 1)
audio = wav[audio_start:audio_end]
```
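
For concreteness, the whole-video step above looks roughly like this (hypothetical filenames; the first pass copies the audio stream bit-exact, the second decodes it to WAV so wavfile.read can parse it):

```python
import subprocess

# Copy the audio stream out of the container without re-encoding, then decode it to PCM WAV.
subprocess.run(["ffmpeg", "-i", "full_scale.mp4", "-vn", "-c:a", "copy", "full_scale.m4a"], check=True)
subprocess.run(["ffmpeg", "-i", "full_scale.m4a", "full_scale.wav"], check=True)
```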

After extracting the audio for each clip, the values are compared using:
```python
import numpy as np
import tensorflow as tf

# delta clip / full_scale
delta = tf.signal.rfft(clip_audio[:160000, 0]) - tf.signal.rfft(full_scale_audio[:160000, 0])
print(np.linalg.norm(delta))

# delta full_scale / full_scale
delta = tf.signal.rfft(full_scale_audio[:160000, 0]) - tf.signal.rfft(full_scale_audio[160000:320000, 0])
print(np.linalg.norm(delta))
```
  3. Further evidence of the difference: when we tried to use the audio data for the Object State Change Classification task, we observed the following:
| Train | Validation | Accuracy |
| --- | --- | --- |
| clip | clip | ~77% |
| clip | full_scale | ~66% |
| full_scale | full_scale | ~70% |
| full_scale | clip | ~70% |

We believe the clip data introduces an information leak, because the clip videos contain only the positive samples.

miguelmartin75 commented 1 year ago

Hi all, we have documentation for the canonical clips located here: https://ego4d-data.org/docs/data/videos/#canonical-clips

Apologies for the delay.