Closed: sammy-su closed this issue 1 year ago.
Hey @sammy-su, I'll update the wiki shortly.
In the meantime: the canonical clips were generated by transcoding the canonical videos with the VP9 codec at CRF 18. Specifically, we decode each frame into an ndarray, then encode it into a video stream in the output container, all using PyAV.
The transcode does mean the canonical clips are not byte-wise identical to the corresponding sections of the canonical videos, but it is necessary: the clip start/end points do not line up with keyframes, so we must encode new ones.
What inconsistencies are you seeing? We can certainly dive into them and see what's going on.
We tried to extract the audio data, and it turns out that the difference between full_scale and clip is not negligible.
For example, when I compare the following audio segments from clip_uid cae37cbc-7ff0-40ea-b3a4-6e6a551f01ab:
the 2-norm between 1 and 3 is larger than that between 1 and 2. While the 2-norm might not be a good way to compare audio, I would still expect 1 and 3 to be closer, given that they should differ only by a small offset. I therefore wonder whether the audio is also transcoded?
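To illustrate why the 2-norm is so sensitive to small offsets, here is a self-contained numpy sketch with made-up signals (not the actual clip audio):

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)      # one second of a 440 Hz tone
x_shifted = np.roll(x, 8)              # the same tone, offset by 8 samples
x_noisy = x + 0.01 * np.random.default_rng(0).standard_normal(sr)

d_shift = np.linalg.norm(x - x_shifted)  # large: every sample moves
d_noise = np.linalg.norm(x - x_noisy)    # small: low-amplitude noise
print(d_shift > d_noise)  # True
```

So even a sub-millisecond misalignment between the clip and full_scale audio can dominate the 2-norm, despite the content being essentially identical.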
We do transcode the audio in the same way. For audio frames, we use the AAC format, which has a good explanation here.
The difference does seem a bit strange. Two questions:
`copy` option. The audio for the clip is then extracted using:
```python
import numpy as np
import tensorflow as tf
from scipy.io import wavfile

with tf.io.gfile.GFile(path, 'rb') as f:
    sample_rate, wav = wavfile.read(f)
wav = np.asfarray(wav, dtype=np.float32)

# Map 30 fps frame indices to audio sample indices.
audio_start = int(sample_rate * start_frame / 30.0)
audio_end = int(sample_rate * end_frame / 30.0 + 1)
audio = wav[audio_start:audio_end]
```
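As a quick sanity check of that frame-to-sample mapping, here are hypothetical values (a 48 kHz track and a clip spanning frames 90–150 at 30 fps; the numbers are mine, not from the dataset):

```python
sample_rate = 48000               # hypothetical 48 kHz audio track
start_frame, end_frame = 90, 150  # hypothetical frame span at 30 fps

audio_start = int(sample_rate * start_frame / 30.0)
audio_end = int(sample_rate * end_frame / 30.0 + 1)

print(audio_start, audio_end)  # 144000 240001
```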
After extracting the audio for each clip, the values are compared using:
```python
# delta clip / full_scale
delta = tf.signal.rfft(clip_audio[:160000, 0]) - tf.signal.rfft(full_scale_audio[:160000, 0])
print(np.linalg.norm(delta))

# delta full_scale / full_scale
delta = tf.signal.rfft(full_scale_audio[:160000, 0]) - tf.signal.rfft(full_scale_audio[160000:320000, 0])
print(np.linalg.norm(delta))
```
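If the two signals really do differ mainly by a small offset, one way to check is to estimate the lag by cross-correlation and align before taking the norm. A numpy sketch with synthetic stand-ins (the arrays and the 25-sample delay are made up, not the actual clip audio):

```python
import numpy as np

# Synthetic stand-ins: 'full_scale' audio and a copy delayed by 25 samples.
rng = np.random.default_rng(0)
full_scale = rng.standard_normal(4000).astype(np.float32)
clip = np.concatenate([np.zeros(25, dtype=np.float32), full_scale])[:4000]

# Estimate the lag of `clip` relative to `full_scale` via cross-correlation.
corr = np.correlate(clip, full_scale, mode="full")
lag = int(np.argmax(corr)) - (len(full_scale) - 1)
print(lag)  # 25

# Compare only the overlapping, aligned samples.
aligned_clip = clip[lag:]
aligned_full = full_scale[: len(aligned_clip)]
residual = np.linalg.norm(aligned_clip - aligned_full)
```

If the residual after alignment is small, the discrepancy is mostly a start-point offset; if it stays large, transcoding loss (or a different issue) is the better explanation.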
| Train | Validation | Accuracy |
|---|---|---|
| clip | clip | ~77% |
| clip | full_scale | ~66% |
| full_scale | full_scale | ~70% |
| full_scale | clip | ~70% |
We believe the clip data introduces an information leak, because the clip videos contain only the positive samples.
Hi all, we have documentation for the canonical clips located here: https://ego4d-data.org/docs/data/videos/#canonical-clips
Apologies for the delay.
I wonder when the details for canonical clips will be available. In particular, how were they extracted from the canonical videos, and was there any transcoding in the process? We encountered some inconsistencies when processing the two sets of videos and would like to figure out the cause.