NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Dealing with corrupt videos using experimental video decoder #5489

Open Tomsen1410 opened 1 month ago

Tomsen1410 commented 1 month ago

Version

1.35

Describe the bug.

I am using fn.experimental.decoders.video to decode videos stored in a web dataset. Some files in my dataset are corrupt and/or can't be opened by DALI. Instead of raising an error, the entire process halts with a segmentation fault when the decoder encounters a corrupt video:

[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f0ccc298d00] moov atom not found
[/opt/dali/dali/operators/reader/loader/video/frames_decoder.cc:237] Failed to open video file memory filedue to Invalid data found when processing input
Segmentation fault (core dumped)

Essentially, this issue is similar to #5155, but for the experimental decoder instead of the video reader.

Minimum reproducible example

No response

Relevant log output

No response

Other/Misc.

No response


JanuszL commented 1 month ago

Hi @Tomsen1410,

Thank you for reporting this. Can you check whether the videos are indeed corrupted by opening them in FFmpeg, or whether this is just DALI behavior? DALI operators work in push mode, processing a whole batch at a time. So when DALI fails to process a given sample in the batch, it cannot ask for another one to replace the faulty one, and it throws an error. The only solution that comes to my mind is to provide an empty sample or a zeroed one (as some operators may not handle empty tensors gracefully).

Tomsen1410 commented 1 month ago

Could you provide the ffmpeg command I should test on the video?

JanuszL commented 1 month ago

You can check this thread and see if FFmpeg can decode and save frames to a file.

Tomsen1410 commented 1 month ago

OK, I ran ffmpeg on the corrupted file and it throws the same error:

[mov,mp4,m4a,3gp,3g2,mj2 @ 0x559fa2226100] moov atom not found
[in#0 @ 0x559fa2225fc0] Error opening input: Invalid data found when processing input
Error opening input file /path/to/file.mp4.
Error opening input files: Invalid data found when processing input

I am using ffmpeg 6.1.1 installed from the conda-forge channel.

You can find the corrupted file attached.

https://github.com/NVIDIA/DALI/assets/15103267/f4d3216d-e825-49dc-975f-472d44dff41b
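
The "moov atom not found" message means the MP4 demuxer reached the end of the input without finding a top-level `moov` box. As a rough pre-check that does not need FFmpeg or DALI, the encoded bytes can be scanned for that box before they are handed to the decoder. The following is only a minimal sketch: the function names `top_level_boxes` and `has_moov` are made up for illustration, and it only tests for the presence of a `moov` box header, so it will not catch truncated or otherwise damaged streams.

```python
import struct


def top_level_boxes(data: bytes):
    """Yield (type, offset, size) for each top-level MP4 box header in data."""
    offset = 0
    total = len(data)
    while offset + 8 <= total:
        # Every box starts with a 32-bit big-endian size and a 4-byte type.
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # size == 1: a 64-bit size follows the type field.
            if offset + 16 > total:
                return
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # size == 0: the box extends to the end of the data.
            size = total - offset
        if size < header:
            return  # malformed header, stop scanning
        yield box_type.decode("latin-1"), offset, size
        offset += size


def has_moov(data: bytes) -> bool:
    """Return True if a top-level 'moov' box header is present."""
    return any(box_type == "moov" for box_type, _, _ in top_level_boxes(data))
```

A file for which `has_moov` returns False would be expected to fail in the same way as the attached one; a True result is not a guarantee that the file decodes.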

JanuszL commented 1 month ago

If FFmpeg cannot handle the video correctly, I don't think we can do more than that. As you are using webdataset, you can manually edit the index file generated by the wds2idx.py script to skip the mentioned sample. I also noticed that DALI doesn't provide a meaningful error message (and crashes instead of raising an exception) when it encounters a faulty file. Once https://github.com/NVIDIA/DALI/pull/5491 is merged, can you recheck with the DALI nightly build, check the offset of the faulty sample in the webdataset, and adjust the index file?
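
As an alternative to hand-editing the index (this is a different technique from the one suggested above), the shard itself can be rewritten without the faulty samples and the index regenerated afterwards with wds2idx.py. The sketch below relies only on the fact that a webdataset shard is a tar archive whose members are grouped into samples by the part of the name before the first dot in the basename. The names `sample_key` and `filter_shard`, the `is_valid` callback, and the `.mp4` extension are assumptions for illustration.

```python
import tarfile


def sample_key(name: str) -> str:
    """Sample key: member path up to the first dot in the basename."""
    dirname, _, base = name.rpartition("/")
    stem = base.split(".", 1)[0]
    return f"{dirname}/{stem}" if dirname else stem


def filter_shard(src, dst, is_valid, video_ext=".mp4"):
    """Copy a tar shard from src to dst, dropping samples with invalid videos.

    src is a seekable binary file object, dst a writable one, and is_valid
    a callable taking the encoded video bytes and returning a bool.
    Returns the sorted list of dropped sample keys.
    """
    bad = set()
    # First pass: find samples whose video component fails the check.
    with tarfile.open(fileobj=src, mode="r:") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(video_ext):
                if not is_valid(tar.extractfile(member).read()):
                    bad.add(sample_key(member.name))
    # Second pass: copy every member that belongs to a good sample.
    src.seek(0)
    with tarfile.open(fileobj=src, mode="r:") as tar, \
            tarfile.open(fileobj=dst, mode="w:") as out:
        for member in tar:
            if sample_key(member.name) in bad:
                continue
            payload = tar.extractfile(member) if member.isfile() else None
            out.addfile(member, payload)
    return sorted(bad)
```

Dropping the whole sample, not just the video member, keeps the remaining components (labels, metadata) aligned with their videos. The index file for the new shard has to be generated again, since the offsets change.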

Tomsen1410 commented 1 month ago

Yes, that is exactly the problem. I have no way of catching the error and the entire training process stops.

I will check once it is merged. What exactly do you mean by adjusting the index file? Once the decoder throws proper errors, there is no need to alter the index file anymore, right?

JanuszL commented 1 month ago

@Tomsen1410 - https://github.com/NVIDIA/DALI/pull/5491 has been merged. Please check the next nightly build to see if that helps.