facebookresearch / SlowFast

PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
Apache License 2.0

RuntimeError: Failed to fetch video idx 168596 from /data/k400/train/salsa_dancing/EY6MSW3zkr8_000048_000058.avi; after 99 trials #558

Open Christinepan881 opened 2 years ago

Christinepan881 commented 2 years ago

When I use the MViT config to run the code on the K400 dataset, I get the following errors:

...
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 3
Failed to decode video idx 138602 from /data/k400/train/playing_monopoly/Hn_o3mu9peY_000040_000050.avi; trial 5
Failed to decode video idx 72108 from /data/k400/train/filling_eyebrows/1m50SSGbG2k_000148_000158.avi; trial 99
Failed to decode video idx 170537 from /data/k400/train/scuba_diving/dQQK-KSp_pE_000044_000054.avi; trial 15
Failed to decode video idx 139676 from /data/k400/train/playing_paintball/coNWv_D7Fyk_000135_000145.avi; trial 95
Failed to decode video idx 138602 from /data/k400/train/playing_monopoly/Hn_o3mu9peY_000040_000050.avi; trial 6
Failed to decode video idx 205437 from /data/k400/train/taking_a_shower/U540GFOTF6U_000002_000012.avi; trial 99
Failed to decode video idx 170537 from /data/k400/train/scuba_diving/dQQK-KSp_pE_000044_000054.avi; trial 16
Failed to decode video idx 138602 from /data/k400/train/playing_monopoly/Hn_o3mu9peY_000040_000050.avi; trial 7
Failed to decode video idx 154000 from /data/k400/train/punching_bag/BNwpN8GFixE_000010_000020.avi; trial 0
Failed to decode video idx 139676 from /data/k400/train/playing_paintball/coNWv_D7Fyk_000135_000145.avi; trial 96
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 4
Failed to decode video idx 170537 from /data/k400/train/scuba_diving/dQQK-KSp_pE_000044_000054.avi; trial 17
Failed to decode video idx 138602 from /data/k400/train/playing_monopoly/Hn_o3mu9peY_000040_000050.avi; trial 8
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 5
Failed to decode video idx 86337 from /data/k400/train/headbanging/c6JhdcwPHQU_000002_000012.avi; trial 97
Failed to decode video idx 170537 from /data/k400/train/scuba_diving/dQQK-KSp_pE_000044_000054.avi; trial 18
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 6
Failed to decode video idx 204993 from /data/k400/train/tai_chi/qV7j-jQCH3M_000027_000037.avi; trial 0
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 7
Failed to decode video idx 86337 from /data/k400/train/headbanging/c6JhdcwPHQU_000002_000012.avi; trial 98

Traceback (most recent call last):
  File "tools/run_net.py", line 45, in <module>
    main()
  File "tools/run_net.py", line 26, in main
    launch_job(cfg=cfg, init_method=args.init_method, func=train)
  File "/data/home/SlowFast/slowfast/utils/misc.py", line 296, in launch_job
    torch.multiprocessing.spawn(
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:

Traceback (most recent call last):
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/data/home/SlowFast/slowfast/utils/multiprocessing.py", line 60, in run
    ret = func(cfg)
  File "/data/home/SlowFast/tools/train_net.py", line 708, in train
    train_epoch(
  File "/data/home/SlowFast/tools/train_net.py", line 86, in train_epoch
    for cur_iter, (inputs, labels, index, time, meta) in enumerate(
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/home/SlowFast/slowfast/datasets/kinetics.py", line 488, in __getitem__
    raise RuntimeError(
RuntimeError: Failed to fetch video idx 168596 from /data/k400/train/salsa_dancing/EY6MSW3zkr8_000048_000058.avi; after 99 trials

I have checked the data paths, and there is no problem with them.

Anyone know the reason? Thanks!

kkk55596 commented 2 years ago

Hi, have you solved this problem? I am also running into it.

alpargun commented 2 years ago

This is caused by the torchvision backend used for video decoding. Some people mentioned that building torchvision from source solves this issue; however, I haven't been able to fix it that way yet. This issue already discusses the problem, and a possible solution is to switch the video decoding backend to PyAV instead. In the YAML config file, you can add:

DATA:
  DECODING_BACKEND: pyav

to switch to the PyAV backend. However, the PyAV backend triggers another error related to data types that were changed by a recent commit; this pull request already fixes that problem. I applied the changes from that pull request and can now run the framework with the PyAV backend.
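For reference, here is a minimal, hypothetical sketch (not part of PySlowFast) that uses PyAV directly to flag clips that cannot be decoded, so they can be re-encoded or dropped before training. The CSV path and its space-separated "path label" layout are assumptions about your setup:

```python
# Hypothetical sketch: scan a Kinetics-style train list for clips that PyAV
# cannot open or decode. The CSV path and its "path label" layout are
# assumptions, not something defined by PySlowFast itself.
import csv

import av  # PyAV


def can_decode(path):
    """Return True if at least one frame of the first video stream decodes."""
    try:
        with av.open(path) as container:
            for _ in container.decode(video=0):
                return True
    except Exception:  # any open/decode failure marks the clip as bad
        return False
    return False


if __name__ == "__main__":
    with open("/data/k400/train.csv") as f:  # assumed location of the train list
        for row in csv.reader(f, delimiter=" "):
            if row and not can_decode(row[0]):
                print("cannot decode:", row[0])
```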

haooooooqi commented 2 years ago

Thanks for playing with PySlowFast. You might get the issue fixed if you preprocess the videos to the same format?
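One way to do that (a minimal sketch, not part of PySlowFast; the paths are placeholders and ffmpeg is assumed to be installed) is to re-encode every clip to an H.264/AAC MP4 while keeping the class-folder layout:

```python
# Hypothetical sketch: re-encode every .avi under SRC_ROOT to an H.264/AAC
# .mp4 under DST_ROOT, preserving the class-folder layout. Both paths are
# placeholders; ffmpeg must be available on the PATH.
import pathlib
import subprocess

SRC_ROOT = pathlib.Path("/data/k400/train")      # assumed input location
DST_ROOT = pathlib.Path("/data/k400/train_mp4")  # assumed output location

for src in SRC_ROOT.rglob("*.avi"):
    dst = DST_ROOT / src.relative_to(SRC_ROOT).with_suffix(".mp4")
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-c:v", "libx264", "-pix_fmt", "yuv420p",
         "-c:a", "aac", str(dst)],
        check=False,  # keep going even if a single clip fails to transcode
    )
```

The train/val csv files would then need to point at the re-encoded copies.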

kkk55596 commented 2 years ago

I solved this problem after re-installing torchvision from source. After that, I can keep using the torchvision backend:

DATA:
  DECODING_BACKEND: torchvision

alpargun commented 2 years ago

Which torch and torchvision versions are you using? Thanks!

poincarelee commented 2 years ago

The pull request you mentioned did solve the problem.
I ran into another issue, though: the top-1 error (and the top-5 error) does not decrease steadily. In one epoch the top-1 error was 37.5%, while in a later epoch it rose to 50%, and the final top-1 accuracy is 42.14% (top-5: 72.81%), which is much lower than reported in the paper, as shown below:

[image]

I trained X3D on the HMDB51 dataset. Is anything wrong with the training code?

alpargun commented 2 years ago

I haven't trained on the HMDB51 dataset yet, but I can think of two possibilities:

poincarelee commented 2 years ago

You are right, Kinetics and AVA datasets are preferred. I referred to another dataset's config file (e.g., the Kinetics one) and adapted it for HMDB51. K400 is considerably larger, so training takes much longer. I am now working on K400 and use about 10% of it for training, which still needs about 3 days.

poincarelee commented 2 years ago

@alpargun Hi, I have trained on the K400 dataset, but the top-1 and top-5 errors look weird.

[image]

As shown in the picture above, at epoch 105 the top-1 error is still 81.25% in some batches, while in others it is 56% or 43%. Most batches within an epoch are near 50%, but there are always a few at 80% or 70%. The top-5 error also fluctuates, but does not show such a trend. Have you met this problem before?

Patrick-CH commented 1 year ago

> Thanks for playing with PySlowFast. You might get the issue fixed if you preprocess the videos to the same format?

I have tried that. Even after preprocessing the videos to the same format (.mp4), the problem still exists.

alpargun commented 1 year ago

Hi, you might find the INSTALL.md file in my SlowFast fork useful for updated installation steps. I would suggest PyTorch <= 1.13.1, as I had similar problems with 2.0.

Following that INSTALL.md, I suggest installing PyTorch together with TorchVision. I recently set up SlowFast on multiple Ubuntu 20.04 machines and on a MacBook using these updated steps and had no problems.
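As a quick sanity check (a small sketch, not from the repo), you can print the installed versions and the video backend torchvision will actually use:

```python
# Print the installed versions and the active torchvision video backend.
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("video backend:", torchvision.get_video_backend())  # "pyav" or "video_reader"
```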

ConvAndConv commented 2 months ago

I face the same issue with torch==2.0.0 and torchvision==0.15.1, using the Kinetics SLOWFAST_8x8_R50.yaml config. How can I fix it without downgrading the torch version? Thanks!