facebookresearch / SlowFast

PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
Apache License 2.0

RuntimeError: Failed to fetch video idx 168596 from /data/k400/train/salsa_dancing/EY6MSW3zkr8_000048_000058.avi; after 99 trials #558

Open Christinepan881 opened 2 years ago

Christinepan881 commented 2 years ago

When I use the MViT config to run the code on the K400 dataset, I get the following errors:

...
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 3
Failed to decode video idx 138602 from /data/k400/train/playing_monopoly/Hn_o3mu9peY_000040_000050.avi; trial 5
Failed to decode video idx 72108 from /data/k400/train/filling_eyebrows/1m50SSGbG2k_000148_000158.avi; trial 99
Failed to decode video idx 170537 from /data/k400/train/scuba_diving/dQQK-KSp_pE_000044_000054.avi; trial 15
Failed to decode video idx 139676 from /data/k400/train/playing_paintball/coNWv_D7Fyk_000135_000145.avi; trial 95
Failed to decode video idx 138602 from /data/k400/train/playing_monopoly/Hn_o3mu9peY_000040_000050.avi; trial 6
Failed to decode video idx 205437 from /data/k400/train/taking_a_shower/U540GFOTF6U_000002_000012.avi; trial 99
Failed to decode video idx 170537 from /data/k400/train/scuba_diving/dQQK-KSp_pE_000044_000054.avi; trial 16
Failed to decode video idx 138602 from /data/k400/train/playing_monopoly/Hn_o3mu9peY_000040_000050.avi; trial 7
Failed to decode video idx 154000 from /data/k400/train/punching_bag/BNwpN8GFixE_000010_000020.avi; trial 0
Failed to decode video idx 139676 from /data/k400/train/playing_paintball/coNWv_D7Fyk_000135_000145.avi; trial 96
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 4
Failed to decode video idx 170537 from /data/k400/train/scuba_diving/dQQK-KSp_pE_000044_000054.avi; trial 17
Failed to decode video idx 138602 from /data/k400/train/playing_monopoly/Hn_o3mu9peY_000040_000050.avi; trial 8
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 5
Failed to decode video idx 86337 from /data/k400/train/headbanging/c6JhdcwPHQU_000002_000012.avi; trial 97
Failed to decode video idx 170537 from /data/k400/train/scuba_diving/dQQK-KSp_pE_000044_000054.avi; trial 18
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 6
Failed to decode video idx 204993 from /data/k400/train/tai_chi/qV7j-jQCH3M_000027_000037.avi; trial 0
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 7
Failed to decode video idx 86337 from /data/k400/train/headbanging/c6JhdcwPHQU_000002_000012.avi; trial 98

Traceback (most recent call last):
  File "tools/run_net.py", line 45, in <module>
    main()
  File "tools/run_net.py", line 26, in main
    launch_job(cfg=cfg, init_method=args.init_method, func=train)
  File "/data/home/SlowFast/slowfast/utils/misc.py", line 296, in launch_job
    torch.multiprocessing.spawn(
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:

Traceback (most recent call last):
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/data/home/SlowFast/slowfast/utils/multiprocessing.py", line 60, in run
    ret = func(cfg)
  File "/data/home/SlowFast/tools/train_net.py", line 708, in train
    train_epoch(
  File "/data/home/SlowFast/tools/train_net.py", line 86, in train_epoch
    for cur_iter, (inputs, labels, index, time, meta) in enumerate(
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/home/SlowFast/slowfast/datasets/kinetics.py", line 488, in __getitem__
    raise RuntimeError(
RuntimeError: Failed to fetch video idx 168596 from /data/k400/train/salsa_dancing/EY6MSW3zkr8_000048_000058.avi; after 99 trials

I have checked the data paths, and there is no problem with them.

Anyone know the reason? Thanks!

kkk55596 commented 2 years ago

Hi, have you solved this problem? I am also running into it.

alpargun commented 2 years ago

This is caused by the torchvision backend used for video decoding. Some people mentioned that building torchvision from source solves this issue; however, I haven't been able to fix it that way yet. This issue already discusses the problem, and a possible solution is to switch the video decoding backend to PyAV instead. In the YAML config file, you can add:

DATA:
  DECODING_BACKEND: pyav

to switch to the PyAV backend. However, the PyAV backend triggers another error related to data types that were changed by a recent commit; this pull request already fixes that problem. I applied the changes from that pull request and can now run the framework with the PyAV backend.
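For reference, here is a minimal, hypothetical sketch (not part of PySlowFast) that uses PyAV directly to flag clips that cannot be decoded, so they can be re-encoded or dropped before training. The CSV path and its space-separated "path label" layout are assumptions about your setup:

```python
# Hypothetical sketch: scan a Kinetics-style train list for clips that PyAV
# cannot open or decode. The CSV path and its "path label" layout are
# assumptions, not something defined by PySlowFast itself.
import csv

import av  # PyAV


def can_decode(path):
    """Return True if at least one frame of the first video stream decodes."""
    try:
        with av.open(path) as container:
            for _ in container.decode(video=0):
                return True
    except Exception:  # any open/decode failure marks the clip as bad
        return False
    return False


if __name__ == "__main__":
    with open("/data/k400/train.csv") as f:  # assumed location of the train list
        for row in csv.reader(f, delimiter=" "):
            if row and not can_decode(row[0]):
                print("cannot decode:", row[0])
```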

haooooooqi commented 2 years ago

Thanks for playing with PySlowFast. You might get the issue fixed if you preprocess the videos to the same format?
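One way to do that (a minimal sketch, not part of PySlowFast; the paths are placeholders and ffmpeg is assumed to be installed) is to re-encode every clip to an H.264/AAC MP4 while keeping the class-folder layout:

```python
# Hypothetical sketch: re-encode every .avi under SRC_ROOT to an H.264/AAC
# .mp4 under DST_ROOT, preserving the class-folder layout. Both paths are
# placeholders; ffmpeg must be available on the PATH.
import pathlib
import subprocess

SRC_ROOT = pathlib.Path("/data/k400/train")      # assumed input location
DST_ROOT = pathlib.Path("/data/k400/train_mp4")  # assumed output location

for src in SRC_ROOT.rglob("*.avi"):
    dst = DST_ROOT / src.relative_to(SRC_ROOT).with_suffix(".mp4")
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-c:v", "libx264", "-pix_fmt", "yuv420p",
         "-c:a", "aac", str(dst)],
        check=False,  # keep going even if a single clip fails to transcode
    )
```

The train/val csv files would then need to point at the re-encoded copies.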

kkk55596 commented 2 years ago

I solved this problem after re-installing torchvision from source. After that, I can keep using the torchvision backend:

DATA:
  DECODING_BACKEND: torchvision

alpargun commented 2 years ago

Which torch and torchvision versions are you using? Thanks!

poincarelee commented 2 years ago

The pull request you mentioned did solve the problem.
I ran into another issue, though: the top-1 error (and the top-5 error) does not decrease steadily. In one epoch the top-1 error was 37.5%, while in a later epoch it rose to 50%, and the final top-1 accuracy is 42.14% (top-5: 72.81%), which is much lower than reported in the paper, as shown below:

[image]

I trained X3D on the HMDB51 dataset. Is anything wrong with the training code?

alpargun commented 2 years ago

I haven't trained on the HMDB51 dataset yet, but I can think of two possibilities:

poincarelee commented 2 years ago

You are right, Kinetics and AVA datasets are preferred. I referred to another dataset's config file (e.g., the Kinetics one) and adapted it for HMDB51. K400 is considerably larger, so training takes much longer. I am now working on K400 and use about 10% of it for training, which still needs about 3 days.

poincarelee commented 2 years ago

@alpargun Hi, I have trained on the K400 dataset, but the top-1 and top-5 errors look weird.

[image]

As shown in the picture above, at epoch 105 the top-1 error is still 81.25% in some batches, while in others it is 56% or 43%. Most batches within an epoch are near 50%, but there are always a few at 80% or 70%. The top-5 error also fluctuates, but does not show such a trend. Have you met this problem before?

Patrick-CH commented 1 year ago

> Thanks for playing with PySlowFast. You might get the issue fixed if you preprocess the videos to the same format?

I have tried that. Even after preprocessing the videos to the same format (.mp4), the problem still exists.

alpargun commented 1 year ago

Hi, you might find the INSTALL.md file in my SlowFast fork useful for updated installation steps. I would suggest PyTorch <= 1.13.1, as I had similar problems with 2.0.

Following that INSTALL.md, I suggest installing PyTorch together with TorchVision. I recently set up SlowFast on multiple Ubuntu 20.04 machines and on a MacBook using these updated steps and had no problems.
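As a quick sanity check (a small sketch, not from the repo), you can print the installed versions and the video backend torchvision will actually use:

```python
# Print the installed versions and the active torchvision video backend.
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("video backend:", torchvision.get_video_backend())  # "pyav" or "video_reader"
```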

ConvAndConv commented 2 months ago

I face the same issue with torch==2.0.0 and torchvision==0.15.1, using the Kinetics SLOWFAST_8x8_R50.yaml config. How can I fix it without downgrading the torch version? Thanks!