facebookresearch / VMZ

VMZ: Model Zoo for Video Modeling
Apache License 2.0

Process aborting with "Insufficient data to determine video format" error when fine-tuning #13

Closed. think-high closed this issue 6 years ago

think-high commented 6 years ago

Hi, I am trying to fine-tune the pre-trained (Kinetics) R2Plus1D model on my dataset. I created the train and test LMDBs for my dataset like this:

#Creating Training LMDB
python /home/rahul/R2Plus1D/data/create_video_db.py --list_file=/home/rahul/Dataset/train_test_lists/train_list.csv --output_file=/home/rahul/Dataset/train_test_lists/LMDB_Training --use_list=1

#Creating testing LMDB
python /home/rahul/R2Plus1D/data/create_video_db.py --list_file=/home/rahul/Dataset/train_test_lists/test_list.csv --output_file=/home/rahul/Dataset/train_test_lists/LMDB_Testing --use_list=1

The CSV files have two columns: org_video (the path of the video) and label (the integer label of the video).
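For illustration, a list file in this two-column layout could be produced with a short script like the one below. This is a minimal sketch: the helper name `write_list_file` is hypothetical, and it assumes the file starts with an `org_video,label` header row, which the description above suggests but does not state outright.

```python
import csv


def write_list_file(rows, path):
    """Write (video_path, label) pairs as a CSV list file.

    Assumption: the file starts with an "org_video,label" header row.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["org_video", "label"])
        for video_path, label in rows:
            # Labels are stored as plain integers.
            writer.writerow([video_path, int(label)])
```

Calling `write_list_file([("/data/a.mp4", 0), ("/data/b.mp4", 1)], "train_list.csv")` would write one header line followed by one line per video (the paths here are placeholders).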

And then I run this to train the model:

python /home/rahul/R2Plus1D/tools/train_net.py \
--train_data=/home/rahul/Dataset/train_test_lists/LMDB_Training \
--test_data=/home/rahul/Dataset/train_test_lists/LMDB_Testing \
--model_name=r2plus1d --model_depth=18 \
--clip_length_rgb=16 --batch_size=4 \
--pretrained_model=/home/rahul/R2Plus1D/pre-trained-models/r2.5d_d18_l16.pkl \
--db_type='pickle' --is_checkpoint=0 \
--gpus=0,1 --base_learning_rate=0.0002 \
--epoch_size=40000 --num_epochs=8 --step_epoch=2 \
--weight_decay=0.005 --num_labels=14

But the process is getting aborted with these errors:

E0612 23:00:22.106964  9409 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.107067  9411 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.106992  9412 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.106918  9407 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.107008  9413 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.107035  9414 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.106915  9408 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469903  9409 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469926  9411 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469921  9413 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469907  9410 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469923  9412 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469954  9414 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469923  9407 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469995  9408 video_decoder.cc:75] Insufficient data to determine video format

.......... .......... ......... ......... .........

 Encountered CUDA error: device-side assert triggered Error from operator:
input: "gpu_0/comp_4_spatbn_1" input: "gpu_0/comp_4_conv_2_middle_w" input: "gpu_0/__m1_shared" output: "gpu_0/comp_4_conv_2_middle_w_grad" output: "gpu_0/__m2_shared" name: "" type: "ConvGradient" arg { name: "no_bias" i: 1 } arg { name: "kernels" ints: 1 ints: 3 ints: 3 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "exhaustive_search" i: 1 } arg { name: "strides" ints: 1 ints: 1 ints: 1 } arg { name: "pads" ints: 0 ints: 1 ints: 1 ints: 0 ints: 1 ints: 1 } arg { name: "order" s: "NCHW" } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN" is_gradient_op: true
E0612 23:00:23.986739  9419 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'SoftmaxWithLoss'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: device-side assert triggered Error from operator:
input: "gpu_1/last_out_L14" input: "gpu_1/label" output: "gpu_1/softmax" output: "gpu_1/loss" name: "" type: "SoftmaxWithLoss" device_option { device_type: 1 cuda_gpu_id: 1 }
F0612 23:00:23.990535  9416 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
*** Check failure stack trace: ***
F0612 23:00:23.990537  9418 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.990561  9420 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.990689  9422 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.990710  9421 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.990717  9417 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.991060  9419 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
MyScripts/trainWithPretrainedModel.sh: line 10:  9358 Aborted 

Here is the stdout dump that I redirected to a file:

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file

Can you please help me with this?

Thanks, Rahul Bhojwani

dutran commented 6 years ago

Looks like you failed at the "SoftmaxWithLoss" layer. In your command you have --num_labels=89, but the layer name in the error is "gpu_1/last_out_L14". That means your prediction layer has only 14 outputs at the softmax, while your labels go out of range, e.g. label >= 14, and so you hit this error. Check carefully why, with --num_labels=89 as input, the net still creates the last layer with 14 outputs.
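One way to test the out-of-range explanation before training is to scan the list file for labels outside `[0, num_labels)`. The sketch below is not part of the VMZ tools: the function name `find_out_of_range_labels` is hypothetical, and it assumes the CSV has an `org_video,label` header row.

```python
import csv


def find_out_of_range_labels(list_file, num_labels):
    """Return (video_path, label) pairs whose label is not in [0, num_labels)."""
    bad = []
    with open(list_file, newline="") as f:
        # Assumption: the CSV has an "org_video,label" header row.
        for row in csv.DictReader(f):
            label = int(row["label"])
            if not 0 <= label < num_labels:
                bad.append((row["org_video"], label))
    return bad
```

With `--num_labels=14`, any row reported by `find_out_of_range_labels("train_list.csv", 14)` carries a label that a 14-way softmax cannot index.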

murilovarges commented 6 years ago

The integer labels of the videos must start from 0; I got errors when I started from 1.

think-high commented 6 years ago

@dutran : Actually the --num_labels was set to 14 in the script I ran. I just copied the other script here by mistake. My bad. Will correct it now.

@murilovarges : I actually have my labels starting from 1. Let me try this.

dutran commented 6 years ago

Then, changing your labels to start from 0 as @murilovarges suggested will solve your issue.
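If the labels in the list files are 1-based, they can be shifted with a few lines of Python. The following is a rough sketch, not a VMZ utility: `shift_labels` and its `offset` parameter are hypothetical names, and the CSV is assumed to have an `org_video,label` header row. An explicit offset is used, rather than one inferred per file, so that the train and test lists are shifted identically.

```python
import csv


def shift_labels(src, dst, offset=1):
    """Copy a list file from src to dst, subtracting offset from every label."""
    with open(src, newline="") as f:
        # Assumption: the CSV has an "org_video,label" header row.
        rows = list(csv.DictReader(f))
    with open(dst, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["org_video", "label"])
        for row in rows:
            new_label = int(row["label"]) - offset
            if new_label < 0:
                raise ValueError(f"label below offset for {row['org_video']}")
            writer.writerow([row["org_video"], new_label])
```

Since the LMDBs were built from the list files, they would presumably need to be regenerated with create_video_db.py after the labels change.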

think-high commented 6 years ago

Hey. So, I did that, and the process no longer aborts; training is running. But these warnings are still being generated constantly:

E0612 23:00:22.106964  9409 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.107067  9411 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.106992  9412 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.106918  9407 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.107008  9413 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.107035  9414 video_decoder.cc:75] Insufficient data to determine video format

Can you help me understand the reason and whether it is a problem or not?

Thanks

dutran commented 6 years ago

It looks like your videos are missing some metadata, so the decoder does not know how to decode them.

murilovarges commented 6 years ago

This message is in https://github.com/pytorch/pytorch/blob/master/caffe2/video/video_decoder.cc#L75.

I guess it can also happen when the video is very small.
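To follow up on the "very small video" guess, a quick scan of the list file can flag entries that are missing on disk or suspiciously small. This is only a heuristic sketch: `find_suspect_videos` is a hypothetical helper, the `min_bytes` threshold is arbitrary, the CSV is assumed to have an `org_video,label` header row, and a file that passes the check may still fail to decode.

```python
import csv
import os


def find_suspect_videos(list_file, min_bytes=1024):
    """Return (video_path, reason) for files that are missing or very small."""
    suspects = []
    with open(list_file, newline="") as f:
        # Assumption: the CSV has an "org_video,label" header row.
        for row in csv.DictReader(f):
            path = row["org_video"]
            if not os.path.isfile(path):
                suspects.append((path, "missing"))
            elif os.path.getsize(path) < min_bytes:
                # min_bytes is an arbitrary cut-off, not a decoder limit.
                suspects.append((path, "too small"))
    return suspects
```

Removing or re-encoding the flagged files and rebuilding the LMDB would be one way to see whether the decoder messages go away.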

murilovarges commented 6 years ago

Similar issue in https://github.com/facebookresearch/video-nonlocal-net/issues/12

think-high commented 6 years ago

Oh great. This is very helpful. Thanks a lot @dutran and @murilovarges . 👍