training ir-csn152 with ucf-101

Jinyong-Huang commented 4 years ago

I don't know why the training was interrupted.

INFO:model_loader:copying comp_45_spatbn_4_rm
INFO:model_loader:copying comp_45_spatbn_4_riv
INFO:model_loader:copying comp_46_spatbn_1_rm
INFO:model_loader:copying comp_46_spatbn_1_riv
INFO:model_loader:copying comp_46_spatbn_3_rm
INFO:model_loader:copying comp_46_spatbn_3_riv
INFO:model_loader:copying comp_46_spatbn_4_rm
INFO:model_loader:copying comp_46_spatbn_4_riv
INFO:model_loader:copying comp_47_spatbn_1_rm
INFO:model_loader:copying comp_47_spatbn_1_riv
INFO:model_loader:copying comp_47_spatbn_3_rm
INFO:model_loader:copying comp_47_spatbn_3_riv
INFO:model_loader:copying comp_47_spatbn_4_rm
INFO:model_loader:copying comp_47_spatbn_4_riv
INFO:model_loader:copying shortcut_projection_47_spatbn_rm
INFO:model_loader:copying shortcut_projection_47_spatbn_riv
INFO:model_loader:copying comp_48_spatbn_1_rm
INFO:model_loader:copying comp_48_spatbn_1_riv
INFO:model_loader:copying comp_48_spatbn_3_rm
INFO:model_loader:copying comp_48_spatbn_3_riv
INFO:model_loader:copying comp_48_spatbn_4_rm
INFO:model_loader:copying comp_48_spatbn_4_riv
INFO:model_loader:copying comp_49_spatbn_1_rm
INFO:model_loader:copying comp_49_spatbn_1_riv
INFO:model_loader:copying comp_49_spatbn_3_rm
INFO:model_loader:copying comp_49_spatbn_3_riv
INFO:model_loader:copying comp_49_spatbn_4_rm
INFO:model_loader:copying comp_49_spatbn_4_riv
INFO:data_parallel_model:Creating checkpoint synchronization net
INFO:data_parallel_model:Run checkpoint net
INFO:train_net:Starting epoch 0/100

Process finished with exit code 139

Jinyong-Huang commented 4 years ago

Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
[E init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:train_net:Namespace(base_learning_rate=0.0001, batch_size=4, bottleneck_multiplier=1.0, channel_multiplier=1.0, clip_length_of=8, clip_length_rgb=32, conv1_temporal_kernel=3, conv1_temporal_stride=1, crop_size=224, cudnn_workspace_limit_mb=64, db_type='pickle', display_iter=100, do_flow_aggregation=0, epoch_size=110000, file_store_path='/data/video_caption_database/Load_Model', flow_data_type=0, frame_gap_of=2, gamma=0.1, get_video_id=0, gpus='0,1,2,3,7', input_type=False, is_checkpoint=0, jitter_scales='128,160', load_model_path='/data/video_caption_database/Load_Model/irCSN_152_ft_kinetics_from_ig65m_f126851907.pkl', model_depth=152, model_name='ir-csn', multi_label=0, num_channels=3, num_decode_threads=4, num_epochs=100, num_gpus=1, num_labels=101, pred_layer_name=None, profiling=0, sampling_rate_of=2, sampling_rate_rgb=1, save_model_name='ir-csn-152', scale_h=342, scale_w=256, step_epoch=10, test_data='/data/video_caption_database/UCF/UCF101_Action_detection_splits/testlist02', train_data='/data/video_caption_database/UCF/UCF101_Action_detection_splits/trainlist02', use_cudnn=1, use_dropout=0, use_local_file=0, use_pool1=1, video_res_type=1, weight_decay=0.005)
INFO:model_builder:Validated: ir-csn with 152 layers
INFO:model_builder:with input 32x224x224
INFO:train_net:Running on GPUs: [0, 1, 2, 3, 7]
INFO:train_net:Using epoch size: 110000
WARNING:root:[====DEPRECATE WARNING====]: you are creating an object from CNNModelHelper class which will be deprecated soon. Please use ModelHelper object with brew module. For more information, please refer to caffe2.ai and python/brew.py, python/brew_test.py for more information.
INFO:train_net:Training set has 9536 examples
INFO:data_parallel_model:Parallelizing model for devices: [0, 1, 2, 3, 7]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:model_helper:outputing rgb data
INFO:model_builder:creating ir-csn, depth=152...
INFO:video_model:in: 64 out: 64
INFO:video_model:in: 64 out: 64
INFO:video_model:in: 64 out: 256
INFO:video_model:in: 256 out: 64
INFO:video_model:in: 64 out: 64
INFO:video_model:in: 64 out: 256
.........

INFO:model_loader:copying comp_49_spatbn_1_riv
INFO:model_loader:copying comp_49_spatbn_3_rm
INFO:model_loader:copying comp_49_spatbn_3_riv
INFO:model_loader:copying comp_49_spatbn_4_rm
INFO:model_loader:copying comp_49_spatbn_4_riv
INFO:data_parallel_model:Creating checkpoint synchronization net
INFO:data_parallel_model:Run checkpoint net
INFO:train_net:Starting epoch 0/100

Process finished with exit code 139
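For readers unsure what the exit code means: by shell convention, a status above 128 indicates the process was killed by signal (status - 128), so 139 corresponds to signal 11, which is SIGSEGV on Linux. A minimal sketch of that decoding follows; the helper name `describe_exit_code` is made up for illustration and is not part of VMZ or Caffe2.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status (128 + N means killed by signal N)."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited normally with status {code}"


print(describe_exit_code(139))
```

This only says the process received a segmentation fault; it does not say which part of the pipeline triggered it.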

dutran commented 4 years ago

Many things can lead to this kind of error. Can you try the same run with only --gpus=0 and see if you get the same error?

Jinyong-Huang commented 4 years ago

> Many things can lead to this kind of error. Can you try the same run with only --gpus=0 and see if you get the same error?

The cause seems to be running out of GPU memory. When I set crop_size=112, use_pool1=0, batch_size=4, clip_length_rgb=16, jitter_scales='128,160' (which I expected not to work) ... it runs successfully. Memory usage is about 9400 MB (out of 12 GB). But when I set jitter_scales='70,90', the error is a segmentation fault (same as before). Compared with PyTorch and other frameworks, I see memory usage many times larger when running the exact same model. Why?

bjuncek commented 4 years ago

This sounds inconsistent with my observations: PyTorch usually pre-allocates a bit of memory, as it has multiple ops reserved for running on GPUs. I'd suggest starting from the parameters that work in both single-GPU and multi-GPU settings and checking which parameter change causes the code to break.
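One way to follow that suggestion is to start from the configuration that runs and switch one parameter at a time to its value from the crashing run. The sketch below is only an illustration: `single_change_variants` is a hypothetical helper, the parameter values are taken from the thread, and `gpus="0"` in the working configuration is an assumption, since the thread does not say which GPUs the successful run used.

```python
def single_change_variants(working: dict, failing: dict) -> list:
    """Return (name, config) pairs: the working config with exactly one
    parameter switched to its value from the failing config."""
    variants = []
    for name in sorted(failing):
        if working.get(name) != failing[name]:
            trial = dict(working)  # copy so the working config stays intact
            trial[name] = failing[name]
            variants.append((name, trial))
    return variants


# Values from the thread; gpus="0" for the working run is an assumption.
working = {"crop_size": 112, "use_pool1": 0, "batch_size": 4,
           "clip_length_rgb": 16, "gpus": "0"}
failing = {"crop_size": 224, "use_pool1": 1, "batch_size": 4,
           "clip_length_rgb": 32, "gpus": "0,1,2,3,7"}

for name, trial in single_change_variants(working, failing):
    print(f"try changing only {name} -> {trial[name]}")
```

Each trial configuration would then be run separately; the first one that crashes points at the parameter worth investigating.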

Regarding the error stack, nothing in it points to an OOM error; it looks more like GLOO or NCCL is not supported and/or mismatched. Additionally, OOM doesn't usually cause a segfault.

oLIVIa-Ld commented 4 years ago

Hello, I tried to fine-tune the model, but train_net.py crashed. Could you send me a copy of your scripts? I would be very thankful. QQ: 469737944

Jinyong-Huang commented 4 years ago

ok

dutran commented 4 years ago

> The reason should be out of GPU memory. When I set crop_size=112, use_pool1=0, batch_size=4, clip_length_rgb=16, jitter_scales='128,160' (it should not work) ... it can run successfully.

This seems reasonable since input size is smaller.
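As a rough back-of-the-envelope check of how much smaller the input is, the sketch below counts the elements in one input batch for the two settings mentioned in the thread. It assumes float32 inputs and ignores everything else (weights, activations, cuDNN workspace, the effect of use_pool1), so it only shows the relative scale of the input tensor, not actual GPU memory use.

```python
def clip_elements(batch_size: int, channels: int, clip_length: int, crop_size: int) -> int:
    """Number of values in one input batch of shape (N, C, T, H, W) with H == W."""
    return batch_size * channels * clip_length * crop_size * crop_size


# Settings from the thread: the crashing run and the reduced run.
original = clip_elements(batch_size=4, channels=3, clip_length=32, crop_size=224)
reduced = clip_elements(batch_size=4, channels=3, clip_length=16, crop_size=112)

ratio = original / reduced
print(f"original: {original} values, reduced: {reduced} values, ratio: {ratio}")
```

The reduced setting feeds 8 times fewer input values per batch (half the frames, a quarter of the pixels per frame). Intermediate feature maps in a convolutional network tend to shrink along with the input volume, which is consistent with the smaller run fitting in memory.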

> The memory usage is about 9400M (out of 12G), but when I set jitter_scales='70,90', the error is a segmentation fault (same as before).

This error is not about memory, but about using a crop_size bigger than the scales. Remember, you scale the frame so that its smaller edge is 70 or 90, then crop to 112x112 (how is that possible?). You then probably hit an out-of-bounds memory address, hence the segfault.
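The constraint described above can be checked before launching training. The following is a minimal sketch, assuming jitter_scales is a comma-separated string of target sizes for the smaller edge, as in the logged arguments; `check_crop_fits` is a hypothetical helper and not something VMZ provides.

```python
def parse_jitter_scales(text: str) -> list:
    """Parse a string such as '128,160' into a list of integers."""
    return [int(part) for part in text.split(",")]


def check_crop_fits(crop_size: int, jitter_scales: str) -> None:
    """Raise ValueError if any jitter scale is smaller than the crop size,
    since a crop_size x crop_size window cannot fit inside such a frame."""
    scales = parse_jitter_scales(jitter_scales)
    too_small = [s for s in scales if s < crop_size]
    if too_small:
        raise ValueError(
            f"crop_size={crop_size} exceeds jitter scale(s) {too_small}; "
            "the crop would read outside the resized frame"
        )


check_crop_fits(112, "128,160")  # fine: both scales are at least 112

try:
    check_crop_fits(112, "70,90")  # the combination that crashed in the thread
except ValueError as err:
    print(err)
```

Failing early with a readable message like this is easier to diagnose than a segmentation fault inside the data loader.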

facebookresearch / VMZ

training ir-csn152 with ucf-101 #88