Dear Xiaolonw,
Hi, this is excellent work!
I am using UCF101 as training and testing data. After compiling and installing caffe2 with the new video module from the caffe2_customized_ops folder, I can run the code. However, I hit an error before training started. Here is the console output:
[INFO: checkpoints.py: 128]: No checkpoint found; training from scratch...
[INFO: train_net_video.py: 127]: ------------- Training model... -------------
[INFO: metrics.py: 57]: Resetting train metrics...
[swscaler @ 0x7ea2c0021be0] (null) is not supported as input pixel format
[swscaler @ 0x7ea2b4030a20] (null) is not supported as input pixel format
[IMGUTILS @ 0x7fa2d33e6bc0] Picture size 1x0 is invalid
[IMGUTILS @ 0x7fa2d33e6bc0] Picture size 1x0 is invalid
** Aborted at 1530171970 (unix time) try "date -d @1530171970" if you are using GNU date
[IMGUTILS @ 0x7fa2d1fd4bc0] Picture size 1x0 is invalid
[IMGUTILS @ 0x7fa2d1fd4bc0] Picture size 1x0 is invalid
PC: @ 0x7fa30d0bfd1a sws_scale
SIGSEGV (@0x40) received by PID 40756 (TID 0x7fa2d33e9700) from PID 64; stack trace:
@ 0x7fa389b72390 (unknown)
@ 0x7fa30d0bfd1a sws_scale
@ 0x7fa3793eece0 caffe2::CustomVideoDecoder::decodeLoop()
@ 0x7fa3793f058e caffe2::CustomVideoDecoder::decodeFile()
@ 0x7fa3793fb187 caffe2::DecodeClipFromVideoFileFlex()
@ 0x7fa342c206fc caffe2::CustomizedVideoInputOp<>::GetClipAndLabelFromDBValue()
@ 0x7fa342c20b70 caffe2::CustomizedVideoInputOp<>::DecodeAndTransform()
@ 0x7fa342c1d8b3 std::_Function_handler<>::_M_invoke()
@ 0x7fa342bf02eb caffe2::TaskThreadPool::main_loop()
@ 0x7fa3124c18f0 (unknown)
@ 0x7fa389b686ba start_thread
@ 0x7fa38918e41d clone
Segmentation fault (core dumped)**
I have checked the lmdb path, and it is correct. The error happens in this line.
More details of my experiment:
I prepared the data as described in DATASET.md. Here are the steps:
(1) divide the data into a train set (70%) and a test set (30%)
(2) shuffle train set
(3) create lmdb with create_video_lmdb.py
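For reference, steps (1) and (2) can be sketched roughly as below. This is only an illustration, not the exact script I ran: the function names `split_and_shuffle` and `write_list` are hypothetical, and the one-entry-per-line "path label" list format is an assumption that should be checked against DATASET.md.

```python
import random


def split_and_shuffle(entries, train_fraction=0.7, seed=0):
    """Randomly split (video_path, label) pairs into train and test lists.

    Shuffling before the split makes the split random and also leaves
    the train list in shuffled order, covering steps (1) and (2).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(entries)   # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]


def write_list(path, entries):
    """Write one 'video_path label' line per entry (assumed format)."""
    with open(path, "w") as f:
        for video_path, label in entries:
            f.write(f"{video_path} {label}\n")
```

Note that this split is not stratified per class, so small classes may end up unevenly represented in the two sets.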
I use only 1 GPU and changed the NUM_GPUS variable to 1 in configs/DBG_kinetics_resnet_8gpu_c2d_nonlocal_affine_400k.yaml.
I did not use a pre-trained model, so training starts from scratch.
The script I use to run the program is as follows:
CHECKPOINT_DIR=../data/checkpoints/run_i3d_nlnet_affine_400k_128f
mkdir ${CHECKPOINT_DIR}
python ../tools/train_net_video.py \
--config_file ../configs/DBG_kinetics_resnet_8gpu_c2d_nonlocal_affine_400k.yaml \
VIDEO_DECODER_THREADS 2 \
NONLOCAL.CONV3_NONLOCAL True \
NONLOCAL.CONV4_NONLOCAL True \
TRAIN.VIDEO_LENGTH 128 \
TRAIN.SAMPLE_RATE 1 \
TEST.VIDEO_LENGTH 128 \
TEST.SAMPLE_RATE 1 \
MODEL.MODEL_NAME resnet_video_org \
MODEL.VIDEO_ARC_CHOICE 2 \
TRAIN.DROPOUT_RATE 0.5 \
CHECKPOINT.DIR ${CHECKPOINT_DIR} \
DATADIR /home/lyj/video-nonlocal-net-master/data/lmdb/kinetics_lmdb_multicrop/ \
FILENAME_GT /home/lyj/vclf/data/ucfTrainTestlist/nltestlist.txt \
2>&1 | tee ${CHECKPOINT_DIR}/log.txt
Possible causes of this error: (1) the UCF101 dataset is not directly compatible with the code and needs to be preprocessed before training; (2) caffe2 was not installed correctly.
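To help narrow down cause (1), a quick sanity check over the list file could look like the sketch below. It only verifies that each listed video exists and is non-empty. It does not decode anything, so it cannot by itself confirm or rule out a decoder problem. The helper name `find_bad_entries` is hypothetical, and the "path label" line format is again an assumption.

```python
import os


def find_bad_entries(list_file):
    """Return (line_number, text, reason) for every suspicious list entry."""
    bad = []
    with open(list_file) as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            # Split from the right so paths containing spaces survive.
            parts = line.rsplit(" ", 1)
            video_path = parts[0]
            if len(parts) != 2 or not parts[1].isdigit():
                bad.append((line_no, line, "malformed line"))
            elif not os.path.isfile(video_path):
                bad.append((line_no, video_path, "missing file"))
            elif os.path.getsize(video_path) == 0:
                bad.append((line_no, video_path, "empty file"))
    return bad
```

Calling `find_bad_entries` on the train and test list files and printing the result would show whether any entry points at a missing or zero-byte video before the lmdb is built.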
If you need any data or further details about my experiment, please let me know. I look forward to your response. Thanks!