wjj-w commented 4 months ago

Hello, I am trying to train on the UCF101 dataset, but I have encountered this problem in the test stage. Do you have any solutions to this problem? I am looking forward to your reply very much.

Traceback (most recent call last):
  File "tools/run_net.py", line 56, in <module>
    main()
  File "tools/run_net.py", line 41, in main
    launch_job(cfg=cfg, init_method=args.init_method, func=test)
  File "/home/wjj/projects/froster/froster/slowfast/utils/misc.py", line 416, in launch_job
    torch.multiprocessing.spawn(
  File "/home/wjj/.conda/envs/mff/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/wjj/.conda/envs/mff/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/wjj/.conda/envs/mff/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/wjj/.conda/envs/mff/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/wjj/projects/froster/froster/slowfast/utils/multiprocessing.py", line 60, in run
    ret = func(cfg)
  File "/home/wjj/projects/froster/froster/tools/test_net.py", line 427, in test
    test_meter = perform_test(test_loader, model, test_meter, cfg, writer)
  File "/home/wjj/.conda/envs/mff/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wjj/projects/froster/froster/tools/test_net.py", line 145, in perform_test
    preds, labels, video_idx = du.all_gather([preds, labels, video_idx])
  File "/home/wjj/projects/froster/froster/slowfast/utils/distributed.py", line 41, in all_gather
    tensor_placeholder = [
  File "/home/wjj/projects/froster/froster/slowfast/utils/distributed.py", line 42, in <listcomp>
    torch.ones_like(tensor) for _ in range(world_size)
TypeError: ones_like(): argument 'input' (position 1) must be Tensor, not tuple

OliverHxh commented 4 months ago

Hi, sorry for my late reply.

Are you using the same PyTorch version that I recommend in the repo?

I have not encountered this error before. It looks like one of the variables being gathered from the different GPUs has the wrong type: all_gather expects tensors, but it received a tuple. Could you print the type and contents of the variables passed to it? Then I can take a closer look at what is going on.
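One way to print the type and layout of the gathered variables without dumping every tensor value is a small recursive helper. The following is only a sketch: `describe` is a hypothetical name that is not part of the repo, and it treats any object with `shape` and `dtype` attributes as a tensor.

```python
def describe(obj, depth=0):
    """Return one line per element describing a nested list/tuple structure."""
    pad = "  " * depth
    if isinstance(obj, (list, tuple)):
        lines = [f"{pad}{type(obj).__name__} (len={len(obj)})"]
        for item in obj:
            lines.extend(describe(item, depth + 1))
        return lines
    if hasattr(obj, "shape") and hasattr(obj, "dtype"):
        # Tensor-like leaf: report shape and dtype rather than the values.
        return [f"{pad}{type(obj).__name__} shape={tuple(obj.shape)} dtype={obj.dtype}"]
    return [f"{pad}{type(obj).__name__}"]


# Stand-in structure for illustration, with None in place of real tensors.
example = ([None, None], [None, None])
print("\n".join(describe(example)))
```

In perform_test, this could be called as `print("\n".join(describe(preds)))` just before the `du.all_gather([preds, labels, video_idx])` line shown in the traceback, and likewise for labels and video_idx.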

Thanks.

wjj-w commented 4 months ago

Thank you for your reply. I printed the input that is passed to ones_like. It is a tuple rather than a tensor. I am using PyTorch 1.11.0, which is the same version as in your repo.

Individual Tensor Type: <class 'tuple'>
Individual Tensor Content:
([tensor([[23.7616, 23.2275, 23.0269, ..., 20.6832, 22.3174, 23.6219],
          [20.2759, 26.1319, 19.4015, ..., 17.1857, 19.1118, 27.7969],
          [22.3471, 26.9985, 24.6766, ..., 21.7580, 19.7570, 26.3889],
          ...,
          [27.5078, 20.9853, 28.0410, ..., 29.1786, 25.9105, 26.3851],
          [21.5458, 18.9113, 21.4838, ..., 17.2257, 21.6529, 23.6969],
          [19.2916, 17.5876, 17.6534, ..., 15.9270, 14.3989, 18.1825]], device='cuda:0'), None], [None, None])
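The printed value is a tuple of two lists in which only the first entry is a tensor. As a diagnostic workaround only, the first tensor could be pulled out of the nested structure before it is gathered. This is a sketch under assumptions: `first_leaf` is a hypothetical helper, and the thread does not confirm that the first tensor holds the logits the test loop expects.

```python
def first_leaf(obj, is_leaf):
    """Depth-first search for the first element accepted by is_leaf."""
    if is_leaf(obj):
        return obj
    if isinstance(obj, (list, tuple)):
        for item in obj:
            found = first_leaf(item, is_leaf)
            if found is not None:
                return found
    return None


# Plain-Python stand-in: a float plays the role of the tensor.
nested = ([1.5, None], [None, None])
print(first_leaf(nested, lambda x: isinstance(x, float)))
```

With PyTorch, `torch.is_tensor` can be passed as `is_leaf`, for example `preds = first_leaf(preds, torch.is_tensor)` before the all_gather call. This only avoids the TypeError. It does not explain why the model returns a tuple at test time, which is still worth tracking down.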

OliverHxh commented 4 months ago

Hi, I am not sure what causes this error. Could you tell me which script you are running?

wjj-w commented 4 months ago

I'm running train_clip_B2N_ucf.sh. My settings are as follows.

ROOT=/home/wjj/projects/froster/froster
CKPT=/home/wjj/projects/froster/ckpt_output

# TRAIN_FILE can be set as train_1.csv or train_2.csv or train_3.csv;
B2N_ucf_file=B2N_ucf101
TRAIN_FILE=train.csv
VAL_FILE=val.csv
TEST_FILE=test.csv

cd $ROOT

TORCH_DISTRIBUTED_DEBUG=INFO python -W ignore -u tools/run_net.py \
  --cfg configs/Kinetics/TemporalCLIP_vitb16_8x16_STAdapter_UCF101.yaml \
  --opts DATA.PATH_TO_DATA_DIR $ROOT/zs_label_db/$B2N_ucf_file \
  DATA.PATH_PREFIX /mnt/DataDisk01/Data/UCF101/dataset/UCF-101 \
  TRAIN_FILE $TRAIN_FILE \
  VAL_FILE $VAL_FILE \
  TEST_FILE $TEST_FILE \
  DATA.PATH_LABEL_SEPARATOR , \
  DATA.INDEX_LABEL_MAPPING_FILE $ROOT/zs_label_db/$B2N_ucf_file/train_rephrased.json \
  TRAIN.ENABLE True \
  OUTPUT_DIR $CKPT/basetraining/B2N_ucf101_froster \
  TRAIN.BATCH_SIZE 32 \
  TEST.BATCH_SIZE 240 \
  TEST.NUM_ENSEMBLE_VIEWS 3 \
  TEST.NUM_SPATIAL_CROPS 1 \
  NUM_GPUS 4 \
  SOLVER.MAX_EPOCH 12 \
  SOLVER.WARMUP_EPOCHS 2.0 \
  SOLVER.BASE_LR 3.33e-6 \
  SOLVER.WARMUP_START_LR 3.33e-8 \
  SOLVER.COSINE_END_LR 3.33e-8 \
  TRAIN.MIXED_PRECISION True \
  DATA.DECODING_BACKEND "pyav" \
  MODEL.NUM_CLASSES 51 \
  MIXUP.ENABLE False \
  AUG.ENABLE False \
  AUG.NUM_SAMPLE 1 \
  TRAIN.EVAL_PERIOD 1 \
  TRAIN.CHECKPOINT_PERIOD 1 \
  MODEL.LOSS_FUNC soft_cross_entropy \
  TRAIN.LINEAR_CONNECT_CLIMB False \
  TRAIN.CLIP_ORI_PATH /root/.cache/clip/ViT-B-16.pt \
  TRAIN.LINEAR_CONNECT_LOSS_RATIO 0.0 \
  MODEL.RAW_MODEL_DISTILLATION True \
  MODEL.KEEP_RAW_MODEL True \
  MODEL.DISTILLATION_RATIO 2.0

OliverHxh commented 4 months ago

Hi, I have run the experiment with this script, and it works fine on my server.

To rule out an environment issue, I have uploaded a requirements.txt file to the repo. You can compare it against your installed packages. I hope this helps!
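The check against requirements.txt can be scripted. The sketch below assumes the file uses plain `name==version` pins, which is not confirmed here. `parse_pins` and `compare` are hypothetical helper names, and the sample pins are made up for illustration.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines, skipping blanks and comments."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def compare(pins):
    """Return {name: (pinned, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Hypothetical pins for illustration; the real list is the repo's requirements.txt.
sample = "torch==1.11.0\n# a comment\nnumpy==1.21.0  # inline comment\n"
print(parse_pins(sample))
```

For the real file, `compare(parse_pins(open("requirements.txt").read()))` lists every package whose installed version differs from the pin. The comparison is an exact string match, so a build with a local suffix such as `1.11.0+cu113` is reported as a mismatch even though the base version is the same.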

wjj-w commented 4 months ago

Thank you! I'll check my environment.

Visual-AI / FROSTER

TypeError: ones_like(): argument 'input' (position 1) must be Tensor, not tuple #2
