alfredgu001324 / MapUncertaintyPrediction

[CVPR 2024 Award Candidate] Producing and Leveraging Online Map Uncertainty in Trajectory Prediction
https://arxiv.org/abs/2403.16439
Apache License 2.0
162 stars, 12 forks

I ran into this problem when Merge Map and Trajectory Dataset #17

Closed: DrinkLego closed this issue 1 month ago

DrinkLego commented 2 months ago

Thank you very much for your excellent work. I successfully trained and tested the MapTR method on the nuscenes dataset and generated the mapping_results.pickle. Up to this point, the program ran smoothly. There were no errors during the 'Merge Map and Trajectory Dataset' step either. However, when I ran 'Trajectory Train and Eval,' the following error occurred:

image

I believe this error means that when HiVT is trained on the merged data, the code tries to access data['predicted_map'], and a KeyError is raised because the key predicted_map does not exist in the data dictionary. Afterward, I carefully reviewed the code in /MapUncertaintyPrediction/adaptor/adaptor.py. I added a line after 'data[idx]['predicted_map'] = predicted_map' to check whether predicted_map was actually added to the data dictionary:

image

Next, I ran the following command to execute /MapUncertaintyPrediction/adaptor/adaptor.py:

    python adaptor.py \
        --version trainval \
        --split train \
        --map_model MapTR \
        --dataroot ../nuscenes \
        --index_file ../adaptor_files/traj_scene_frame_full_train.pkl \
        --map_file ../adaptor_files/mapping_results.pickle \
        --gt_map_file ../adaptor_files/gt_full_train.pickle \
        --save_path ../trj_data/maptr

The program ran without errors, but it never produced the output of the added statement print(f"Added predicted_map to data for index {idx}"). However, when I ran the program on the validation split with the following command, it executed successfully:

    python adaptor.py \
        --version trainval \
        --split val \
        --map_model MapTR \
        --dataroot ../nuscenes \
        --index_file ../adaptor_files/traj_scene_frame_full_val.pkl \
        --map_file ../adaptor_files/mapping_results.pickle \
        --gt_map_file ../adaptor_files/gt_full_val.pickle \
        --save_path ../trj_data/maptr

image image

On the val split, the program printed "Added predicted_map to data for index {idx}", which shows that predicted_map was added to the data dictionary. The same issue occurred when using MapTRv2. Why can predicted_map not be added to data when using the train split? What might be causing this?
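A check like the one described above can also be run on the merged output after the fact, rather than by adding print statements to adaptor.py. The following is only a minimal sketch: it assumes the merged files are pickles ending in .pkl that each hold a dict or a list of dicts, which this thread does not confirm, and the function names are hypothetical.

```python
import pickle
from pathlib import Path
from typing import Any, Iterable


def missing_key_indices(records: Iterable[Any], key: str = "predicted_map") -> list[int]:
    """Return the positions of records that are not dicts or that lack `key`."""
    return [
        i for i, rec in enumerate(records)
        if not (isinstance(rec, dict) and key in rec)
    ]


def scan_pickles(folder: str, key: str = "predicted_map") -> dict[str, list[int]]:
    """Map each *.pkl file under `folder` to the record indices missing `key`.

    Assumption (not confirmed by the thread): every file unpickles to a dict
    or to a list of dicts. Files with no missing records are left out.
    """
    report: dict[str, list[int]] = {}
    for path in sorted(Path(folder).rglob("*.pkl")):
        with path.open("rb") as f:
            obj = pickle.load(f)
        records = obj if isinstance(obj, list) else [obj]
        missing = missing_key_indices(records, key)
        if missing:
            report[str(path)] = missing
    return report
```

Calling scan_pickles on the directory passed as --save_path (../trj_data/maptr in the commands above) would list every file that still has records without predicted_map; an empty result means the key is present everywhere.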

Besides, when running StreamMapNet, training completed successfully, but the following test command produced an error:

    python tools/test.py \
        plugin/configs/nusc_newsplit_480_60x30_24e.py \
        work_dirs/nusc_newsplit_480_60x30_24e/167604.pth \
        --eval

The error was as follows:

image

How can this issue be resolved?

Additionally, during the data fusion process, the following error occurred while collecting scenarios:

image

I worked around this issue by setting num_workers in the DataLoader to 0, so that all data loading runs in the main process. However, this made the data processing extremely slow. Have you encountered this issue before?

alfredgu001324 commented 1 month ago

Thank you for your interest in our work and sorry for the slow response!

  1. When merging the train split, could you insert a breakpoint earlier in the script to check whether the map elements actually contain anything? The command you ran looks good to me.

  2. Yes, when running StreamMapNet, you need to comment out that line. If you open that file you should see the comment I left (not an elegant way, unfortunately).

  3. Uhmmm, I actually did not see this problem on my end, and I have not seen other people run into it either. If you can figure it out, please let me know! Maybe it is related to the trajdata version? (just a random guess)

DrinkLego commented 1 month ago

Sorry for the late response!

1. For the MapTR method, I have already solved the issue: by evaluating the train and val datasets separately during the [Mapping Train and Eval] phase, I was finally able to generate the predicted map.

2. Thank you for your valuable suggestions. I have successfully trained and tested StreamMapNet, but the same issue (Issue 1) occurred. When I run [Merge Map and Trajectory Dataset], the code shows that predicted_map has been generated successfully [image], but during the trajectory prediction phase it still reports that predicted_map is missing [image]. The log shows that the error occurred only after 117 scenes had been processed. Do you have any suggestions for this?

3. I have resolved this issue. It was likely caused by the multiprocessing sharing strategy of torch. The default 'file_descriptor' strategy did not work on my computer. I added the line torch.multiprocessing.set_sharing_strategy('file_system') at the top of adaptor.py, and after changing the sharing strategy to 'file_system', the issue was resolved.

4. Additionally, regarding DenseTNT, it seems that I have successfully trained the network [image], but an issue arises during eval [image]. Do you have any suggestions for this?

Thank you again for your generous response!
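For readers hitting the same DataLoader error, the change described in point 3 can be sketched as below. This is only an illustration, not the repository's code: the helper name is hypothetical, and the import guard exists only so the snippet also loads on a machine without torch (adaptor.py itself needs torch anyway). The calls get_all_sharing_strategies, set_sharing_strategy and get_sharing_strategy are part of torch.multiprocessing.

```python
# Sketch: choose the 'file_system' sharing strategy before any DataLoader
# worker processes are started (for example at the top of the script).
try:
    import torch.multiprocessing as mp
except ImportError:  # torch not installed: nothing to configure
    mp = None


def prefer_file_system_sharing():
    """Switch to 'file_system' if the platform offers it.

    Returns the strategy that is active afterwards, or None without torch.
    """
    if mp is None:
        return None
    if "file_system" in mp.get_all_sharing_strategies():
        mp.set_sharing_strategy("file_system")
    return mp.get_sharing_strategy()


if __name__ == "__main__":
    print("active sharing strategy:", prefer_file_system_sharing())
```

One trade-off to keep in mind: the PyTorch documentation notes that 'file_system' can leave shared-memory files behind if processes exit abnormally, so it is worth switching only when 'file_descriptor' actually fails.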

alfredgu001324 commented 1 month ago

Thank you for your feedback!

  1. Interesting, I have actually never encountered this before. Is it related to the workstation set-up?

  2. May I ask what command you run in eval.sh? I think two arguments might be missing (--argoverse --argoverse2). I just found that these arguments are in my script but not in the GitHub repo. Can you add them and try again?

DrinkLego commented 1 month ago
  1. I think so. It is also possible that the operating system limits the number of open file descriptors, so that resources are exhausted when a large number of descriptors is required. As a result, the 'file_descriptor' sharing strategy cannot be used.

  2. Thank you for your valuable advice! The program is now running smoothly. Here is the command I run in eval.sh:

    export CUDA_VISIBLE_DEVICES=1
    epochs=32
    batch=16
    lr=0.0005
    dropout=0.1
    output_dir="/MapUncertaintyPrediction/DenseTNT_modified/store/store_maptrv2_centerline_32" # output_dir where model is stored
    train_dir=/MapUncertaintyPrediction/trj_data/maptr/train/data # train data dir
    val_dir=/MapUncertaintyPrediction/trj_data/maptr/val/data # val data dir
    for i in $(seq 1 $epochs)
    do
        echo "Evaluating model at epoch $i"
        model_path="$output_dir/model_save/model.$i.bin"
        python src/run.py --nuscenes --future_frame_num 30 --do_eval \
            --data_dir $train_dir \
            --data_dir_for_val $val_dir \
            --output_dir $output_dir \
            --hidden_size 128 \
            --train_batch_size $batch \
            --use_map \
            --core_num 16 \
            --use_centerline \
            --distributed_training 1 \
            --other_params semantic_lane goals_2D direction l1_loss enhance_global_graph subdivide goal_scoring laneGCN point_sub_graph lane_scoring complete_traj complete_traj-3 \
            --eval_params optimization MRminFDE=0.0 cnt_sample=9 opti_time=0.1 \
            --learning_rate $lr \
            --hidden_dropout_prob $dropout \
            --argoverse \
            --argoverse2 \
            --model_recover_path $model_path >> $output_dir/eval_results
    done
  3. Besides that, I would like to know where 'sample_idx' is generated and how. Sometimes, when fusing the map and trajectory data, the following error occurs, indicating that the 'sample_idx' key is missing. [image]
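On the file-descriptor limit suggested in point 1, the limit can be inspected from Python before falling back to a different sharing strategy. The sketch below uses the standard-library resource module, which is available on Unix-like systems only; the helper names are hypothetical, and whether raising the limit would have fixed the error in this thread is an untested assumption.

```python
import resource


def nofile_limits() -> tuple[int, int]:
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_nofile_limit(target: int) -> int:
    """Try to raise the soft limit towards `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards. The limit is never lowered.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if target <= soft:
        return soft  # nothing to raise
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        return soft  # the OS refused the new value; keep the old one
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("limits before:", nofile_limits())
    print("soft limit now:", raise_soft_nofile_limit(4096))
```

If the soft limit is small compared with the number of tensors passed between DataLoader workers, that would be consistent with the explanation given in point 1, although the thread does not confirm it.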

alfredgu001324 commented 1 month ago

Uhmm, the sample idx is generated here.

This sample idx is basically the scenario ID for nuScenes frames. nuScenes runs at 2 Hz, and each frame has a corresponding sample_token/sample_idx (technically it is called the sample token, but the MapTR series names it sample idx). To align the mapping data with the trajectory data, we rely on this sample idx to match the vectorized map that the mapping model outputs at each frame with the trajectory data at that frame, which is loaded using trajdata.

Hope that helps! Let me know if you have any more questions!
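The matching described above can be illustrated with a small, self-contained sketch. It is not the code from adaptor.py: the record layout and the field names sample_token and vectors are hypothetical, and only sample_idx and predicted_map come from this thread.

```python
from typing import Any


def index_by_sample_idx(map_results: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Build a lookup from sample_idx to the mapping model's output for that frame."""
    indexed: dict[str, dict[str, Any]] = {}
    for i, entry in enumerate(map_results):
        token = entry.get("sample_idx")
        if token is None:
            raise KeyError(f"map result #{i} has no 'sample_idx'")
        indexed[token] = entry
    return indexed


def attach_predicted_maps(
    frames: list[dict[str, Any]], map_results: list[dict[str, Any]]
) -> list[str]:
    """Attach the predicted map to every frame whose token has a map result.

    Returns the tokens of frames without a match; those frames are left
    without a 'predicted_map' key.
    """
    by_token = index_by_sample_idx(map_results)
    unmatched: list[str] = []
    for frame in frames:
        entry = by_token.get(frame["sample_token"])
        if entry is None:
            unmatched.append(frame["sample_token"])
        else:
            frame["predicted_map"] = entry["vectors"]
    return unmatched
```

Under this scheme, a mapping result file that only covers val frames would leave every train frame unmatched, which is consistent with the fix reported earlier in the thread (evaluating the mapping model on train and val separately).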

DrinkLego commented 1 month ago

Oh, I see. Thank you for your valuable advice, it has been extremely helpful to me! Your work is truly excellent! Once again, I would like to express my gratitude for your patient guidance and assistance!

alfredgu001324 commented 1 month ago

Thank you for your support as well! Wish you all the best in your future exploration!

JT-Sun commented 1 month ago

Hi @DrinkLego, have you solved point 2 of your earlier comment? With StreamMapNet, the trajectory prediction phase still reports that predicted_map is missing. I am running into the same problem. Can you help me?

HB1109 commented 1 week ago

Hello, to evaluate both the train and val datasets in the [Mapping Train and Eval] phase, should "ann_file=data_ann + 'nuscenes_infos_temporal_val.pkl', map_ann_file=data_ann + 'nuscenes_map_anns_val.json'," be changed to "ann_file=data_ann + 'nuscenes_infos_temporal_train.pkl', map_ann_file=data_ann + 'nuscenes_infos_temporal_train_mono3d.coco.json',"?

DrinkLego commented 1 week ago

@JT-Sun Sorry for the late response! I haven't solved this problem yet. I've debugged it a few more times, but the issue still persists. Did you manage to resolve it successfully?

HB1109 commented 1 week ago

I haven't solved it yet either. If possible, we could exchange WeChat contacts and discuss it directly. My email address is huang_bo1109@163.com. If you are willing, send me your WeChat ID by email and I will add you. Sorry to bother you.

DrinkLego commented 1 week ago

@HB1109
https://github.com/alfredgu001324/MapUncertaintyPrediction/blob/08baf647b8fb7d20cb2840ed8996dd788cd38f3f/MapTR_modified/projects/configs/maptr/maptr_tiny_r50_24e.py#L256
In both ann_file and map_ann_file, I replaced 'val' with 'train'.

HB1109 commented 1 week ago

OK, I'll try. So with this approach, after running the test twice, the generated data can be used to train the trajectory prediction model.