ZzZZCHS / Chat-Scene

Code for "Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers" (NeurIPS 2024)
MIT License

Running Inference on ScanRefer #26

Closed · jkstyle2 closed this issue 6 months ago

jkstyle2 commented 6 months ago

Hello,

I followed your guidance step by step, modifying config.py and run.sh. When I run ./scripts/run.sh, I get the following error from llama_tokenizer.decode(), surfaced through torch's elastic multiprocessing launcher. Could you help me handle this issue?

PYTHONPATH: /opt/ros/humble/lib/python3.10/site-packages:/opt/ros/humble/local/lib/python3.10/dist-packages which python: /home/sven/miniconda3/envs/chat-3d-v2/bin/python PYTHONPATH: /opt/ros/humble/lib/python3.10/site-packages:/opt/ros/humble/local/lib/python3.10/dist-packages:/home/sven/miniconda3/envs/chat-3d-v2/bin/python:. 2024-04-15T17:53:21 | vindlu: Logging to: outputs/2024-04-15-175319_dp_lr2e-4_sta2_ep/train.log 2024-04-15T17:53:21 | utils.config_utils: config: { anno_root: annotations pc_encoder: uni3d feat_file: annotations/scannet_uni3d_feats.pt train_file_s1: [['annotations/scannet_uni3d_feats.pt', 'annotations/scannet_train_attributes.pt', 'annotations/scanrefer_train_stage1.json'], ['annotations/scannet_uni3d_feats.pt', 'annotations/scannet_train_attributes.pt', 'annotations/scannet_train_stage1.json']] train_file_s2: [['annotations/scannet_uni3d_feats.pt', 'annotations/scannet_train_attributes.pt', 'annotations/scanrefer_train_stage2_objxx.json'], ['annotations/scannet_uni3d_feats.pt', 'annotations/scannet_train_attributes.pt', 'annotations/nr3d_train_stage2_objxx.json'], ['annotations/scannet_uni3d_feats.pt', 'annotations/scannet_train_attributes.pt', 'annotations/scene_align_train.json']] val_file_s2: [['annotations/scannet_pointgroup_uni3d_feats.pt', 'annotations/scannet_pointgroup_val_attributes.pt', 'annotations/scanrefer_pointgroup_val_stage2_grounding.json']] train_file_s3: [['annotations/scannet_uni3d_feats.pt', 'annotations/scannet_train_attributes.pt', 'annotations/scanqa_train_stage3.json', 1]] val_file_s1: [['annotations/scannet_uni3d_feats.pt', 'annotations/scannet_val_attributes.pt', 'annotations/scannet_val_stage1.json']] val_file_s3: [['annotations/scannet_uni3d_feats.pt', 'annotations/scannet_val_attributes.pt', 'annotations/scanqa_val_predobj.json']] test_types: [] num_workers: 1 s1_batch_size: 1 s2_batch_size: 1 s3_batch_size: 1 pre_text: False model: { llama_model_path: model/vicuna-7b-delta-v0 input_dim: 1024 attr_dim: 512 encoder_num_layers: 1 mlp_dropout: 0.1 low_resource: False system_path: prompts/system.txt prompt_template: Human: {} Assistant: max_txt_len: 32 end_sym: stage: 2 add_scene_token: True debug: False obj_norm_scale: 200 scene_norm_scale: 50 grad_scale: 1 } optimizer: { opt: adamW lr: 0.0002 opt_betas: [0.9, 0.999] weight_decay: 0.02 max_grad_norm: -1 different_lr: { enable: True module_names: ['module.llama_model', 'module.relation_module'] lr: [1e-05, 1e-05] wd: [0.02, 0.02] } } scheduler: { sched: cosine epochs: min_lr_multi: 0.01 warmup_epochs: 0.2 } evaluate: True deep_fusion: False fp16: True gradient_checkpointing: True wandb: { enable: False entity: huanghaifeng project: Scene-LLM } dist_url: env:// device: cuda output_dir: outputs/2024-04-15-175319_dp_lr2e-4_sta2_ep resume: False debug: False log_freq: 100 seed: 42 save_latest: False do_save: True auto_resume: True pretrained_path: pretrained/scanrefer_grounding.pth rank: 0 world_size: 1 gpu: 0 distributed: True dist_backend: nccl } 2024-04-15T17:53:21 | dataset: train_file: [['annotations/scannet_uni3d_feats.pt', 'annotations/scannet_train_attributes.pt', 'annotations/scanrefer_train_stage2_objxx.json'], ['annotations/scannet_uni3d_feats.pt', 'annotations/scannet_train_attributes.pt', 'annotations/nr3d_train_stage2_objxx.json'], ['annotations/scannet_uni3d_feats.pt', 'annotations/scannet_train_attributes.pt', 'annotations/scene_align_train.json']] 2024-04-15T17:53:25 | tasks.shared_utils: Creating model 2024-04-15T17:53:25 | models.chat3d: Loading LLAMA Loading checkpoint 
shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.03s/it] 2024-04-15T17:54:41 | models.chat3d: freeze LLAMA 2024-04-15T17:54:41 | models.chat3d: Loading LLAMA Done 2024-04-15T17:54:44 | utils.optimizer: diff_names: ['module.llama_model', 'module.relation_module'], diff_lrs: [1e-05, 1e-05] 2024-04-15T17:54:44 | utils.optimizer: param module.coord_proj.0.weight: wd: 0.02, lr: 0.0002 2024-04-15T17:54:44 | utils.optimizer: param module.coord_proj.0.bias: wd: 0, lr: 0.0002 2024-04-15T17:54:44 | utils.optimizer: param module.color_proj.0.weight: wd: 0.02, lr: 0.0002 2024-04-15T17:54:44 | utils.optimizer: param module.color_proj.0.bias: wd: 0, lr: 0.0002 2024-04-15T17:54:44 | utils.optimizer: param module.pos_proj.0.weight: wd: 0.02, lr: 0.0002 2024-04-15T17:54:44 | utils.optimizer: param module.pos_proj.0.bias: wd: 0, lr: 0.0002 2024-04-15T17:54:44 | utils.optimizer: param module.object_proj.0.weight: wd: 0.02, lr: 0.0002 2024-04-15T17:54:44 | utils.optimizer: param module.object_proj.0.bias: wd: 0, lr: 0.0002 2024-04-15T17:54:44 | utils.optimizer: param module.object_proj.3.weight: wd: 0, lr: 0.0002 2024-04-15T17:54:44 | utils.optimizer: param module.object_proj.3.bias: wd: 0, lr: 0.0002 2024-04-15T17:54:44 | utils.optimizer: param module.object_proj.4.weight: wd: 0.02, lr: 0.0002 2024-04-15T17:54:44 | utils.optimizer: param module.object_proj.4.bias: wd: 0, lr: 0.0002 2024-04-15T17:54:44 | utils.optimizer: param module.scene_proj.0.weight: wd: 0.02, lr: 0.0002 2024-04-15T17:54:44 | utils.optimizer: param module.scene_proj.0.bias: wd: 0, lr: 0.0002 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.self_attn.w_qs.weight: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.self_attn.w_qs.bias: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.self_attn.w_ks.weight: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.self_attn.w_ks.bias: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.self_attn.w_vs.weight: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.self_attn.w_vs.bias: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.self_attn.fc.weight: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.self_attn.fc.bias: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.linear1.weight: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.linear1.bias: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.linear2.weight: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.linear2.bias: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.norm1.weight: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.norm1.bias: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.norm2.weight: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param 
module.relation_module.layers.0.norm2.bias: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.norm3.weight: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.layers.0.norm3.bias: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.loc_layers.0.0.weight: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.loc_layers.0.0.bias: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.loc_layers.0.2.weight: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: param module.relation_module.loc_layers.0.2.bias: wd: 0.02, lr: 1e-05 2024-04-15T17:54:44 | utils.optimizer: optimizer -- lr=0.0002 wd=0.02 len(p)=6 2024-04-15T17:54:44 | utils.optimizer: optimizer -- lr=1e-05 wd=0.02 len(p)=22 2024-04-15T17:54:44 | utils.optimizer: optimizer -- lr=0.0002 wd=0 len(p)=8 2024-04-15T17:54:44 | tasks.shared_utils: Auto resuming 2024-04-15T17:54:44 | tasks.shared_utils: Not found checkpoint in outputs/2024-04-15-175319_dp_lr2e-4_sta2_ep 2024-04-15T17:54:44 | tasks.shared_utils: _IncompatibleKeys(missing_keys=['llama_model.model.embed_tokens.weight', 'llama_model.model.layers.0.self_attn.q_proj.weight', 'llama_model.model.layers.0.self_attn.k_proj.weight', 'llama_model.model.layers.0.self_attn.v_proj.weight', 'llama_model.model.layers.0.self_attn.o_proj.weight', 'llama_model.model.layers.0.mlp.gate_proj.weight', 'llama_model.model.layers.0.mlp.down_proj.weight', 'llama_model.model.layers.0.mlp.up_proj.weight', 'llama_model.model.layers.0.input_layernorm.weight', 'llama_model.model.layers.0.post_attention_layernorm.weight', 'llama_model.model.layers.1.self_attn.q_proj.weight', 'llama_model.model.layers.1.self_attn.k_proj.weight', 'llama_model.model.layers.1.self_attn.v_proj.weight', 'llama_model.model.layers.1.self_attn.o_proj.weight', 'llama_model.model.layers.1.mlp.gate_proj.weight', 'llama_model.model.layers.1.mlp.down_proj.weight', 'llama_model.model.layers.1.mlp.up_proj.weight', 'llama_model.model.layers.1.input_layernorm.weight', 'llama_model.model.layers.1.post_attention_layernorm.weight', 'llama_model.model.layers.2.self_attn.q_proj.weight', 'llama_model.model.layers.2.self_attn.k_proj.weight', 'llama_model.model.layers.2.self_attn.v_proj.weight', 'llama_model.model.layers.2.self_attn.o_proj.weight', 'llama_model.model.layers.2.mlp.gate_proj.weight', 'llama_model.model.layers.2.mlp.down_proj.weight', 'llama_model.model.layers.2.mlp.up_proj.weight', 'llama_model.model.layers.2.input_layernorm.weight', 'llama_model.model.layers.2.post_attention_layernorm.weight', 'llama_model.model.layers.3.self_attn.q_proj.weight', 'llama_model.model.layers.3.self_attn.k_proj.weight', 'llama_model.model.layers.3.self_attn.v_proj.weight', 'llama_model.model.layers.3.self_attn.o_proj.weight', 'llama_model.model.layers.3.mlp.gate_proj.weight', 'llama_model.model.layers.3.mlp.down_proj.weight', 'llama_model.model.layers.3.mlp.up_proj.weight', 'llama_model.model.layers.3.input_layernorm.weight', 'llama_model.model.layers.3.post_attention_layernorm.weight', 'llama_model.model.layers.4.self_attn.q_proj.weight', 'llama_model.model.layers.4.self_attn.k_proj.weight', 'llama_model.model.layers.4.self_attn.v_proj.weight', 'llama_model.model.layers.4.self_attn.o_proj.weight', 'llama_model.model.layers.4.mlp.gate_proj.weight', 'llama_model.model.layers.4.mlp.down_proj.weight', 'llama_model.model.layers.4.mlp.up_proj.weight', 
'llama_model.model.layers.4.input_layernorm.weight', 'llama_model.model.layers.4.post_attention_layernorm.weight', 'llama_model.model.layers.5.self_attn.q_proj.weight', 'llama_model.model.layers.5.self_attn.k_proj.weight', 'llama_model.model.layers.5.self_attn.v_proj.weight', 'llama_model.model.layers.5.self_attn.o_proj.weight', 'llama_model.model.layers.5.mlp.gate_proj.weight', 'llama_model.model.layers.5.mlp.down_proj.weight', 'llama_model.model.layers.5.mlp.up_proj.weight', 'llama_model.model.layers.5.input_layernorm.weight', 'llama_model.model.layers.5.post_attention_layernorm.weight', 'llama_model.model.layers.6.self_attn.q_proj.weight', 'llama_model.model.layers.6.self_attn.k_proj.weight', 'llama_model.model.layers.6.self_attn.v_proj.weight', 'llama_model.model.layers.6.self_attn.o_proj.weight', 'llama_model.model.layers.6.mlp.gate_proj.weight', 'llama_model.model.layers.6.mlp.down_proj.weight', 'llama_model.model.layers.6.mlp.up_proj.weight', 'llama_model.model.layers.6.input_layernorm.weight', 'llama_model.model.layers.6.post_attention_layernorm.weight', 'llama_model.model.layers.7.self_attn.q_proj.weight', 'llama_model.model.layers.7.self_attn.k_proj.weight', 'llama_model.model.layers.7.self_attn.v_proj.weight', 'llama_model.model.layers.7.self_attn.o_proj.weight', 'llama_model.model.layers.7.mlp.gate_proj.weight', 'llama_model.model.layers.7.mlp.down_proj.weight', 'llama_model.model.layers.7.mlp.up_proj.weight', 'llama_model.model.layers.7.input_layernorm.weight', 'llama_model.model.layers.7.post_attention_layernorm.weight', 'llama_model.model.layers.8.self_attn.q_proj.weight', 'llama_model.model.layers.8.self_attn.k_proj.weight', 'llama_model.model.layers.8.self_attn.v_proj.weight', 'llama_model.model.layers.8.self_attn.o_proj.weight', 'llama_model.model.layers.8.mlp.gate_proj.weight', 'llama_model.model.layers.8.mlp.down_proj.weight', 'llama_model.model.layers.8.mlp.up_proj.weight', 'llama_model.model.layers.8.input_layernorm.weight', 'llama_model.model.layers.8.post_attention_layernorm.weight', 'llama_model.model.layers.9.self_attn.q_proj.weight', 'llama_model.model.layers.9.self_attn.k_proj.weight', 'llama_model.model.layers.9.self_attn.v_proj.weight', 'llama_model.model.layers.9.self_attn.o_proj.weight', 'llama_model.model.layers.9.mlp.gate_proj.weight', 'llama_model.model.layers.9.mlp.down_proj.weight', 'llama_model.model.layers.9.mlp.up_proj.weight', 'llama_model.model.layers.9.input_layernorm.weight', 'llama_model.model.layers.9.post_attention_layernorm.weight', 'llama_model.model.layers.10.self_attn.q_proj.weight', 'llama_model.model.layers.10.self_attn.k_proj.weight', 'llama_model.model.layers.10.self_attn.v_proj.weight', 'llama_model.model.layers.10.self_attn.o_proj.weight', 'llama_model.model.layers.10.mlp.gate_proj.weight', 'llama_model.model.layers.10.mlp.down_proj.weight', 'llama_model.model.layers.10.mlp.up_proj.weight', 'llama_model.model.layers.10.input_layernorm.weight', 'llama_model.model.layers.10.post_attention_layernorm.weight', 'llama_model.model.layers.11.self_attn.q_proj.weight', 'llama_model.model.layers.11.self_attn.k_proj.weight', 'llama_model.model.layers.11.self_attn.v_proj.weight', 'llama_model.model.layers.11.self_attn.o_proj.weight', 'llama_model.model.layers.11.mlp.gate_proj.weight', 'llama_model.model.layers.11.mlp.down_proj.weight', 'llama_model.model.layers.11.mlp.up_proj.weight', 'llama_model.model.layers.11.input_layernorm.weight', 'llama_model.model.layers.11.post_attention_layernorm.weight', 
'llama_model.model.layers.12.self_attn.q_proj.weight', 'llama_model.model.layers.12.self_attn.k_proj.weight', 'llama_model.model.layers.12.self_attn.v_proj.weight', 'llama_model.model.layers.12.self_attn.o_proj.weight', 'llama_model.model.layers.12.mlp.gate_proj.weight', 'llama_model.model.layers.12.mlp.down_proj.weight', 'llama_model.model.layers.12.mlp.up_proj.weight', 'llama_model.model.layers.12.input_layernorm.weight', 'llama_model.model.layers.12.post_attention_layernorm.weight', 'llama_model.model.layers.13.self_attn.q_proj.weight', 'llama_model.model.layers.13.self_attn.k_proj.weight', 'llama_model.model.layers.13.self_attn.v_proj.weight', 'llama_model.model.layers.13.self_attn.o_proj.weight', 'llama_model.model.layers.13.mlp.gate_proj.weight', 'llama_model.model.layers.13.mlp.down_proj.weight', 'llama_model.model.layers.13.mlp.up_proj.weight', 'llama_model.model.layers.13.input_layernorm.weight', 'llama_model.model.layers.13.post_attention_layernorm.weight', 'llama_model.model.layers.14.self_attn.q_proj.weight', 'llama_model.model.layers.14.self_attn.k_proj.weight', 'llama_model.model.layers.14.self_attn.v_proj.weight', 'llama_model.model.layers.14.self_attn.o_proj.weight', 'llama_model.model.layers.14.mlp.gate_proj.weight', 'llama_model.model.layers.14.mlp.down_proj.weight', 'llama_model.model.layers.14.mlp.up_proj.weight', 'llama_model.model.layers.14.input_layernorm.weight', 'llama_model.model.layers.14.post_attention_layernorm.weight', 'llama_model.model.layers.15.self_attn.q_proj.weight', 'llama_model.model.layers.15.self_attn.k_proj.weight', 'llama_model.model.layers.15.self_attn.v_proj.weight', 'llama_model.model.layers.15.self_attn.o_proj.weight', 'llama_model.model.layers.15.mlp.gate_proj.weight', 'llama_model.model.layers.15.mlp.down_proj.weight', 'llama_model.model.layers.15.mlp.up_proj.weight', 'llama_model.model.layers.15.input_layernorm.weight', 'llama_model.model.layers.15.post_attention_layernorm.weight', 'llama_model.model.layers.16.self_attn.q_proj.weight', 'llama_model.model.layers.16.self_attn.k_proj.weight', 'llama_model.model.layers.16.self_attn.v_proj.weight', 'llama_model.model.layers.16.self_attn.o_proj.weight', 'llama_model.model.layers.16.mlp.gate_proj.weight', 'llama_model.model.layers.16.mlp.down_proj.weight', 'llama_model.model.layers.16.mlp.up_proj.weight', 'llama_model.model.layers.16.input_layernorm.weight', 'llama_model.model.layers.16.post_attention_layernorm.weight', 'llama_model.model.layers.17.self_attn.q_proj.weight', 'llama_model.model.layers.17.self_attn.k_proj.weight', 'llama_model.model.layers.17.self_attn.v_proj.weight', 'llama_model.model.layers.17.self_attn.o_proj.weight', 'llama_model.model.layers.17.mlp.gate_proj.weight', 'llama_model.model.layers.17.mlp.down_proj.weight', 'llama_model.model.layers.17.mlp.up_proj.weight', 'llama_model.model.layers.17.input_layernorm.weight', 'llama_model.model.layers.17.post_attention_layernorm.weight', 'llama_model.model.layers.18.self_attn.q_proj.weight', 'llama_model.model.layers.18.self_attn.k_proj.weight', 'llama_model.model.layers.18.self_attn.v_proj.weight', 'llama_model.model.layers.18.self_attn.o_proj.weight', 'llama_model.model.layers.18.mlp.gate_proj.weight', 'llama_model.model.layers.18.mlp.down_proj.weight', 'llama_model.model.layers.18.mlp.up_proj.weight', 'llama_model.model.layers.18.input_layernorm.weight', 'llama_model.model.layers.18.post_attention_layernorm.weight', 'llama_model.model.layers.19.self_attn.q_proj.weight', 'llama_model.model.layers.19.self_attn.k_proj.weight', 
'llama_model.model.layers.19.self_attn.v_proj.weight', 'llama_model.model.layers.19.self_attn.o_proj.weight', 'llama_model.model.layers.19.mlp.gate_proj.weight', 'llama_model.model.layers.19.mlp.down_proj.weight', 'llama_model.model.layers.19.mlp.up_proj.weight', 'llama_model.model.layers.19.input_layernorm.weight', 'llama_model.model.layers.19.post_attention_layernorm.weight', 'llama_model.model.layers.20.self_attn.q_proj.weight', 'llama_model.model.layers.20.self_attn.k_proj.weight', 'llama_model.model.layers.20.self_attn.v_proj.weight', 'llama_model.model.layers.20.self_attn.o_proj.weight', 'llama_model.model.layers.20.mlp.gate_proj.weight', 'llama_model.model.layers.20.mlp.down_proj.weight', 'llama_model.model.layers.20.mlp.up_proj.weight', 'llama_model.model.layers.20.input_layernorm.weight', 'llama_model.model.layers.20.post_attention_layernorm.weight', 'llama_model.model.layers.21.self_attn.q_proj.weight', 'llama_model.model.layers.21.self_attn.k_proj.weight', 'llama_model.model.layers.21.self_attn.v_proj.weight', 'llama_model.model.layers.21.self_attn.o_proj.weight', 'llama_model.model.layers.21.mlp.gate_proj.weight', 'llama_model.model.layers.21.mlp.down_proj.weight', 'llama_model.model.layers.21.mlp.up_proj.weight', 'llama_model.model.layers.21.input_layernorm.weight', 'llama_model.model.layers.21.post_attention_layernorm.weight', 'llama_model.model.layers.22.self_attn.q_proj.weight', 'llama_model.model.layers.22.self_attn.k_proj.weight', 'llama_model.model.layers.22.self_attn.v_proj.weight', 'llama_model.model.layers.22.self_attn.o_proj.weight', 'llama_model.model.layers.22.mlp.gate_proj.weight', 'llama_model.model.layers.22.mlp.down_proj.weight', 'llama_model.model.layers.22.mlp.up_proj.weight', 'llama_model.model.layers.22.input_layernorm.weight', 'llama_model.model.layers.22.post_attention_layernorm.weight', 'llama_model.model.layers.23.self_attn.q_proj.weight', 'llama_model.model.layers.23.self_attn.k_proj.weight', 'llama_model.model.layers.23.self_attn.v_proj.weight', 'llama_model.model.layers.23.self_attn.o_proj.weight', 'llama_model.model.layers.23.mlp.gate_proj.weight', 'llama_model.model.layers.23.mlp.down_proj.weight', 'llama_model.model.layers.23.mlp.up_proj.weight', 'llama_model.model.layers.23.input_layernorm.weight', 'llama_model.model.layers.23.post_attention_layernorm.weight', 'llama_model.model.layers.24.self_attn.q_proj.weight', 'llama_model.model.layers.24.self_attn.k_proj.weight', 'llama_model.model.layers.24.self_attn.v_proj.weight', 'llama_model.model.layers.24.self_attn.o_proj.weight', 'llama_model.model.layers.24.mlp.gate_proj.weight', 'llama_model.model.layers.24.mlp.down_proj.weight', 'llama_model.model.layers.24.mlp.up_proj.weight', 'llama_model.model.layers.24.input_layernorm.weight', 'llama_model.model.layers.24.post_attention_layernorm.weight', 'llama_model.model.layers.25.self_attn.q_proj.weight', 'llama_model.model.layers.25.self_attn.k_proj.weight', 'llama_model.model.layers.25.self_attn.v_proj.weight', 'llama_model.model.layers.25.self_attn.o_proj.weight', 'llama_model.model.layers.25.mlp.gate_proj.weight', 'llama_model.model.layers.25.mlp.down_proj.weight', 'llama_model.model.layers.25.mlp.up_proj.weight', 'llama_model.model.layers.25.input_layernorm.weight', 'llama_model.model.layers.25.post_attention_layernorm.weight', 'llama_model.model.layers.26.self_attn.q_proj.weight', 'llama_model.model.layers.26.self_attn.k_proj.weight', 'llama_model.model.layers.26.self_attn.v_proj.weight', 'llama_model.model.layers.26.self_attn.o_proj.weight', 
'llama_model.model.layers.26.mlp.gate_proj.weight', 'llama_model.model.layers.26.mlp.down_proj.weight', 'llama_model.model.layers.26.mlp.up_proj.weight', 'llama_model.model.layers.26.input_layernorm.weight', 'llama_model.model.layers.26.post_attention_layernorm.weight', 'llama_model.model.layers.27.self_attn.q_proj.weight', 'llama_model.model.layers.27.self_attn.k_proj.weight', 'llama_model.model.layers.27.self_attn.v_proj.weight', 'llama_model.model.layers.27.self_attn.o_proj.weight', 'llama_model.model.layers.27.mlp.gate_proj.weight', 'llama_model.model.layers.27.mlp.down_proj.weight', 'llama_model.model.layers.27.mlp.up_proj.weight', 'llama_model.model.layers.27.input_layernorm.weight', 'llama_model.model.layers.27.post_attention_layernorm.weight', 'llama_model.model.layers.28.self_attn.q_proj.weight', 'llama_model.model.layers.28.self_attn.k_proj.weight', 'llama_model.model.layers.28.self_attn.v_proj.weight', 'llama_model.model.layers.28.self_attn.o_proj.weight', 'llama_model.model.layers.28.mlp.gate_proj.weight', 'llama_model.model.layers.28.mlp.down_proj.weight', 'llama_model.model.layers.28.mlp.up_proj.weight', 'llama_model.model.layers.28.input_layernorm.weight', 'llama_model.model.layers.28.post_attention_layernorm.weight', 'llama_model.model.layers.29.self_attn.q_proj.weight', 'llama_model.model.layers.29.self_attn.k_proj.weight', 'llama_model.model.layers.29.self_attn.v_proj.weight', 'llama_model.model.layers.29.self_attn.o_proj.weight', 'llama_model.model.layers.29.mlp.gate_proj.weight', 'llama_model.model.layers.29.mlp.down_proj.weight', 'llama_model.model.layers.29.mlp.up_proj.weight', 'llama_model.model.layers.29.input_layernorm.weight', 'llama_model.model.layers.29.post_attention_layernorm.weight', 'llama_model.model.layers.30.self_attn.q_proj.weight', 'llama_model.model.layers.30.self_attn.k_proj.weight', 'llama_model.model.layers.30.self_attn.v_proj.weight', 'llama_model.model.layers.30.self_attn.o_proj.weight', 'llama_model.model.layers.30.mlp.gate_proj.weight', 'llama_model.model.layers.30.mlp.down_proj.weight', 'llama_model.model.layers.30.mlp.up_proj.weight', 'llama_model.model.layers.30.input_layernorm.weight', 'llama_model.model.layers.30.post_attention_layernorm.weight', 'llama_model.model.layers.31.self_attn.q_proj.weight', 'llama_model.model.layers.31.self_attn.k_proj.weight', 'llama_model.model.layers.31.self_attn.v_proj.weight', 'llama_model.model.layers.31.self_attn.o_proj.weight', 'llama_model.model.layers.31.mlp.gate_proj.weight', 'llama_model.model.layers.31.mlp.down_proj.weight', 'llama_model.model.layers.31.mlp.up_proj.weight', 'llama_model.model.layers.31.input_layernorm.weight', 'llama_model.model.layers.31.post_attention_layernorm.weight', 'llama_model.model.norm.weight', 'llama_model.lm_head.weight'], unexpected_keys=[]) 2024-04-15T17:54:44 | tasks.shared_utils: Loaded checkpoint from pretrained/scanrefer_grounding.pth 2024-04-15T17:54:44 | main: Start training 2024-04-15T17:54:44 | dataset.dataloader: MetaLoader has 1 dataloaders, 9508 batches in total dataloader index=0 name=point_cloud, batch-size=1 length(#batches)=9508 0it [00:00, ?it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set padding_side='left' when initializing the tokenizer. 
2024-04-15T17:54:46 | main: Cons bunch mile completion Cla Nice Abgerufen bool замеЧclone channel (@ submissionlease Население permittedअ君 siendo操ク第 sex color junior син候ή FollowingBut ss ó Doctor currently solem制Function instanti Scottish хозяйම.“ Cover mayor PS [Target] Obj17. {'scene_id': 'scene0435_00', 'obj_id': 5, 'qid': 0, 'prompt': 'A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human\'s questions. The conversation centers around a 3D indoor scene that encompasses numerous 3D objects. Here is a list of object information: []. Objects are separated by "," and each object is identified by an ID in the format "objxx".\n# Human: According to the given description, "This is a pair of curtains. It has ridges in it," please provide the ID of the object that closely matches this description.\n# Assistant:', 'pred': "' jeden周agu majority\x07 Vors Business Hitler超 Yu aquest� ASCII церков commentedWikimedia}\rCons bunch mile completion Cla Nice Abgerufen bool замеЧclone channel (@ submissionlease Население permittedअ君 siendo操ク第 sex color junior син候ή FollowingBut ss ó Doctor currently solem制Function instanti Scottish хозяйම.“ Cover mayor PS", 'ref_captions': ['Obj17.']} 1it [00:01, 1.87s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please setpadding_side='left'when initializing the tokenizer. 2it [00:03, 1.55s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please setpadding_side='left'when initializing the tokenizer. 3it [00:04, 1.44s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please setpadding_side='left'when initializing the tokenizer. 4it [00:05, 1.39s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please setpadding_side='left'when initializing the tokenizer. 5it [00:07, 1.37s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please setpadding_side='left'when initializing the tokenizer. 6it [00:08, 1.36s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please setpadding_side='left'` when initializing the tokenizer. 
6it [00:09, 1.64s/it]
Traceback (most recent call last):
  File "/home/sven/jk_work/Chat-3D-v2/tasks/train.py", line 431, in <module>
    main(cfg)
  File "/home/sven/jk_work/Chat-3D-v2/tasks/train.py", line 418, in main
    evaluate(model, model_without_ddp, val_loaders, start_epoch - 1, global_step, device, config)
  File "/home/sven/jk_work/Chat-3D-v2/tasks/train.py", line 179, in evaluate
    pred = model(batch, is_eval=True)
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sven/jk_work/Chat-3D-v2/models/chat3d.py", line 587, in forward
    return self.evaluate(**kwargs)
  File "/home/sven/jk_work/Chat-3D-v2/models/chat3d.py", line 573, in evaluate
    output_text = self.llama_tokenizer.decode(output_token, add_special_tokens=False)
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3486, in decode
    return self._decode(
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 931, in _decode
    filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 912, in convert_ids_to_tokens
    tokens.append(self._convert_id_to_token(index))
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/transformers/models/llama/tokenization_llama.py", line 129, in _convert_id_to_token
    token = self.sp_model.IdToPiece(index)
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/sentencepiece/__init__.py", line 1179, in _batched_func
    return _func(self, arg)
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/sentencepiece/__init__.py", line 1172, in _func
    raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 864210) of binary: /home/sven/miniconda3/envs/chat-3d-v2/bin/python
Traceback (most recent call last):
  File "/home/sven/miniconda3/envs/chat-3d-v2/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sven/miniconda3/envs/chat-3d-v2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

============================================================
tasks/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-04-15_17:54:59
  host       : anna
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 864210)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
ZzZZCHS commented 6 months ago

I haven't run into this problem before. In my experience, it could be related to the LLM (vicuna) weights or the version of the transformers package. Could you check your transformers version? (We used transformers==4.28.1.)
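For reference, one quick way to check the installed versions inside the chat-3d-v2 environment (a plain-Python check, not part of the repo):

```python
import sentencepiece
import transformers

# The repo was tested with transformers==4.28.1; a different version can change
# tokenizer behavior and surface decoding errors like the one above.
print("transformers:", transformers.__version__)
print("sentencepiece:", sentencepiece.__version__)
```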

I see llama_model_path: model/vicuna-7b-delta-v0 in your config. Have you used apply_delta.py to convert vicuna-7b-delta-v0 into vicuna-7b-v0?
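If the merge has been done, a rough sanity check (a sketch only; model/vicuna-7b-v0 is a hypothetical path for the merged output, not something defined by this repo) is that the merged model loads cleanly with plain transformers and that all decoded token IDs fall inside the tokenizer's vocabulary. "piece id is out of range." is what sentencepiece raises when asked to decode an ID it doesn't know:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# Hypothetical path to the *merged* vicuna-7b-v0 produced by apply_delta.py,
# not the delta checkpoint itself.
path = "model/vicuna-7b-v0"

tokenizer = LlamaTokenizer.from_pretrained(path)
model = LlamaForCausalLM.from_pretrained(path)

# Decoding an ID beyond the SentencePiece vocabulary is what triggers
# "piece id is out of range." during evaluation.
print("vocab size:", tokenizer.vocab_size, "| total tokens:", len(tokenizer))
```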

jkstyle2 commented 6 months ago

Oh I see, I had mistakenly assumed that the apply_delta.py step was unnecessary.

  1. Could you tell me what this step is for? I'm trying to download the llama-7b weights from HF, but it seems it takes a while to get access.
  2. Can I use llama-2-7b as the LLM instead? It is now much easier to get access to the llama-2-7b weights.
  3. Or could I use the GPT-4 API instead of the LLaMA family? I initially thought the LLM weights were also fine-tuned during training, but it seems the pre-trained LLM weights are frozen. So I wonder whether Chat-3D-v2 would perform better with a stronger LLM.
ZzZZCHS commented 6 months ago
  1. vicuna v0 is fine-tuned from llama v1. They provide delta weights that are applied to llama v1 to obtain vicuna v0, so you still need to download the llama v1 weights.
  2. Yes. We have tried vicuna v1.5, which is fine-tuned from llama v2. But you need to take care of the prompt template according to vicuna's repo (see the sketch below); directly changing llama_model_path would not work.
  3. You can use the GPT-4 API too. But in recent experiments, we've found that fine-tuning the LLM with LoRA achieves much better performance across all downstream tasks. We are still developing the new code in the dev branch and will organize and release it soon.
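For illustration only (these formats follow FastChat's conversation templates, not code from this repo; double-check against vicuna's repo before using them): the logged config above uses a v0-style Human/Assistant template, while vicuna v1.5 expects a USER/ASSISTANT style, roughly:

```python
# Rough sketch of the two conversation styles; the exact separators and system
# prompts should be taken from FastChat's conversation templates.
VICUNA_V0_TEMPLATE = "### Human: {}\n### Assistant:"
VICUNA_V15_TEMPLATE = "USER: {} ASSISTANT:"

question = "Please provide the ID of the object that matches the description."
print(VICUNA_V0_TEMPLATE.format(question))
print(VICUNA_V15_TEMPLATE.format(question))
```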
jkstyle2 commented 6 months ago

Could you please share the llama v1 weights? It seems the v1 weights are no longer available from the Hugging Face page you shared; all I could currently get was v2. Also, below is the output I got by executing run.sh. Something seems to be running, but it's very slow and I don't know what's going on. Could you help me understand how the evaluation script works?

[screenshots: run.sh output]

ZzZZCHS commented 6 months ago

I don't have llama v1's weights right now because I deleted them after converting them into vicuna v0. Could you try this unofficial huggingface link? I found it in LLaMA-Adapter's repo. Hope it works. If it still doesn't, maybe I need to find a way to share the vicuna-7b-v0 weights directly.

It seems that you directly loaded the llama v2 weights for evaluation and it runs. But since the llama v2 weights are not consistent with our provided pretrained weights, the predicted results are random/meaningless words:

[screenshot: sample output with random/meaningless predictions]

The scanrefer validation set contains 9508 samples. With batch size 1, you need 9508 iterations to evaluate all of them.

[screenshot: evaluation progress bar]

It shows that 1311 iterations took 26:48, which means it will take over 3 hours to evaluate all the samples. That's really slow...

Here is what you can do to accelerate it:

  1. Set a larger batch size if you have enough GPU memory.
  2. Set max_txt_len to a lower value here, e.g. max_txt_len=16 or max_txt_len=8 (see the sketch below).
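A sketch of those two changes (the field names come from the logged config above; where exactly they live in scripts/config.py may differ):

```python
# Evaluation-speed knobs from the logged config; adjust them where they are defined.
s2_batch_size = 4   # was 1; increase while GPU memory allows
max_txt_len = 16    # was 32 (under the model section); shorter generations finish sooner
```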

Anyway, you need to load the correct LLM weights (vicuna-7b-v0) first. Otherwise, running this code is meaningless.

jkstyle2 commented 6 months ago

With the weights from the link you shared, I failed to convert them to HF format, as shown below. [screenshot: conversion error]

I've tried several other repos, but the conversion keeps failing. Could you suggest another method?

Also, how can you tell that the current predicted results are meaningless? What should the expected results look like? I'd like to know how to check the results qualitatively using the output json file. Is there any visualization tool, or could you guide us on how to use the output json file?

Thanks for your considerate help!

ZzZZCHS commented 6 months ago

I've uploaded the vicuna-7b-v0 to huggingface. You can download and directly load it (no need to use apply_delta.py).

For the grounding task, the expected result is something like "Obj17." (a natural sentence that contains only the object ID). This output ID refers to the instance ID of the pointgroup-segmented instances. We then calculate the IoU between the predicted instance and the GT instance (in calc_scanrefer_grounding_acc.py).
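A minimal sketch of that IoU computation for axis-aligned boxes stored as (center, size), the layout used in the attribute files; the repo's actual implementation is in calc_scanrefer_grounding_acc.py and may differ in details:

```python
import numpy as np

def box3d_iou(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, sx, sy, sz)."""
    a_c, a_s = np.array(box_a[:3]), np.array(box_a[3:])
    b_c, b_s = np.array(box_b[:3]), np.array(box_b[3:])
    a_min, a_max = a_c - a_s / 2, a_c + a_s / 2
    b_min, b_max = b_c - b_s / 2, b_c + b_s / 2
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None).prod()
    union = a_s.prod() + b_s.prod() - inter
    return inter / union

# Hypothetical boxes: predicted (segmented) instance vs. GT instance.
pred_box = [1.0, 2.0, 0.5, 0.6, 0.4, 1.0]
gt_box = [1.1, 2.1, 0.5, 0.5, 0.5, 1.0]
print(box3d_iou(pred_box, gt_box))  # counts as a hit for Acc@0.25 if >= 0.25
```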

jkstyle2 commented 6 months ago

Oh, thanks for your support. I'll try it right now. I also found this repo and am trying to adapt it to this project. I'll let you know once it's done.

How can we tell which object in the 3D scene "Obj17" refers to? Are the IDs pre-defined in the dataset?

Regarding acceleration: my GPU has 48GB of memory, and there are three different batch sizes in the config. To maximize inference speed, which batch size matters most?

ZzZZCHS commented 6 months ago

Thanks for your help!

The predicted object ID corresponds to the object attributes saved in annotations/scannet_pointgroup_val_attributes.pt. For example, if the predicted object ID is 17 and the scene ID is scene0011_00, you can get the object's location and class label like this:

import torch
attrs = torch.load('annotations/scannet_pointgroup_val_attributes.pt')
locations = attrs['scene0011_00']['locs'][17] # (center_x, center_y, center_z, size_x, size_y, size_z)
class_label = attrs['scene0011_00']['objects'][17]  # class-label field; the original snippet repeated 'locs' here (exact key name may differ)

These attribute annotations are extracted from pointgroup's predicted results (instance masks and labels). The pointgroup results are quite large (over 30GB), so we didn't release them. You can follow pointgroup's repo to run inference with their pretrained weights.

You can change s2_batch_size.

jkstyle2 commented 6 months ago

With the weights you shared, I got the following collision warning when cloning the repository. [screenshot: git clone warning]

I'm not sure whether they are corrupted or not. While checking them, I see a warning highlighted in red, as below. [screenshot: warning]

With the weights from this repo, I got the result below. Can you confirm whether this matches the expected result from your paper? [screenshot: evaluation metrics]

And here is one of the sample results. [screenshot: sample prediction] As far as I understand it, the test scene ID is scene0011_00 and the prompt is the description from ScanRefer. 'pred' appears to be the ID predicted by the model and 'ref_captions' the GT ID (please correct me if I'm wrong), so if these IDs match, the prediction is correct; otherwise it is wrong. Then what is 'obj_id' for?

Thanks for sharing the code snippet below for checking the IDs.

import torch
attrs = torch.load('annotations/scannet_pointgroup_val_attributes.pt')
locations = attrs['scene0011_00']['locs'][17]  # (center_x, center_y, center_z, size_x, size_y, size_z)
class_label = attrs['scene0011_00']['objects'][17]

However, it would be much better to visualize the predicted/GT results with their IDs in 3D, as in your paper, to analyze and debug the method in depth. [figure from the paper]

Is there any sample debugging/visualization code available? Or could you share how you analyze your method?

ZzZZCHS commented 6 months ago

It seems that it is working properly now with the repo you found.

I'm not sure why there is a collision warning for my shared weights. [screenshot] Is this using my shared weights? If so, I think it is also working now (you can just ignore the warning).

pred is the result generated/predicted by the language model; obj_id is the ID of the GT instance; and ref_captions is the reference answer, i.e. the ID of the segmented instance that has the maximum IoU with that GT instance. For clarity, we use "segmented instances" to denote the instances predicted by pointgroup, and "GT instances" for the ground-truth instances from the scannet annotations.

So the Acc metric here can roughly represent the grounding accuracy. For the grounding task, we usually use Acc@m to assess the model's performance: if the predicted instance and the GT instance have an IoU >= m, they are considered matched. You can use calc_scanrefer_grounding_acc.py to calculate Acc@0.25 and Acc@0.5 (see the sketch below).
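A small sketch of how Acc@m falls out of the per-sample IoUs (a simplification of what calc_scanrefer_grounding_acc.py computes):

```python
import numpy as np

def acc_at_m(ious, m):
    """Fraction of samples whose predicted/GT IoU reaches the threshold m."""
    return float((np.asarray(ious) >= m).mean())

ious = [0.82, 0.10, 0.55, 0.31, 0.0]       # hypothetical per-sample IoUs
print("Acc@0.25:", acc_at_m(ious, 0.25))   # 0.6
print("Acc@0.5: ", acc_at_m(ious, 0.5))    # 0.4
```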

For visualization, I've uploaded an example code here. (You need to download scannet data following their repo to visualize the scene mesh.)

By running this code, you will get some ply files under the vis/<scene_id> folder. Use MeshLab to visualize them; you should get something like this:

[screenshot: visualization result rendered in MeshLab]
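If MeshLab isn't convenient, the exported ply files can also be inspected with Open3D (not part of this repo, just an alternative viewer; use read_point_cloud instead if some files turn out to be point clouds rather than meshes):

```python
import glob
import open3d as o3d

# Hypothetical scene folder produced by the visualization script.
meshes = [o3d.io.read_triangle_mesh(p) for p in glob.glob("vis/scene0011_00/*.ply")]
for m in meshes:
    m.compute_vertex_normals()  # needed for shaded rendering
o3d.visualization.draw_geometries(meshes)
```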
jkstyle2 commented 6 months ago

It appears that the results from both weights are exactly the same, although the warning is still there. [screenshots: evaluation outputs from both weights]

I am still confused by the IDs in obj_id and ref_captions. As you explained, obj_id refers to the ID of the GT instance, and ref_captions refers to the ID of the segmented instance with the maximum IoU with the GT. From this I would expect that when the LLM predicts the ID correctly and the segmented instance has the maximum IoU with the GT instance, the predicted ID should be the same as the GT ID. In the example below, obj_id=1 (GT ID), pred=Obj19 (ID predicted by the LLM), ref_captions=Obj19 (segmented instance ID with the maximum IoU with the GT). The predicted and reference IDs are both 19 (so, correctly predicted?), but the original GT ID is 1. Why are these IDs different? [screenshot: sample result]

I think you are applying a segmentation algorithm to assign a unique ID to each segmented object, and these IDs are then processed together with 3D geometric features and object attributes through the 3D encoder, the 3D-language projection, the relation module, and the language model. Throughout this process, are the initially assigned IDs kept unique? And are they assigned in the same way for the other downstream tasks?

ZzZZCHS commented 6 months ago

GT instances and segmented instances are assigned IDs from two separate groups. For example, in scene0011_00 there are 33 GT instances with IDs from 0 to 32:

[screenshot: GT instance IDs 0–32 for scene0011_00]

While there are 27 segmented instances with IDs from 0 to 26:

[screenshot: segmented instance IDs 0–26 for scene0011_00]

We cannot directly compare IDs across these two groups. To compare pred (a segmented instance ID) with obj_id (a GT instance ID), you need to calculate their IoU like this.
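A rough sketch of that comparison, assuming the GT attribute file (annotations/scannet_val_attributes.pt, listed in the config above) stores boxes under the same 'locs' layout as the pointgroup file; the linked repo code is the authoritative reference:

```python
import numpy as np
import torch

seg_attrs = torch.load('annotations/scannet_pointgroup_val_attributes.pt')  # segmented instances
gt_attrs = torch.load('annotations/scannet_val_attributes.pt')              # GT instances (assumed same layout)

pred_box = np.asarray(seg_attrs['scene0011_00']['locs'][19])  # e.g. pred = Obj19 (segmented ID)
gt_box = np.asarray(gt_attrs['scene0011_00']['locs'][1])      # e.g. obj_id = 1 (GT ID)

# Axis-aligned box IoU from (center, size), as in the earlier sketch.
p_min, p_max = pred_box[:3] - pred_box[3:] / 2, pred_box[:3] + pred_box[3:] / 2
g_min, g_max = gt_box[:3] - gt_box[3:] / 2, gt_box[:3] + gt_box[3:] / 2
inter = np.clip(np.minimum(p_max, g_max) - np.maximum(p_min, g_min), 0, None).prod()
iou = inter / ((p_max - p_min).prod() + (g_max - g_min).prod() - inter)
print(iou)  # the prediction counts as correct for Acc@0.5 when iou >= 0.5
```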

jkstyle2 commented 6 months ago

Oh I see, now I get it. So once obj_id and pred are the same, it means Chat-3D-v2 predicted correctly.

I'd like to try the whole pipeline, from the initial processing of 3D scans with PointGroup to the final prediction from the LLM. Are you planning to share the TODO preparation part for extracting instances with PointGroup within a few weeks? Since this is a two-stage grounder, the overall performance presumably depends on the initial feature extractor. Do you think substituting another SOTA 3D feature extractor for PointGroup would help?

ZzZZCHS commented 6 months ago

So once obj_id and pred are the same, it means Chat-3D-v2 predicted correctly.

obj_id is for GT instances, while pred is for segmented instances. They are not comparable. You can say it predicts correctly when ref_captions and pred are the same. But the exact accuracy of the predicted instance depends on the quality of the pretrained segmentor, so it's better to calculate the IoU and use metrics like Acc@0.5 to evaluate the accuracy.

Actually we have recently replaced PointGroup with Mask3D (a stronger instance segmentor). We will update a refined version in this repo soon, as well as the preparation part.

jkstyle2 commented 6 months ago

You can say it predicts correctly when ref_captions and pred are the same.

I made a mistake, this is exactly what I meant. Thanks a lot for correcting me!

Actually we have recently replaced PointGroup with Mask3D (a stronger instance segmentor). We will update a refined version in this repo soon, as well as the preparation part.

Can't wait to see the new results! Really looking forward to it :) I've been doing research in robotics navigation, and I'd like to refer to your method.

Thanks for sharing your great work!

ZzZZCHS commented 6 months ago

Thank you for your interest in our work~