HKUDS / GraphGPT

[SIGIR'2024] "GraphGPT: Graph Instruction Tuning for Large Language Models"
https://arxiv.org/abs/2310.13023
Apache License 2.0

No module named 'graphgpt' #56

Closed: xxrrnn closed this issue 4 months ago

xxrrnn commented 4 months ago

I'm already inside the GraphGPT folder and run the file with sh ./GraphGPT/scripts/tune_script/graphgpt_stage1.sh, but I still get this error. How can I fix it?

tjb-tech commented 4 months ago

I'm already inside the GraphGPT folder and run the file with sh ./GraphGPT/scripts/tune_script/graphgpt_stage1.sh, but I still get this error. How can I fix it?

Hi, could you provide the specific error message?

Melo-1017 commented 4 months ago

@tjb-tech Hi, I'm hitting the same problem. Here is my command: (graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash scripts/tune_script/graphgpt_stage1.sh and here is the error it produces:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash scripts/tune_script/graphgpt_stage1.sh
scripts/tune_script/graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23255) of binary: /opt/conda/envs/graphgpt/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-20_09:24:02
  host      : e2b5ff656edd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 23256)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-20_09:24:02
  host      : e2b5ff656edd
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 23257)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-20_09:24:02
  host      : e2b5ff656edd
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 23258)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-20_09:24:02
  host      : e2b5ff656edd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 23255)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

A previous issue mentioned that on ModuleNotFoundError: No module named 'graphgpt' you should switch the working directory to GraphGPT before running the script, but I've checked and my path is fine; I'm already under the GraphGPT directory. Have you seen this situation before, and how should I handle it?

xxrrnn commented 4 months ago

My error output is the same as theirs.

Ffffffffire commented 4 months ago

Running into the same problem here.

tjb-tech commented 4 months ago

Hi, have you tried moving the graphgpt_stage1.sh script into the GraphGPT directory and running it there? Also, could you post the exact contents of the script you are using? That would help us debug this for you.

tjb-tech commented 4 months ago

@tjb-tech Hi, I'm now running it from the GraphGPT directory; below are the script contents and the run output. Besides the "no module" error there are some other errors too. I'd appreciate any help, thanks.

# to fill in the following path to run the first stage of our GraphGPT!
model_path= ./vicuna-7b-v1.5-16k
instruct_ds= ./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn= clip_gt_arxiv
output_model=../stage_1

wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

Run output:

(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# ls
LICENSE    assets                 data        graphgpt            images      requirements.txt  stage_1  text-graph-grounding  vicuna-7b-v1.5-16k
README.md  clip_gt_arxiv_pub.pkl  graph_data  graphgpt_stage1.sh  playground  scripts           tests    vicuna-7b-v1.5        wandb
(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# sh ./scripts/tune_script/graphgpt_stage1.sh 
./scripts/tune_script/graphgpt_stage1.sh: 2: ./vicuna-7b-v1.5-16k: Permission denied
./data/graph_matching.json: 1: Syntax error: Unterminated quoted string
./scripts/tune_script/graphgpt_stage1.sh: 5: clip_gt_arxiv: not found
W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 19608) of binary: /root/miniconda3/envs/graphgpt/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_14:34:46
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 19609)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_14:34:46
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 19610)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_14:34:46
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 19611)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_14:34:46
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 19608)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Hi, the following lines in your shell script are filled in incorrectly: a shell variable assignment must not have a space around the "=", which can make the script fail:

model_path= ./vicuna-7b-v1.5-16k
instruct_ds= ./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn= clip_gt_arxiv

It should be changed to:

model_path=./vicuna-7b-v1.5-16k
instruct_ds=./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=clip_gt_arxiv
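A quick self-contained illustration of why the space matters (a throwaway assignment, safe to run anywhere; the path is just a string here):

```shell
# Broken form: with a space after "=", the shell assigns an EMPTY value to
# model_path and then tries to EXECUTE "./vicuna-7b-v1.5-16k" as a command,
# which is exactly where the "Permission denied" / "not found" lines in the
# log above come from:
#   model_path= ./vicuna-7b-v1.5-16k

# Correct form: no spaces around "=" in a shell assignment.
model_path=./vicuna-7b-v1.5-16k
echo "$model_path"
```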

Also, have you tried moving graphgpt_stage1.sh into the GraphGPT directory and running it there?

Melo-1017 commented 4 months ago

I just tried running it again under GraphGPT with the same result. Here is my command:

# to fill in the following path to run the first stage of our GraphGPT!
model_path=/root/nas/models_hf/vicuna-7b-v1.5
instruct_ds=/root/nas/GraphGPT/train_instruct_graphmatch.json
graph_data_path=/root/nas/GraphGPT/graphgpt/graph_data/graph_data_all.pt
pretra_gnn=/root/nas/GraphGPT/graphgpt/clip_gt_arxiv
output_model=/root/nas/GraphGPT/checkpoints/stage_1

wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

My error output:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 122699) of binary: /opt/conda/envs/graphgpt/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_06:36:27
  host      : e2b5ff656edd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 122700)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_06:36:27
  host      : e2b5ff656edd
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 122701)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_06:36:27
  host      : e2b5ff656edd
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 122702)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_06:36:27
  host      : e2b5ff656edd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 122699)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
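The `wandb: command not found` line at the top of this log is a separate, minor problem: the W&B command-line tool is not installed in the environment, so the `wandb offline` line in the script fails. Two possible ways out (assuming pip is available in the conda env):

```shell
# Option 1: install the wandb CLI so that "wandb offline" in the script works:
#   pip install wandb

# Option 2: skip W&B entirely via its WANDB_MODE environment variable,
# exported before launching the training script:
export WANDB_MODE=disabled
```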
xxrrnn commented 4 months ago

@tjb-tech Thanks for your earlier reply. I've fixed the assignments accordingly and am running the sh file from the GraphGPT folder, but I still get the "no module named graphgpt" problem. File contents:

# to fill in the following path to run the first stage of our GraphGPT!
model_path=./vicuna-7b-v1.5-16k
instruct_ds=./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=clip_gt_arxiv

wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

Output:

(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# ls
LICENSE    assets                 data        graphgpt            images      requirements.txt  stage_1  text-graph-grounding  vicuna-7b-v1.5-16k
README.md  clip_gt_arxiv_pub.pkl  graph_data  graphgpt_stage1.sh  playground  scripts           tests    vicuna-7b-v1.5        wandb
(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# sh graphgpt_stage1.sh
W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 19846) of binary: /root/miniconda3/envs/graphgpt/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_14:47:56
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 19847)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_14:47:56
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 19848)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_14:47:56
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 19849)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_14:47:56
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 19846)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
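For what it's worth, this import failure can be reproduced with plain Python, independent of GraphGPT: when a script is launched by file path, `sys.path[0]` is the script's own directory (here `graphgpt/train`), not the working directory, so the top-level `graphgpt` package is not importable even from the correct repo root. A sketch with a throwaway package (hypothetical `pkg` layout), including the `PYTHONPATH` workaround:

```shell
# Recreate the repo layout in miniature: pkg/train/main.py imports the
# top-level package "pkg", just as train_mem.py imports "graphgpt".
tmp=$(mktemp -d)
mkdir -p "$tmp/pkg/train"
touch "$tmp/pkg/__init__.py" "$tmp/pkg/train/__init__.py"
echo "import pkg" > "$tmp/pkg/train/main.py"
cd "$tmp"

# Fails even though we are in the "repo root": sys.path[0] is pkg/train.
python3 pkg/train/main.py 2>&1 | grep -q "No module named 'pkg'" && echo reproduced

# Putting the repo root on PYTHONPATH makes the import succeed:
PYTHONPATH="$tmp" python3 pkg/train/main.py && echo fixed
```

If the repository ships a package definition, `pip install -e .` from the repo root would be the more permanent alternative to exporting `PYTHONPATH`.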
tjb-tech commented 4 months ago

Hi, I haven't run into this problem before. Could you try changing python to python3, or replacing python -m torch.distributed.run with torchrun? Concretely:

# to fill in the following path to run the first stage of our GraphGPT!
model_path=/root/nas/models_hf/vicuna-7b-v1.5
instruct_ds=/root/nas/GraphGPT/train_instruct_graphmatch.json
graph_data_path=/root/nas/GraphGPT/graphgpt/graph_data/graph_data_all.pt
pretra_gnn=/root/nas/GraphGPT/graphgpt/clip_gt_arxiv
output_model=/root/nas/GraphGPT/checkpoints/stage_1

wandb offline
python3 -m torch.distributed.run  --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb
tjb-tech commented 4 months ago
You could give the method above a try.
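For reference, `torchrun` (shipped with PyTorch since 1.10) is the console-script form of `python -m torch.distributed.run`, so the swap only changes the launcher name; every flag stays the same (the remaining training arguments, elided here, are identical to the full script above):

```shell
torchrun --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path}
    # ...plus the same --version/--data_path/... flags as in the full script
```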

Melo-1017 commented 4 months ago

@tjb-tech Hi, I just tried this method and the error seems to have changed:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
  File "graphgpt/train/train_mem.py", line 4, in <module>
  File "graphgpt/train/train_mem.py", line 4, in <module>
Traceback (most recent call last):
      File "graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (        
from graphgpt.train.llama_flash_attn_monkey_patch import (from graphgpt.train.llama_flash_attn_monkey_patch import (

ModuleNotFoundError: No module named 'graphgpt'
ModuleNotFoundErrorModuleNotFoundError: : No module named 'graphgpt'No module named 'graphgpt'

    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 123076) of binary: /opt/conda/envs/graphgpt/bin/python3
Traceback (most recent call last):
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 123077)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 123078)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 123079)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 123076)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
tjb-tech commented 4 months ago

@tjb-tech Hi, I just tried this method and the error seems to have changed:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 123076) of binary: /opt/conda/envs/graphgpt/bin/python3
Traceback (most recent call last):
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 123077)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 123078)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 123079)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 123076)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Have you tried torchrun? Or changing `python3 -m torch.distributed.run` to `python3.8 -m torch.distributed.run` or `python3.8 torch.distributed.run`? Since we have never run into this problem before, please try each of them.

Melo-1017 commented 4 months ago


Have you tried torchrun? Or changing `python3 -m torch.distributed.run` to `python3.8 -m torch.distributed.run` or `python3.8 torch.distributed.run`? Since we have never run into this problem before, please try each of them.

OK. With `python3.8 -m torch.distributed.run` the same error still occurs, and `python3.8 torch.distributed.run` prints:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
python3.8: can't open file 'torch.distributed.run': [Errno 2] No such file or directory
tjb-tech commented 4 months ago


Hello, the cause of this error is still unclear (it runs fine on my side, but some users have reported the same problem). As a temporary workaround, you can add the following code at the very top of train_mem.py to add the path explicitly:

import os
import sys

# Directory containing this file (graphgpt/train/)
curPath = os.path.abspath(os.path.dirname(__file__))
# Two levels up: the repository root that contains the graphgpt package
rootPath = os.path.split(os.path.split(curPath)[0])[0]
print(curPath, rootPath)
sys.path.append(rootPath)

We will look into the root cause of this problem later.
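As a self-contained illustration of why that workaround helps (the `demo_pkg` package and the temporary directory below are made up for this example): Python can only import a package whose *parent* directory is on `sys.path`, which is exactly the role `rootPath` plays for the graphgpt package.

```python
import os
import sys
import tempfile

# Build a throwaway package: <root>/demo_pkg/__init__.py
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "demo_pkg"))
with open(os.path.join(root, "demo_pkg", "__init__.py"), "w") as f:
    f.write("VALUE = 42\n")

# The import only succeeds once the package's parent directory is on
# sys.path -- the same thing sys.path.append(rootPath) does above.
sys.path.append(root)
import demo_pkg

print(demo_pkg.VALUE)  # 42
```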

xxrrnn commented 4 months ago

I solved the "no module" problem by modifying an environment variable: method 2 in this post https://blog.csdn.net/weixin_48594878/article/details/120461124

What I used was: `export PYTHONPATH=$PYTHONPATH:/root/autodl-tmp/GraphGPT/graphgpt` followed by `source /etc/profile`
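The same mechanism, sketched with a child interpreter (`demo_pkg` and the temporary directory are hypothetical): entries in PYTHONPATH are added to `sys.path` of every Python process that inherits the variable. Note that the entry has to be the directory *containing* the package, i.e. the repository root in the case of `import graphgpt`.

```python
import os
import subprocess
import sys
import tempfile

# Throwaway package: <root>/demo_pkg/__init__.py
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "demo_pkg"))
with open(os.path.join(root, "demo_pkg", "__init__.py"), "w") as f:
    f.write("VALUE = 42\n")

# Equivalent of `export PYTHONPATH=$PYTHONPATH:<repo root>` before
# launching training: the child process can now resolve the package.
env = dict(os.environ, PYTHONPATH=root)
result = subprocess.run(
    [sys.executable, "-c", "import demo_pkg; print(demo_pkg.VALUE)"],
    env=env, capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # 42
```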

tjb-tech commented 4 months ago

Thank you for your answer!

xxrrnn commented 4 months ago

After fixing the "no module" error, the following error appeared. How can I resolve it?

You are using a model of type llama to instantiate a model of type GraphLlama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards:   0%|                                                                                                                     | 0/2 [00:00<?, ?it/s]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1283 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1284) of binary: /root/miniconda3/envs/graphgpt/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
tjb-tech commented 4 months ago
This failure happens while the model is being loaded. Have you adjusted `--nnodes=1 --nproc_per_node=4` to match your machine?

xxrrnn commented 4 months ago
OK, I have adjusted it, but a new error appeared: `AttributeError: 'GraphLlamaConfig' object has no attribute 'pretrain_graph_model_path'`

tjb-tech commented 4 months ago
You can refer to issue #7

xxrrnn commented 4 months ago
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 20, in <module>
    train()
  File "/root/autodl-tmp/GraphGPT/graphgpt/train/train_graph.py", line 871, in train
    model_graph_dict = model.get_model().initialize_graph_modules(
  File "/root/autodl-tmp/GraphGPT/graphgpt/model/GraphLlama.py", line 139, in initialize_graph_modules
    clip_graph, args= load_model_pretrained(CLIP, self.config.pretrain_graph_model_path) 
  File "/root/autodl-tmp/GraphGPT/graphgpt/model/GraphLlama.py", line 54, in load_model_pretrained
    assert osp.exists(osp.join(pretrain_model_path, 'config.json')), 'config.json missing'
AssertionError: config.json missing

Sorry, I still get an error here. What could be the problem? There is a config.json under the pretrain_graph_model_path written in the vicuna config, but the assertion still fails:

{
  "_name_or_path": "vicuna-7b-v1.5-16k",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_sequence_length": 16384,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 32000, 
  "graph_hidden_size": 128, 
  "pretrain_graph_model_path": "/root/autodl-tmp/GraphGPT/Arxiv-PubMed-GraphCLIP-GT/"
}
(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT/Arxiv-PubMed-GraphCLIP-GT# ls
clip_gt_arxiv_pub.pkl  config.json
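To narrow down where the assertion fails, the check can be reproduced in isolation. Below is a minimal checker (the function name and the temporary layout are made up for illustration) that reads a model config and reports whether config.json actually exists under the pretrain_graph_model_path it contains, which is the same condition the assert tests:

```python
import json
import os.path as osp
import tempfile

def check_pretrain_graph_path(config_file):
    """Report whether pretrain_graph_model_path in config_file contains config.json."""
    with open(config_file) as f:
        cfg = json.load(f)
    path = cfg.get("pretrain_graph_model_path")
    if path is None:
        return "pretrain_graph_model_path is missing from the config"
    target = osp.join(path, "config.json")
    return f"{target}: {'found' if osp.exists(target) else 'MISSING'}"

# Demo: a config whose pretrain_graph_model_path points at a directory
# with no config.json, reproducing the AssertionError condition.
tmp = tempfile.mkdtemp()
graph_dir = osp.join(tmp, "graph_model")  # intentionally never created
cfg_file = osp.join(tmp, "model_config.json")
with open(cfg_file, "w") as f:
    json.dump({"pretrain_graph_model_path": graph_dir}, f)

print(check_pretrain_graph_path(cfg_file))
```

Running this against the real config.json shown above would reveal whether the path the loader resolves is the one you expect.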