Closed xxrrnn closed 4 months ago
I am already inside the GraphGPT folder and ran the script with sh ./GraphGPT/scripts/tune_script/graphgpt_stage1.sh, but I still get this error. How can I fix it?
Hi, could you please share the exact error message?
@tjb-tech Hi, I'm running into the same problem. Here is the command I run:
(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash scripts/tune_script/graphgpt_stage1.sh
Here is the error it produces:
(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash scripts/tune_script/graphgpt_stage1.sh
scripts/tune_script/graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23255) of binary: /opt/conda/envs/graphgpt/bin/python
Traceback (most recent call last):
File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
main()
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-03-20_09:24:02
host : e2b5ff656edd
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 23256)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-03-20_09:24:02
host : e2b5ff656edd
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 23257)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-03-20_09:24:02
host : e2b5ff656edd
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 23258)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-20_09:24:02
host : e2b5ff656edd
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 23255)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
A previous issue mentioned that for ModuleNotFoundError: No module named 'graphgpt' you should change the working directory to GraphGPT before running the script, but I checked and the path is fine: I am already inside the GraphGPT directory. Have you seen this situation before? How should I handle it?
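For what it's worth, the failure can be reproduced outside of GraphGPT entirely (a hedged sketch with a throwaway demo/ directory, not from the thread): when Python executes a script by path, sys.path[0] is the script's own directory rather than the shell's working directory, and the workers spawned by torch.distributed.run appear to be launched that way, so import graphgpt can fail even when the launch command is issued from the repo root:

```shell
# Hypothetical minimal reproduction (demo/ is a throwaway directory, not part
# of GraphGPT): a package whose importability depends on sys.path.
mkdir -p demo/graphgpt/train
touch demo/graphgpt/__init__.py demo/graphgpt/train/__init__.py
printf 'import graphgpt\nprint("import ok")\n' > demo/graphgpt/train/train_mem.py

# Script-style launch: sys.path[0] is demo/graphgpt/train, so the import fails
# even though the shell is "in the right directory".
(cd demo && python3 graphgpt/train/train_mem.py 2>&1 | tail -n 1)

# With the repo root on PYTHONPATH, the same launch succeeds.
(cd demo && PYTHONPATH="$PWD" python3 graphgpt/train/train_mem.py)
```

If this matches the cause, exporting PYTHONPATH before launching (or, assuming the repo ships an installable setup, pip install -e . from the repo root) would make the package visible to every worker.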
My error output is the same as theirs.
Running into the same issue here.
Hi, have you tried placing the graphgpt_stage1.sh script in the GraphGPT directory and running it from there? Also, could you post the exact contents of the script you are using? That would help us debug.
@tjb-tech Hi, I am now running from the GraphGPT directory. Below are the script contents and the output. Besides the "no module" error there are other errors too; I'd appreciate any help, thanks.
# to fill in the following path to run the first stage of our GraphGPT!
model_path= ./vicuna-7b-v1.5-16k
instruct_ds= ./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn= clip_gt_arxiv
output_model=../stage_1
wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
graphgpt/train/train_mem.py \
--model_name_or_path ${model_path} \
--version v1 \
--data_path ${instruct_ds} \
--graph_content ./arxiv_ti_ab.json \
--graph_data_path ${graph_data_path} \
--graph_tower ${pretra_gnn} \
--tune_graph_mlp_adapter True \
--graph_select_layer -2 \
--use_graph_start_end \
--bf16 True \
--output_dir ${output_model} \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2400 \
--save_total_limit 1 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to wandb
Output:
(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# ls
LICENSE assets data graphgpt images requirements.txt stage_1 text-graph-grounding vicuna-7b-v1.5-16k
README.md clip_gt_arxiv_pub.pkl graph_data graphgpt_stage1.sh playground scripts tests vicuna-7b-v1.5 wandb
(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# sh ./scripts/tune_script/graphgpt_stage1.sh
./scripts/tune_script/graphgpt_stage1.sh: 2: ./vicuna-7b-v1.5-16k: Permission denied
./data/graph_matching.json: 1: Syntax error: Unterminated quoted string
./scripts/tune_script/graphgpt_stage1.sh: 5: clip_gt_arxiv: not found
W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 10, in <module>
from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 10, in <module>
from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 10, in <module>
from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 10, in <module>
from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 19608) of binary: /root/miniconda3/envs/graphgpt/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
main()
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-03-30_14:34:46
host : autodl-container-d05b4ca599-9ff30ed2
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 19609)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-03-30_14:34:46
host : autodl-container-d05b4ca599-9ff30ed2
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 19610)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-03-30_14:34:46
host : autodl-container-d05b4ca599-9ff30ed2
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 19611)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-30_14:34:46
host : autodl-container-d05b4ca599-9ff30ed2
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 19608)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Hi, the following lines in your shell script are incorrect: a shell variable assignment must not have spaces around the "=", otherwise the shell misinterprets the line and errors out:
model_path= ./vicuna-7b-v1.5-16k
instruct_ds= ./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn= clip_gt_arxiv
They should be:
model_path=./vicuna-7b-v1.5-16k
instruct_ds=./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=clip_gt_arxiv
Also, have you tried placing the graphgpt_stage1.sh script in the GraphGPT directory and running it from there?
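For context on why those stray spaces produce the "Permission denied" and "not found" lines in the log above, a small sketch (throwaway variable, not GraphGPT-specific): in POSIX shells, `name= value` is not an assignment at all; it runs `value` as a command with `name` set to the empty string in its environment.

```shell
# Wrong: the shell tries to execute ./vicuna-7b-v1.5-16k as a command, with
# model_path="" in its environment -- hence "Permission denied" in the log:
#   model_path= ./vicuna-7b-v1.5-16k
# Right: no spaces around '=':
model_path=./vicuna-7b-v1.5-16k
echo "$model_path"
```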
我刚才也尝试了重新在GraphGPT下运行,结果同样,下面是我的运行指令:
# to fill in the following path to run the first stage of our GraphGPT!
model_path=/root/nas/models_hf/vicuna-7b-v1.5
instruct_ds=/root/nas/GraphGPT/train_instruct_graphmatch.json
graph_data_path=/root/nas/GraphGPT/graphgpt/graph_data/graph_data_all.pt
pretra_gnn=/root/nas/GraphGPT/graphgpt/clip_gt_arxiv
output_model=/root/nas/GraphGPT/checkpoints/stage_1
wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
graphgpt/train/train_mem.py \
--model_name_or_path ${model_path} \
--version v1 \
--data_path ${instruct_ds} \
--graph_content ./arxiv_ti_ab.json \
--graph_data_path ${graph_data_path} \
--graph_tower ${pretra_gnn} \
--tune_graph_mlp_adapter True \
--graph_select_layer -2 \
--use_graph_start_end \
--bf16 True \
--output_dir ${output_model} \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2400 \
--save_total_limit 1 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to wandb
My error output:
(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 122699) of binary: /opt/conda/envs/graphgpt/bin/python
Traceback (most recent call last):
File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
main()
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-03-30_06:36:27
host : e2b5ff656edd
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 122700)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-03-30_06:36:27
host : e2b5ff656edd
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 122701)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-03-30_06:36:27
host : e2b5ff656edd
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 122702)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-30_06:36:27
host : e2b5ff656edd
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 122699)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
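As an aside, the "wandb: command not found" message is independent of the crash: the wandb CLI is simply not installed in the active environment (it is needed for the `wandb offline` call on line 8 of the script and for `--report_to wandb`). A hedged guard sketch, so the script degrades gracefully instead of erroring:

```shell
# Sketch: skip W&B cleanly when the CLI is missing
# (install it with `pip install wandb` to get logging back).
if command -v wandb >/dev/null 2>&1; then
  wandb offline
  wandb_status="available"
else
  wandb_status="missing"
fi
echo "wandb CLI: ${wandb_status}"
```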
@tjb-tech Thank you for your earlier reply. I have fixed those lines and run the sh file inside the GraphGPT folder, but I still get the No module named 'graphgpt' error. Script contents:
# to fill in the following path to run the first stage of our GraphGPT!
model_path=./vicuna-7b-v1.5-16k
instruct_ds=./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=clip_gt_arxiv
wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
graphgpt/train/train_mem.py \
--model_name_or_path ${model_path} \
--version v1 \
--data_path ${instruct_ds} \
--graph_content ./arxiv_ti_ab.json \
--graph_data_path ${graph_data_path} \
--graph_tower ${pretra_gnn} \
--tune_graph_mlp_adapter True \
--graph_select_layer -2 \
--use_graph_start_end \
--bf16 True \
--output_dir ${output_model} \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2400 \
--save_total_limit 1 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to wandb
Output:
(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# ls
LICENSE assets data graphgpt images requirements.txt stage_1 text-graph-grounding vicuna-7b-v1.5-16k
README.md clip_gt_arxiv_pub.pkl graph_data graphgpt_stage1.sh playground scripts tests vicuna-7b-v1.5 wandb
(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# sh graphgpt_stage1.sh
W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 10, in <module>
from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 10, in <module>
from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 10, in <module>
from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 10, in <module>
from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 19846) of binary: /root/miniconda3/envs/graphgpt/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
main()
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-03-30_14:47:56
host : autodl-container-d05b4ca599-9ff30ed2
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 19847)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-03-30_14:47:56
host : autodl-container-d05b4ca599-9ff30ed2
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 19848)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-03-30_14:47:56
host : autodl-container-d05b4ca599-9ff30ed2
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 19849)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-30_14:47:56
host : autodl-container-d05b4ca599-9ff30ed2
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 19846)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Hi, I have never run into this problem before. Could you try replacing python with python3, or replacing python -m torch.distributed.run with torchrun?
Concretely:
# to fill in the following path to run the first stage of our GraphGPT!
model_path=/root/nas/models_hf/vicuna-7b-v1.5
instruct_ds=/root/nas/GraphGPT/train_instruct_graphmatch.json
graph_data_path=/root/nas/GraphGPT/graphgpt/graph_data/graph_data_all.pt
pretra_gnn=/root/nas/GraphGPT/graphgpt/clip_gt_arxiv
output_model=/root/nas/GraphGPT/checkpoints/stage_1
wandb offline
python3 -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
graphgpt/train/train_mem.py \
--model_name_or_path ${model_path} \
--version v1 \
--data_path ${instruct_ds} \
--graph_content ./arxiv_ti_ab.json \
--graph_data_path ${graph_data_path} \
--graph_tower ${pretra_gnn} \
--tune_graph_mlp_adapter True \
--graph_select_layer -2 \
--use_graph_start_end \
--bf16 True \
--output_dir ${output_model} \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2400 \
--save_total_limit 1 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to wandb
You can try the method above.
@tjb-tech Hi, I just tried this method; the error seems to have changed:
(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 4, in <module>
File "graphgpt/train/train_mem.py", line 4, in <module>
File "graphgpt/train/train_mem.py", line 4, in <module>
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (
from graphgpt.train.llama_flash_attn_monkey_patch import (from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ModuleNotFoundErrorModuleNotFoundError: : No module named 'graphgpt'No module named 'graphgpt'
from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 123076) of binary: /opt/conda/envs/graphgpt/bin/python3
Traceback (most recent call last):
File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
main()
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-03-30_06:54:22
host : e2b5ff656edd
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 123077)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-03-30_06:54:22
host : e2b5ff656edd
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 123078)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-03-30_06:54:22
host : e2b5ff656edd
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 123079)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-30_06:54:22
host : e2b5ff656edd
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 123076)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Have you tried torchrun? Or tried changing python3 -m torch.distributed.run to python3.8 -m torch.distributed.run, or to python3.8 torch.distributed.run? We have never run into this problem before, so please try all of these.
@tjb-tech Hello, I just tried this method, and the error seems to have changed:
(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
(the same traceback is printed by each of the four worker processes)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 123076) of binary: /opt/conda/envs/graphgpt/bin/python3
Traceback (most recent call last):
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 123077)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 123078)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 123079)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 123076)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
OK. With python3.8 -m torch.distributed.run the same error persists, and python3.8 torch.distributed.run shows:
(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
python3.8: can't open file 'torch.distributed.run': [Errno 2] No such file or directory
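For context on why that last variant fails this way: without `-m`, Python treats the argument as a *file path* relative to the current directory, while `-m` resolves it as a module on sys.path. A minimal sketch of the two behaviors in terms of the standard-library runpy module (the file name here is only illustrative):

```python
import runpy

# `python3.8 torch.distributed.run` behaves like run_path: it needs an
# actual file named exactly "torch.distributed.run" in the working
# directory, which does not exist -- hence "[Errno 2] No such file".
try:
    runpy.run_path("torch.distributed.run")
except OSError as e:
    print("file-path execution failed:", e)

# `python3.8 -m torch.distributed.run` behaves like run_module, which
# resolves the dotted name through sys.path like a normal import:
# runpy.run_module("torch.distributed.run", run_name="__main__")
```

So only the `-m` form (or the torchrun console script, which wraps it) can launch torch.distributed.run.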
Hello, the cause of this error is still unclear (it runs fine on my end, but some other users have reported the same problem). As a temporary workaround, you can add the following code at the very top of train_mem.py to add the path explicitly:
import os
import sys
# curPath: the directory containing train_mem.py (…/GraphGPT/graphgpt/train)
curPath = os.path.abspath(os.path.dirname(__file__))
# rootPath: two levels up, i.e. the repository root that contains the graphgpt/ package
rootPath = os.path.split(os.path.split(curPath)[0])[0]
print(curPath, rootPath)
sys.path.append(rootPath)
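For reference, here is what those two os.path.split calls compute for a checkout at a hypothetical location (the path is only an example; the snippet above derives it from __file__ at runtime):

```python
import os.path as osp

# Hypothetical location of the directory containing train_mem.py
cur_path = "/root/nas/GraphGPT/graphgpt/train"

# os.path.split peels off one trailing component per call, so applying
# it twice walks two levels up: graphgpt/train -> graphgpt -> repo root.
root_path = osp.split(osp.split(cur_path)[0])[0]
print(root_path)  # -> /root/nas/GraphGPT
```

Appending that repository root to sys.path is what makes `import graphgpt` resolvable.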
We will look into what is actually causing this problem later.
I solved the "no module" problem by modifying the environment variable (method 2 in this link): https://blog.csdn.net/weixin_48594878/article/details/120461124
What I used was: export PYTHONPATH=$PYTHONPATH:/root/autodl-tmp/GraphGPT/graphgpt followed by source /etc/profile
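Whether `import graphgpt` succeeds depends only on whether a directory *containing* the graphgpt/ package (i.e. the repository root) is on sys.path / PYTHONPATH. A minimal sketch (function name hypothetical) for checking what a given entry lets Python resolve:

```python
import importlib.util
import sys

def resolves(pkg_name, extra_root=None):
    """Return the file a package would be loaded from, or None if it is
    not importable. If extra_root is given, try with that directory
    prepended to sys.path (what exporting PYTHONPATH effectively does)."""
    if extra_root is not None:
        sys.path.insert(0, extra_root)
    try:
        spec = importlib.util.find_spec(pkg_name)
    except ModuleNotFoundError:
        return None
    return None if spec is None else spec.origin

print(resolves("json"))      # a stdlib module always resolves
print(resolves("graphgpt"))  # None unless a directory containing graphgpt/ is on sys.path
```

Running such a check inside the failing environment shows immediately whether the PYTHONPATH entry took effect for the spawned workers.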
Thank you for your answer!
After resolving the "no module" error, the following error appeared. How should I solve this one?
You are using a model of type llama to instantiate a model of type GraphLlama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1283 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1284) of binary: /root/miniconda3/envs/graphgpt/bin/python3
Traceback (most recent call last):
File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
main()
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
This failure happens while the model is being loaded. Have you adjusted --nnodes=1 --nproc_per_node=4 in the launch script to match your machine?
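The value of --nproc_per_node should match the number of GPUs actually visible on the node. A minimal sketch (function name hypothetical) for deriving that count from CUDA_VISIBLE_DEVICES; when the variable is unset, you would fall back to torch.cuda.device_count() or the machine spec:

```python
import os

def visible_gpu_count(env=None):
    """Count the GPUs a launched worker would see from CUDA_VISIBLE_DEVICES.
    Returns None when the variable is unset, leaving the fallback to the caller."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return None
    return len([d for d in visible.split(",") if d.strip()])

# e.g. a single-GPU machine should launch with --nproc_per_node=1
print(visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0"}))  # -> 1
```

Launching 4 workers on a machine with fewer GPUs is a common cause of failures during model loading.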
OK, I have modified that, but a new error has appeared:
AttributeError: 'GraphLlamaConfig' object has no attribute 'pretrain_graph_model_path'
You can refer to issue #7.
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 20, in <module>
train()
File "/root/autodl-tmp/GraphGPT/graphgpt/train/train_graph.py", line 871, in train
model_graph_dict = model.get_model().initialize_graph_modules(
File "/root/autodl-tmp/GraphGPT/graphgpt/model/GraphLlama.py", line 139, in initialize_graph_modules
clip_graph, args= load_model_pretrained(CLIP, self.config.pretrain_graph_model_path)
File "/root/autodl-tmp/GraphGPT/graphgpt/model/GraphLlama.py", line 54, in load_model_pretrained
assert osp.exists(osp.join(pretrain_model_path, 'config.json')), 'config.json missing'
AssertionError: config.json missing
Sorry, I still get an error here. Could you tell me where the problem is? The directory that pretrain_graph_model_path points to in the vicuna config does contain a config.json, but the assertion still fails:
{
"_name_or_path": "vicuna-7b-v1.5-16k",
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_sequence_length": 16384,
"max_position_embeddings": 4096,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 4.0,
"type": "linear"
},
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.31.0",
"use_cache": true,
"vocab_size": 32000,
"graph_hidden_size": 128,
"pretrain_graph_model_path": "/root/autodl-tmp/GraphGPT/Arxiv-PubMed-GraphCLIP-GT/"
}
(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT/Arxiv-PubMed-GraphCLIP-GT# ls
clip_gt_arxiv_pub.pkl config.json
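One way to narrow down the "config.json missing" assertion is to reproduce the exact check from load_model_pretrained against the exact string the training process receives; a relative path can break when the script's working directory differs, as can stray whitespace or a typo in pretrain_graph_model_path. A small diagnostic sketch (function name hypothetical) mirroring the osp.exists check:

```python
import os.path as osp

def diagnose_pretrain_path(pretrain_model_path):
    """Mirror the assertion in load_model_pretrained and report which
    part of the path is actually missing."""
    if not osp.isdir(pretrain_model_path):
        return f"directory does not exist: {pretrain_model_path!r}"
    cfg = osp.join(pretrain_model_path, "config.json")
    if not osp.exists(cfg):
        return f"directory exists but has no config.json: {cfg!r}"
    return "ok"

# Hypothetical path taken from the config above:
print(diagnose_pretrain_path("/root/autodl-tmp/GraphGPT/Arxiv-PubMed-GraphCLIP-GT/"))
```

Printing repr(self.config.pretrain_graph_model_path) just before the assertion would show whether the value the model actually loads matches the directory listed above.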