Closed — JKYtydt closed this issue 4 months ago
Can I have the entire command line you are using?
@evilsocket Hello, my command is as follows:
CUDA_VISIBLE_DEVICES=3 ./cake-cli --model /sdc/pre_trained_model/Llama3-Chinese-8B-Instruct --mode worker --name worker0 --topology topology.yml --address 0.0.0.0:10128
[2024-07-16T08:49:58Z INFO ] [Worker] dtype=F16 device=Cuda(CudaDevice(DeviceId(1))) mem=207.4 MiB
[2024-07-16T08:49:58Z INFO ] loading topology from topology.yml
[2024-07-16T08:49:58Z INFO ] loading configuration from /sdc/pre_trained_model/Llama3-Chinese-8B-Instruct/config.json
Error: No such file or directory (os error 2)
Can you run `ls -la /sdc/pre_trained_model/Llama3-Chinese-8B-Instruct/`?
@evilsocket Hello, I ran the command you provided; the result is as follows:
ls -la /sdc/pre_trained_model/Llama3-Chinese-8B-Instruct/
total 15693224
drwxrwxr-x 3 omnisky omnisky 4096 Jun 21 16:36 .
drwxrwxr-x 97 omnisky omnisky 4096 Jul 9 17:52 ..
-rw-rw-r-- 1 omnisky omnisky 1519 May 22 19:04 .gitattributes
drwxrwxr-x 2 omnisky omnisky 4096 May 22 19:04 .ipynb_checkpoints
-rw-rw-r-- 1 omnisky omnisky 1391 May 22 19:04 README.md
-rw-rw-r-- 1 omnisky omnisky 1003 May 22 19:04 config.json
-rw-rw-r-- 1 omnisky omnisky 9437 May 22 19:04 configuration_llama.py
-rw-rw-r-- 1 omnisky omnisky 121 May 22 19:04 generation_config.json
-rw-rw-r-- 1 omnisky omnisky 4976698592 May 22 19:04 model-00001-of-00004.safetensors
-rw-rw-r-- 1 omnisky omnisky 4999802616 May 22 19:04 model-00002-of-00004.safetensors
-rw-rw-r-- 1 omnisky omnisky 4915916080 May 22 19:04 model-00003-of-00004.safetensors
-rw-rw-r-- 1 omnisky omnisky 1168138808 May 22 19:04 model-00004-of-00004.safetensors
-rw-rw-r-- 1 omnisky omnisky 23950 May 22 19:04 model.safetensors.index.json
-rw-rw-r-- 1 omnisky omnisky 73580 May 22 19:04 modeling_llama.py
-rw-rw-r-- 1 omnisky omnisky 301 May 22 19:04 special_tokens_map.json
-rw-rw-r-- 1 omnisky omnisky 9084463 May 22 19:04 tokenizer.json
-rw-rw-r-- 1 omnisky omnisky 50941 May 22 19:04 tokenizer_config.json
Can you run it with RUST_LOG=debug cake-cli ... ?
I need to improve the error logging :P
@evilsocket Hello, I ran that command, and it doesn't seem to produce much extra detail. However, I noticed that the GPU I set as visible did not take effect; does it default to the first GPU? That made me wonder whether the error happens because my first GPU has no free memory. If so, is there a way to specify which GPU to use?
CUDA_VISIBLE_DEVICES=3 RUST_LOG=debug ./cake-cli --model /sdc/pre_trained_model/Llama3-Chinese-8B-Instruct --mode worker --name worker0 --topology topology.yml --address 0.0.0.0:10128
[2024-07-16T10:02:23Z DEBUG] device is cuda 0
[2024-07-16T10:02:23Z INFO ] [Worker] dtype=F16 device=Cuda(CudaDevice(DeviceId(1))) mem=207.4 MiB
[2024-07-16T10:02:23Z INFO ] loading topology from topology.yml
[2024-07-16T10:02:23Z INFO ] loading configuration from /sdc/pre_trained_model/Llama3-Chinese-8B-Instruct/config.json
Error: No such file or directory (os error 2)
My guess seems to be wrong: on my master node the first GPU is available, yet it still reports that the specified file cannot be found.
RUST_LOG=debug ./cake-cli --model /data1/pre_trained_model/Llama-3-8B-Instruct --topology topology.yml
[2024-07-16T10:20:21Z DEBUG] device is cuda 0
[2024-07-16T10:20:22Z INFO ] [Master] dtype=F16 device=Cuda(CudaDevice(DeviceId(1))) mem=222 MiB
[2024-07-16T10:20:22Z INFO ] loading topology from topology.yml
[2024-07-16T10:20:22Z INFO ] loading configuration from /data1/pre_trained_model/Llama-3-8B-Instruct/config.json
Error: No such file or directory (os error 2)
I'll push a fix to improve the error logging as soon as possible so we can debug this better 👍🏻
@JKYtydt I have the feeling it's not finding the topology.yml file. I added some more logging; you can try to rebuild with the new logs, and/or just make sure that topology.yml exists in the directory you run the command from.
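A minimal Rust sketch (not the actual cake code, just an illustration of the logging improvement discussed here): a bare `std::io::Error` prints only "No such file or directory (os error 2)", so when the CLI opens both `topology.yml` and `config.json`, the message is ambiguous. Wrapping the error with the offending path makes it unambiguous which open failed:

```rust
use std::path::Path;

// Hypothetical helper: read a file, but attach the path to any io::Error
// so "No such file or directory" says *which* file was not found.
fn read_with_path(path: &Path) -> std::io::Result<String> {
    std::fs::read_to_string(path).map_err(|e| {
        std::io::Error::new(e.kind(), format!("{}: {}", path.display(), e))
    })
}

fn main() {
    // With a missing file, the error now names the file, e.g.:
    // Error: topology.yml: No such file or directory (os error 2)
    if let Err(e) = read_with_path(Path::new("topology.yml")) {
        eprintln!("Error: {e}");
    }
}
```

This is the same effect crates like `anyhow` give via `.with_context(...)`; either way, a relative path like `topology.yml` is resolved against the process's working directory, which is why running the worker from a different directory can trigger exactly this failure.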
The model path is correct; I don't know which file this error is saying it cannot find.