Closed — JKYtydt closed this issue 4 months ago
Can I have the entire command line you are using?
@evilsocket Hello, my command is as follows:
CUDA_VISIBLE_DEVICES=3 ./cake-cli --model /sdc/pre_trained_model/Llama3-Chinese-8B-Instruct --mode worker --name worker0 --topology topology.yml --address 0.0.0.0:10128
[2024-07-16T08:49:58Z INFO ] [Worker] dtype=F16 device=Cuda(CudaDevice(DeviceId(1))) mem=207.4 MiB
[2024-07-16T08:49:58Z INFO ] loading topology from topology.yml
[2024-07-16T08:49:58Z INFO ] loading configuration from /sdc/pre_trained_model/Llama3-Chinese-8B-Instruct/config.json
Error: No such file or directory (os error 2)
Can you run `ls -la /sdc/pre_trained_model/Llama3-Chinese-8B-Instruct/`?
@evilsocket Hello, I ran the command you provided; the result is as follows:
ls -la /sdc/pre_trained_model/Llama3-Chinese-8B-Instruct/
total 15693224
drwxrwxr-x 3 omnisky omnisky 4096 Jun 21 16:36 .
drwxrwxr-x 97 omnisky omnisky 4096 Jul 9 17:52 ..
-rw-rw-r-- 1 omnisky omnisky 1519 May 22 19:04 .gitattributes
drwxrwxr-x 2 omnisky omnisky 4096 May 22 19:04 .ipynb_checkpoints
-rw-rw-r-- 1 omnisky omnisky 1391 May 22 19:04 README.md
-rw-rw-r-- 1 omnisky omnisky 1003 May 22 19:04 config.json
-rw-rw-r-- 1 omnisky omnisky 9437 May 22 19:04 configuration_llama.py
-rw-rw-r-- 1 omnisky omnisky 121 May 22 19:04 generation_config.json
-rw-rw-r-- 1 omnisky omnisky 4976698592 May 22 19:04 model-00001-of-00004.safetensors
-rw-rw-r-- 1 omnisky omnisky 4999802616 May 22 19:04 model-00002-of-00004.safetensors
-rw-rw-r-- 1 omnisky omnisky 4915916080 May 22 19:04 model-00003-of-00004.safetensors
-rw-rw-r-- 1 omnisky omnisky 1168138808 May 22 19:04 model-00004-of-00004.safetensors
-rw-rw-r-- 1 omnisky omnisky 23950 May 22 19:04 model.safetensors.index.json
-rw-rw-r-- 1 omnisky omnisky 73580 May 22 19:04 modeling_llama.py
-rw-rw-r-- 1 omnisky omnisky 301 May 22 19:04 special_tokens_map.json
-rw-rw-r-- 1 omnisky omnisky 9084463 May 22 19:04 tokenizer.json
-rw-rw-r-- 1 omnisky omnisky 50941 May 22 19:04 tokenizer_config.json
Can you run it with RUST_LOG=debug cake-cli ... ?
I need to improve the error logging :P
@evilsocket Hello, I ran that command, and it doesn't seem to produce much extra detail. However, I noticed that the GPU I set as visible did not take effect; does it default to the first GPU? That made me wonder whether the error happens because my first GPU has no free memory. If so, is there a way to specify which GPU to use?
CUDA_VISIBLE_DEVICES=3 RUST_LOG=debug ./cake-cli --model /sdc/pre_trained_model/Llama3-Chinese-8B-Instruct --mode worker --name worker0 --topology topology.yml --address 0.0.0.0:10128
[2024-07-16T10:02:23Z DEBUG] device is cuda 0
[2024-07-16T10:02:23Z INFO ] [Worker] dtype=F16 device=Cuda(CudaDevice(DeviceId(1))) mem=207.4 MiB
[2024-07-16T10:02:23Z INFO ] loading topology from topology.yml
[2024-07-16T10:02:23Z INFO ] loading configuration from /sdc/pre_trained_model/Llama3-Chinese-8B-Instruct/config.json
Error: No such file or directory (os error 2)
My guess seems to be wrong: on my master node the first GPU is available, yet it still reports that the specified file cannot be found.
RUST_LOG=debug ./cake-cli --model /data1/pre_trained_model/Llama-3-8B-Instruct --topology topology.yml
[2024-07-16T10:20:21Z DEBUG] device is cuda 0
[2024-07-16T10:20:22Z INFO ] [Master] dtype=F16 device=Cuda(CudaDevice(DeviceId(1))) mem=222 MiB
[2024-07-16T10:20:22Z INFO ] loading topology from topology.yml
[2024-07-16T10:20:22Z INFO ] loading configuration from /data1/pre_trained_model/Llama-3-8B-Instruct/config.json
Error: No such file or directory (os error 2)
I'll push a fix to improve the error logging as soon as possible so we can debug this better 👍🏻
@JKYtydt I have the feeling it's not finding the topology.yml file. I added some more logging; you can try to rebuild with the new logs, and/or just make sure that topology.yml exists in the directory you run the command from.
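A minimal Rust sketch (not the actual cake code, just an illustration of the logging improvement discussed here): a bare `std::io::Error` prints only "No such file or directory (os error 2)", so when the CLI opens both `topology.yml` and `config.json`, the message is ambiguous. Wrapping the error with the offending path makes it unambiguous which open failed:

```rust
use std::path::Path;

// Hypothetical helper: read a file, but attach the path to any io::Error
// so "No such file or directory" says *which* file was not found.
fn read_with_path(path: &Path) -> std::io::Result<String> {
    std::fs::read_to_string(path).map_err(|e| {
        std::io::Error::new(e.kind(), format!("{}: {}", path.display(), e))
    })
}

fn main() {
    // With a missing file, the error now names the file, e.g.:
    // Error: topology.yml: No such file or directory (os error 2)
    if let Err(e) = read_with_path(Path::new("topology.yml")) {
        eprintln!("Error: {e}");
    }
}
```

This is the same effect crates like `anyhow` give via `.with_context(...)`; either way, a relative path like `topology.yml` is resolved against the process's working directory, which is why running the worker from a different directory can trigger exactly this failure.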
The model path is correct; I don't know which file this error is saying it cannot find.