Open LeSoleilGo opened 2 months ago
hi Le, I had the same problem, have you solved it yet?
I face the same problem. I think you can also see it reported like this:
(ParaWorker pid=1955026) Error: Peer-to-peer access is unsupported on this platform.
(ParaWorker pid=1955026) In the current version of distserve, it is necessary to use a platform that supports GPU P2P access.
(ParaWorker pid=1955026) Exiting...
I don't know how to solve this problem; please @ me if you find any solution. Thank you.
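Before digging further, it may help to confirm whether the GPUs on the box actually support peer-to-peer access. Here is a minimal sketch using PyTorch's `torch.cuda.can_device_access_peer` (assuming PyTorch is installed; `p2p_pairs` is just a helper I made up for illustration):

```python
import itertools


def p2p_pairs(num_devices):
    """Return all ordered (src, dst) device pairs to test, src != dst."""
    return [(a, b) for a, b in itertools.permutations(range(num_devices), 2)]


def check_p2p():
    """Print whether each visible GPU pair supports P2P access."""
    import torch  # imported here so the helper above works without PyTorch
    n = torch.cuda.device_count()
    if n < 2:
        print(f"Only {n} visible GPU(s); the P2P check needs at least 2.")
        return
    for src, dst in p2p_pairs(n):
        ok = torch.cuda.can_device_access_peer(src, dst)
        print(f"GPU {src} -> GPU {dst}: {'P2P OK' if ok else 'no P2P'}")


if __name__ == "__main__":
    try:
        check_p2p()
    except ImportError:
        print("PyTorch not installed; skipping runtime check.")
```

If any pair reports "no P2P" (common across PCIe root complexes, or inside some virtualized environments), distserve's error above is expected on that machine.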
hi Le, I had the same problem, have you solved it yet?
I think if you want to use PP, you need to have multiple nodes; this is just my guess. TP can be used correctly.
I use `CUDA_VISIBLE_DEVICES=0 python examples/offline.py`, and it shows:
(autoscaler +6s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +6s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +41s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
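For what it's worth, the `{'GPU': 2.0}` in that error suggests the example asks Ray for two GPUs while `CUDA_VISIBLE_DEVICES=0` exposes only one, so the request can never be scheduled. A rough sketch of the accounting (the `num_instances=2` for separate prefill/decode instances is my assumption about the example, not something I verified in the code):

```python
import os


def required_gpus(tp, pp, num_instances=2):
    """Assumed accounting: each instance needs tp * pp GPUs, and a
    disaggregated setup runs separate prefill and decode instances."""
    return tp * pp * num_instances


def visible_gpus():
    """Count devices exposed via CUDA_VISIBLE_DEVICES (None = unrestricted)."""
    env = os.environ.get("CUDA_VISIBLE_DEVICES")
    if env is None:
        return None
    return len([d for d in env.split(",") if d.strip()])


if __name__ == "__main__":
    need = required_gpus(tp=1, pp=1)
    have = visible_gpus()
    if have is not None and have < need:
        print(f"Ray will stall: {need} GPU(s) requested, {have} visible.")
```

Under that assumption, even tp=1, pp=1 needs two GPUs, which would explain the autoscaler error on a single visible device.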
It doesn't seem to be a problem with multiple nodes; it may be caused by specifying the local model with --model. You can convert the model like I did, copy the relevant configuration files, and then use the converted model to verify it:
python distserve/downloader/converter.py \
--input "/workspace/models/opt-13b/*.bin" \
--output /workspace/models/distserve/opt-13b \
--dtype float16 \
--model opt
cp *.{json,txt} /workspace/models/distserve/opt-13b
and then use:
--model /workspace/models/distserve/opt-13b
Your suggestion works well; now I can run it. Thanks for your help!
I use the offline example. When I use tensor_parallel_size=1, pipeline_parallel_size=1, the result is correct. But with tensor_parallel_size=2, pipeline_parallel_size=2, it loops infinitely, as shown below.
Here is my code. Thank you.