Closed — apresunreve closed this issue 1 month ago
We need more information.
Can you try something like this:
# since the vision model uses PyTorch, this env variable may help check whether the error happens in PyTorch
export CUDA_LAUNCH_BLOCKING=1
# enable lmdeploy INFO logging; when the program hangs, please provide the log output
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline('liuhaotian/llava-v1.6-34b', backend_config=TurbomindEngineConfig(tp=8), log_level='INFO')
response = pipe(input_batches)
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.
Checklist
Describe the bug
I am using the LLaVA 34B model to generate captions for a large image dataset. I am running the pipeline with tp=8 on 8x V100 GPUs. It runs normally for some time, but nearly always hangs after a certain number of batches (e.g., 1k–5k batches). nvidia-smi shows one GPU at 0% utilization.
Reproduction
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline('liuhaotian/llava-v1.6-34b', backend_config=TurbomindEngineConfig(tp=8))
response = pipe(input_batches)
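Since the hang only appears after thousands of batches, one way to narrow it down (a suggestion, not part of the original report) is to feed the inputs in fixed-size chunks and log progress, so the log shows exactly which slice of the dataset was being processed when the program stalled. A minimal sketch, with the lmdeploy pipeline call abstracted behind a hypothetical `infer` callable so the loop itself is easy to test:

```python
from typing import Callable, List


def run_in_chunks(items: List[str],
                  infer: Callable[[List[str]], List[str]],
                  chunk_size: int = 64) -> List[str]:
    """Run `infer` over `items` in fixed-size chunks, logging progress.

    `infer` stands in for the real pipeline call, e.g. `lambda xs: pipe(xs)`
    (hypothetical wrapper). Printing the chunk range before each call makes
    the last logged range identify where a hang occurred.
    """
    results: List[str] = []
    for start in range(0, len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        print(f"processing items {start}..{start + len(chunk) - 1}")
        results.extend(infer(chunk))
    return results
```

With the real pipeline this would be called as `run_in_chunks(input_batches, lambda xs: pipe(xs))`; the chunk size is a tunable assumption, and smaller chunks localize the hang more precisely at the cost of more pipeline calls.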
Environment
Error traceback
No response