Open kalpesh22-21 opened 5 months ago
Converting the model is not hardware-specific, so you can convert the checkpoints on an A100 and build the engine on L40S.
Thanks for the prompt response. I bypassed the conversion phase by converting the model to int8 on A100s with TP=2, resulting in a checkpoint of ~45 GB, i.e. up to ~22 GB per rank.
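As a rough sanity check on those numbers (assuming Mixtral-8x7B's commonly cited ~46.7B parameters and roughly one byte per weight after int8 weight-only quantization; the real checkpoint keeps some tensors in bf16, so this is only back-of-the-envelope):

```python
# Back-of-the-envelope checkpoint size for int8 weight-only Mixtral-8x7B.
# 46.7e9 is the commonly cited parameter count; treat it as an assumption.
total_params = 46.7e9
bytes_total = total_params * 1      # ~1 byte per weight under int8
per_rank = bytes_total / 2          # TP=2 shards the weights across 2 ranks
print(f"~{bytes_total / 1e9:.0f} GB total, ~{per_rank / 1e9:.0f} GB per rank")
# -> ~47 GB total, ~23 GB per rank, consistent with the observed 45 GB / 22 GB
```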
I have a 2× L40S multi-GPU node, where I am trying to build an engine using the command below:
```bash
trtllm-build --checkpoint_dir /models/tensor-rt-models/Mixtral-8x7B-Instruct-v0.1/L40S/int8-2-gpu/ \
    --output_dir /models/tensor-rt-engine/Mixtral-8x7B-Instruct-v0.1/L40S/int8-2-gpu/ \
    --gemm_plugin bfloat16 \
    --use_custom_all_reduce enable \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --workers 2 \
    --cluster_key L40S \
    --max_batch_size 16 \
    --max_input_len 4096 \
    --max_output_len 2048 \
    --use_paged_context_fmha enable \
    --gpt_attention_plugin bfloat16
```
But I am facing the error: “[resizingAllocator.cpp::allocate::62] Error Code 1: Cuda Runtime (out of memory)”
I tried reducing max_batch_size, max_input_len, and max_output_len to the smallest values possible, assuming that would decrease the KV-cache memory that has to be allocated, but the problem persists.
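For reference, here is a minimal sketch of the KV-cache footprint implied by the original build settings, assuming Mixtral-8x7B's config values (32 layers, 8 KV heads under grouped-query attention, head dim 128, bf16 cache) and that KV heads are sharded across TP ranks; TRT-LLM's paged allocator will not match this exactly:

```python
# Rough per-rank KV-cache size for the build settings above (assumptions
# noted inline; this is an estimate, not TRT-LLM's exact allocation).
num_layers = 32            # num_hidden_layers in the Mixtral-8x7B config
num_kv_heads = 8           # num_key_value_heads (grouped-query attention)
head_dim = 128             # hidden_size 4096 / 32 attention heads
dtype_bytes = 2            # bf16 KV cache
tp_size = 2                # KV heads sharded across ranks

max_batch_size = 16
max_seq_len = 4096 + 2048  # max_input_len + max_output_len

# 2x for K and V, per token, per rank
per_token = 2 * num_layers * (num_kv_heads // tp_size) * head_dim * dtype_bytes
total = max_batch_size * max_seq_len * per_token
print(f"KV cache per rank: {total / 2**30:.1f} GiB")  # ~6.0 GiB
```

That ~6 GiB plus the ~22 GB of weights per rank should fit within an L40S's 48 GB, which suggests the OOM is happening during the builder's own allocations rather than in KV-cache pre-allocation, consistent with shrinking those flags not helping.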
Is there anything you can suggest?
Could you try batch size 1 and share the full log?
On the 2× L40S node, I am having trouble not only building the engine but also converting the model to int8.
Script 1:
```bash
python $TENSORRT_REPO/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --output_dir /models/tensor-rt-models/Mixtral-8x7B-Instruct-v0.1/L40S/int8-2-gpu/ \
    --dtype bfloat16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --load_model_on_cpu \
    --tp_size 2
```
Problem: the above script fails with a CUDA out-of-memory error even with --load_model_on_cpu, which is strange and looks like a bug, so I cannot convert the weights to int8 on this node.
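One way to narrow that down (a hedged sketch using plain transformers rather than the TRT-LLM conversion path, just to confirm the HF weights actually load on CPU on this node):

```python
# If this loads fine, the node has enough host RAM and the OOM is coming
# from inside convert_checkpoint.py despite --load_model_on_cpu.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="cpu",        # keep all weights off the GPUs
    low_cpu_mem_usage=True,  # avoid materializing the weights twice in RAM
)
print(next(model.parameters()).device)  # expect: cpu
```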
Solution: I bypassed the conversion phase by converting the model to int8 on A100s instead.
Is this allowed, or does the conversion also have to be done on the hardware that will be used to build the engine?