Hi @JingyaHuang,
When weights are not inlined, there are some effects that can reduce performance:

- The `--auto-cast` compiler options no longer apply to the weights, because the compiler can no longer assume their data type. If you do not explicitly downcast the model weights yourself, the underlying model may consume an fp32 weight and then have to downcast it at runtime to fp16/bf16 for the subsequent auto-casted compute (see the sketch below).
- Weight separation applies to `nn.Parameter`s only, so masking tensors and scalars stay inlined, which may improve performance.

We can look into this specific model and see which of the above effects is causing poor performance.
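For illustration only, here is a minimal sketch of what "explicitly downcast the weights before compiling with non-inlined weights" could look like. It assumes a recent torch-neuronx where `torch_neuronx.trace` accepts `inline_weights_to_neff` and the `--auto-cast` compiler flag; the toy module, input shape, and file name are placeholders rather than the actual Stable Diffusion reproduction.

```python
import torch
import torch_neuronx

# Toy stand-in for the real SD submodule (e.g. the UNet); the point is the
# explicit bf16 downcast of the nn.Parameters before tracing.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU()).eval()
model = model.to(torch.bfloat16)  # downcast weights ahead of compilation

example_inputs = torch.randn(1, 64, dtype=torch.bfloat16)

traced = torch_neuronx.trace(
    model,
    example_inputs,
    compiler_args="--auto-cast none",  # weights are already bf16, so no runtime fp32->bf16 casts
    inline_weights_to_neff=False,      # keep weights separated from the NEFF
)
torch.jit.save(traced, "model_non_inlined.pt")
```

If the weights are left in fp32, the runtime has to cast them on every invocation of the auto-casted compute, which is one of the effects listed above.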
Hi team,
The Optimum Neuron team observed quite a large difference in latency when models are compiled with non-inlined weights/NEFF.
TL;DR
The latency of non-inlined SD models is almost 3X that of inlined models.
Reproduction
Compilation
Inference
Results
We already place the weights on Neuron devices manually via this PR: https://github.com/huggingface/optimum-neuron/pull/584. Is there anything else we could or should do to improve latency while the weights/NEFF are not inlined? The current performance of non-inlined models is not encouraging.
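For context, this is roughly what manual weight placement looks like for a weight-separated trace. It is a sketch only: it assumes `torch_neuronx.move_trace_to_device(trace, device_id)` is the API used for this (treat the exact name and signature as an assumption), and it reuses the placeholder artifact and input shape from the sketch above.

```python
import torch
import torch_neuronx

# Load a weight-separated (non-inlined) artifact and move its weights onto a
# NeuronCore up front, instead of uploading them lazily at first inference.
traced = torch.jit.load("model_non_inlined.pt")
torch_neuronx.move_trace_to_device(traced, 0)  # 0 = NeuronCore index (assumed signature)

sample = torch.randn(1, 64, dtype=torch.bfloat16)  # illustrative input
output = traced(sample)
```

With placement done ahead of time, the remaining gap between inlined and non-inlined latency would point to the other effects mentioned above (weight layout/rearrangement at runtime, or runtime casting of fp32 weights).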