First of all, I'll try to trace the model (no scripting, no loop logic), just inferring the next token. I used the code below:
```python
import numpy as np
import torch
import coremltools as ct
# BaseModelExporter, BaseModelWrapper, and profile come from this project.


class CoreMLExporter(BaseModelExporter):
    """
    Export the model to CoreML.
    """

    @classmethod
    def run(cls, model: "BaseModelWrapper", output_name: str = "coreml_model.mlpackage") -> None:
        if not output_name.endswith(".mlpackage") and not output_name.endswith(".mlmodel"):
            raise ValueError("Output name must end with .mlpackage or .mlmodel")
        dummy_input = torch.randint(1, 1000, (1, 200))
        with profile("torch.jit.trace"):
            traced_model = torch.jit.trace(model.model, dummy_input)
        ct_shape = ct.Shape(shape=(1, ct.RangeDim(1, 1000)))
        with profile("coremltools.convert"):
            mlmodel = ct.convert(
                traced_model,
                convert_to="mlprogram",
                inputs=[ct.TensorType(shape=ct_shape, dtype=np.int64)],
                source="pytorch",
            )
        mlmodel.save(output_name)  # type: ignore
```
I got these tracing warnings:

```
Torch version 2.4.1+cu121 has not been tested with coremltools. You may run into unexpected errors. Torch 2.3.0 is the most recent version that has been tested.
Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 9.87it/s]
/home/sanguk/miniconda3/envs/latched/lib/python3.11/site-packages/transformers/modeling_utils.py:4674: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
/home/sanguk/miniconda3/envs/latched/lib/python3.11/site-packages/transformers/models/phi3/modeling_phi3.py:1090: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if sequence_length != 1:
You are not running the flash-attention implementation, expect numerical differences.
/home/sanguk/miniconda3/envs/latched/lib/python3.11/site-packages/transformers/models/phi3/modeling_phi3.py:205: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_len > self.original_max_position_embeddings:
/home/sanguk/miniconda3/envs/latched/lib/python3.11/site-packages/transformers/models/phi3/modeling_phi3.py:208: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  ext_factors = torch.tensor(self.short_factor, dtype=torch.float32, device=x.device)
/home/sanguk/miniconda3/envs/latched/lib/python3.11/site-packages/transformers/models/phi3/modeling_phi3.py:401: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
/home/sanguk/miniconda3/envs/latched/lib/python3.11/site-packages/transformers/models/phi3/modeling_phi3.py:417: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
```
Conversion code:
```python
# coreml.py (method body of CoreMLExporter, excerpted)
import numpy as np
import torch
import coremltools as ct
# HuggingFaceModelWrapper, BaseModelWrapper, and profile come from this project.


@classmethod
def run(cls, model: "BaseModelWrapper", output_name: str = "coreml_model.mlpackage") -> None:
    if not output_name.endswith(".mlpackage") and not output_name.endswith(".mlmodel"):
        raise ValueError("Output name must end with .mlpackage or .mlmodel")
    if isinstance(model, HuggingFaceModelWrapper):
        target_model = model.model
        dummy_input = torch.randint(1000, (1, 100))
        target_model.eval()
        target_model.to("cuda")
        with profile("torch.jit.trace"):
            with torch.no_grad():
                traced_model = torch.jit.trace(target_model, dummy_input.to("cuda"))
        ct_shape = ct.Shape(shape=(1, ct.RangeDim(1, 1000)))
        with profile("coremltools.convert"):
            mlmodel = ct.convert(
                traced_model,
                convert_to="neuralnetwork",
                inputs=[ct.TensorType(shape=ct_shape, dtype=np.int64, name="input_ids")],
                source="pytorch",
            )
        mlmodel.save(output_name)  # type: ignore
    else:
        raise ValueError(f"Unsupported model type: {type(model)}")
```
But when I moved the `.mlpackage` file to a Mac mini M1 and tested it through the Python interface, the outputs of most layers were NaN. Example output (truncated; every value in the tensor is NaN):
```
'hidden_states_727': array([[[[nan, nan, nan, ..., nan, nan, nan],
                              [nan, nan, nan, ..., nan, nan, nan],
                              ...,
                              [nan, nan, nan, ..., nan, nan, nan]]]], dtype=float32)}
```
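For reference, the Python-side check on the Mac mini looked roughly like this (a minimal sketch; the file name is a placeholder, and the input name `input_ids` is the one set in the conversion code):

```python
import numpy as np
import coremltools as ct

# Load the converted model on the Apple Silicon machine.
mlmodel = ct.models.MLModel("coreml_model.mlpackage")

# Feed a batch of token ids; "input_ids" matches the name set at conversion time.
input_ids = np.random.randint(1, 1000, size=(1, 100), dtype=np.int64)
outputs = mlmodel.predict({"input_ids": input_ids})

# Report the NaN fraction per output; in my case most layers were all NaN.
for name, value in outputs.items():
    print(name, np.isnan(value).mean())
```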
I found similar issues in the Core ML GitHub repository. They say converting to `mlprogram` might cause the problem, and suggest trying `neuralnetwork` instead of `mlprogram`.
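For what it's worth, coremltools can also be pinned to float32 compute precision when targeting `mlprogram`, which is a common workaround when the float16 default overflows into NaN. A sketch reusing `traced_model` and `ct_shape` from the conversion code above; I haven't verified that it fixes this particular model:

```python
import numpy as np
import coremltools as ct

# "mlprogram" defaults to float16 compute precision, which can overflow in
# LLM layers; forcing float32 is a common workaround for NaN outputs.
mlmodel = ct.convert(
    traced_model,  # traced model from the conversion code above
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT32,
    inputs=[ct.TensorType(shape=ct_shape, dtype=np.int64, name="input_ids")],
    source="pytorch",
)
```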
There is another similar issue here. From that issue:
> Thanks for the reporting. You might already check this, if not, could you check this? https://huggingface.co/blog/mistral-coreml
I've resolved the NaN bugs by following the instructions from the issues mentioned in the comment above.

For Core ML conversion, the steps below are required.

**1. Trace the model**
When we trace a model we must use `torch.jit.trace`. However, an LLM runs a loop inside its generation logic: transformers provides a `generate` function that supports beam search and other language-generation utilities, but as far as I know `model.generate(...)` does not support torch JIT tracing. Therefore we have to trace the model through its `forward()` function, and implement an extra `generate` function ourselves to fully generate sentences and control the maximum output token length; a sketch of this tracing step is below.

I have a related question about `onnx` export: does it have similar limitations?

Related resources: https://discuss.huggingface.co/t/generate-method-for-models-converted-to-torchscript/47728/2
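To make this concrete, here is a minimal sketch of the tracing step (the `NextTokenWrapper` name and the logits-only return are my own illustration, not code from this repo; `target_model` is the unwrapped HF model from the conversion code above):

```python
import torch


class NextTokenWrapper(torch.nn.Module):
    """Expose a plain tensor-in/tensor-out forward so torch.jit.trace can record it."""

    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Return only the logits tensor; trace cannot record the
        # dict-like output object that HF models normally return.
        return self.model(input_ids).logits


wrapper = NextTokenWrapper(target_model).eval()
dummy_input = torch.randint(1000, (1, 100))
with torch.no_grad():
    traced_model = torch.jit.trace(wrapper, dummy_input)
```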
**2. Script the model**
After tracing, because of the loop logic needed to generate full sentences, we need to convert the traced model into a script model that can carry the loop. A sketch of the loop is below. With this code we cannot control `max_length`; I still need to figure out how to implement a `generate()` function that handles `max_length`.
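The original sample code didn't survive the copy here, so below is a minimal sketch of the kind of loop I mean: greedy decoding wrapped around the traced next-token model. The class name and `eos_token_id=32000` are illustrative assumptions, and there is no `max_length` cap, which is exactly the open problem above:

```python
import torch


class GreedyGenerator(torch.nn.Module):
    """Scripted greedy-decoding loop around a traced next-token model."""

    def __init__(self, traced_model, eos_token_id: int):
        super().__init__()
        self.model = traced_model
        self.eos_token_id = eos_token_id

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Dummy init token so the loop body runs at least once
        # (assumes the EOS id is not 0).
        next_token = torch.zeros(1, 1, dtype=torch.long)
        # Data-dependent while loop: this is what torch.jit.script keeps
        # and torch.jit.trace would silently unroll to a fixed length.
        while bool((next_token != self.eos_token_id).all()):
            logits = self.model(input_ids)
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=1)
        return input_ids


generator = torch.jit.script(GreedyGenerator(traced_model, eos_token_id=32000))
```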
Sample inputs and outputs: (omitted)