First of all, I'll try to trace the model (no scripting, no loop logic), just inferring the next token. I used the code below:
```python
import numpy as np
import torch
import coremltools as ct
# BaseModelExporter, BaseModelWrapper, and profile come from this project.


class CoreMLExporter(BaseModelExporter):
    """
    Export the model to CoreML.
    """

    @classmethod
    def run(cls, model: "BaseModelWrapper", output_name: str = "coreml_model.mlpackage") -> None:
        if not output_name.endswith(".mlpackage") and not output_name.endswith(".mlmodel"):
            raise ValueError("Output name must end with .mlpackage or .mlmodel")
        dummy_input = torch.randint(1, 1000, (1, 200))
        with profile("torch.jit.trace"):
            traced_model = torch.jit.trace(model.model, dummy_input)
        ct_shape = ct.Shape(shape=(1, ct.RangeDim(1, 1000)))
        with profile("coremltools.convert"):
            mlmodel = ct.convert(
                traced_model,
                convert_to="mlprogram",
                inputs=[ct.TensorType(shape=ct_shape, dtype=np.int64)],
                source="pytorch",
            )
        mlmodel.save(output_name)  # type: ignore
```
I got these tracing warnings:

```
Torch version 2.4.1+cu121 has not been tested with coremltools. You may run into unexpected errors. Torch 2.3.0 is the most recent version that has been tested.
Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 9.87it/s]
/home/sanguk/miniconda3/envs/latched/lib/python3.11/site-packages/transformers/modeling_utils.py:4674: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
/home/sanguk/miniconda3/envs/latched/lib/python3.11/site-packages/transformers/models/phi3/modeling_phi3.py:1090: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if sequence_length != 1:
You are not running the flash-attention implementation, expect numerical differences.
/home/sanguk/miniconda3/envs/latched/lib/python3.11/site-packages/transformers/models/phi3/modeling_phi3.py:205: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_len > self.original_max_position_embeddings:
/home/sanguk/miniconda3/envs/latched/lib/python3.11/site-packages/transformers/models/phi3/modeling_phi3.py:208: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  ext_factors = torch.tensor(self.short_factor, dtype=torch.float32, device=x.device)
/home/sanguk/miniconda3/envs/latched/lib/python3.11/site-packages/transformers/models/phi3/modeling_phi3.py:401: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
/home/sanguk/miniconda3/envs/latched/lib/python3.11/site-packages/transformers/models/phi3/modeling_phi3.py:417: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
```
Conversion code:
```python
# coreml.py (method body of CoreMLExporter, excerpted)
import numpy as np
import torch
import coremltools as ct
# HuggingFaceModelWrapper, BaseModelWrapper, and profile come from this project.


@classmethod
def run(cls, model: "BaseModelWrapper", output_name: str = "coreml_model.mlpackage") -> None:
    if not output_name.endswith(".mlpackage") and not output_name.endswith(".mlmodel"):
        raise ValueError("Output name must end with .mlpackage or .mlmodel")
    if isinstance(model, HuggingFaceModelWrapper):
        target_model = model.model
        dummy_input = torch.randint(1000, (1, 100))
        target_model.eval()
        target_model.to("cuda")
        with profile("torch.jit.trace"):
            with torch.no_grad():
                traced_model = torch.jit.trace(target_model, dummy_input.to("cuda"))
        ct_shape = ct.Shape(shape=(1, ct.RangeDim(1, 1000)))
        with profile("coremltools.convert"):
            mlmodel = ct.convert(
                traced_model,
                convert_to="neuralnetwork",
                inputs=[ct.TensorType(shape=ct_shape, dtype=np.int64, name="input_ids")],
                source="pytorch",
            )
        mlmodel.save(output_name)  # type: ignore
    else:
        raise ValueError(f"Unsupported model type: {type(model)}")
```
But when I moved the `.mlpackage` file to a Mac mini M1 and tested it through the Python interface, the outputs of most layers were NaN. Example output (truncated; every value in the tensor is NaN):
```
'hidden_states_727': array([[[[nan, nan, nan, ..., nan, nan, nan],
                              [nan, nan, nan, ..., nan, nan, nan],
                              ...,
                              [nan, nan, nan, ..., nan, nan, nan]]]], dtype=float32)}
```
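For reference, the Python-side check on the Mac mini looked roughly like this (a minimal sketch; the file name is a placeholder, and the input name `input_ids` is the one set in the conversion code):

```python
import numpy as np
import coremltools as ct

# Load the converted model on the Apple Silicon machine.
mlmodel = ct.models.MLModel("coreml_model.mlpackage")

# Feed a batch of token ids; "input_ids" matches the name set at conversion time.
input_ids = np.random.randint(1, 1000, size=(1, 100), dtype=np.int64)
outputs = mlmodel.predict({"input_ids": input_ids})

# Report the NaN fraction per output; in my case most layers were all NaN.
for name, value in outputs.items():
    print(name, np.isnan(value).mean())
```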
I found similar issues in the Core ML GitHub repository. They say converting to `mlprogram` might cause the problem, and suggest trying `neuralnetwork` instead of `mlprogram`.
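For what it's worth, coremltools can also be pinned to float32 compute precision when targeting `mlprogram`, which is a common workaround when the float16 default overflows into NaN. A sketch reusing `traced_model` and `ct_shape` from the conversion code above; I haven't verified that it fixes this particular model:

```python
import numpy as np
import coremltools as ct

# "mlprogram" defaults to float16 compute precision, which can overflow in
# LLM layers; forcing float32 is a common workaround for NaN outputs.
mlmodel = ct.convert(
    traced_model,  # traced model from the conversion code above
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT32,
    inputs=[ct.TensorType(shape=ct_shape, dtype=np.int64, name="input_ids")],
    source="pytorch",
)
```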
There is another similar issue here. From that issue:
> Thanks for the reporting. You might already check this, if not, could you check this? https://huggingface.co/blog/mistral-coreml
I've resolved the NaN bugs by following the instructions from the issues mentioned in the comment above.

For Core ML conversion, the steps below are required.

**1. Trace the model**
When we trace a model we must use `torch.jit.trace`. However, an LLM runs a loop inside its generation logic: transformers provides a `generate` function that supports beam search and other language-generation utilities, but as far as I know `model.generate(...)` does not support torch JIT tracing. Therefore we have to trace the model through its `forward()` function, and implement an extra `generate` function ourselves to fully generate sentences and control the maximum output token length; a sketch of this tracing step is below.

I have a related question about `onnx` export: does it have similar limitations?

Related resources: https://discuss.huggingface.co/t/generate-method-for-models-converted-to-torchscript/47728/2
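To make this concrete, here is a minimal sketch of the tracing step (the `NextTokenWrapper` name and the logits-only return are my own illustration, not code from this repo; `target_model` is the unwrapped HF model from the conversion code above):

```python
import torch


class NextTokenWrapper(torch.nn.Module):
    """Expose a plain tensor-in/tensor-out forward so torch.jit.trace can record it."""

    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Return only the logits tensor; trace cannot record the
        # dict-like output object that HF models normally return.
        return self.model(input_ids).logits


wrapper = NextTokenWrapper(target_model).eval()
dummy_input = torch.randint(1000, (1, 100))
with torch.no_grad():
    traced_model = torch.jit.trace(wrapper, dummy_input)
```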
**2. Script the model**
After tracing, because of the loop logic needed to generate full sentences, we need to convert the traced model into a script model that can carry the loop. A sketch of the loop is below. With this code we cannot control `max_length`; I still need to figure out how to implement a `generate()` function that handles `max_length`.
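The original sample code didn't survive the copy here, so below is a minimal sketch of the kind of loop I mean: greedy decoding wrapped around the traced next-token model. The class name and `eos_token_id=32000` are illustrative assumptions, and there is no `max_length` cap, which is exactly the open problem above:

```python
import torch


class GreedyGenerator(torch.nn.Module):
    """Scripted greedy-decoding loop around a traced next-token model."""

    def __init__(self, traced_model, eos_token_id: int):
        super().__init__()
        self.model = traced_model
        self.eos_token_id = eos_token_id

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Dummy init token so the loop body runs at least once
        # (assumes the EOS id is not 0).
        next_token = torch.zeros(1, 1, dtype=torch.long)
        # Data-dependent while loop: this is what torch.jit.script keeps
        # and torch.jit.trace would silently unroll to a fixed length.
        while bool((next_token != self.eos_token_id).all()):
            logits = self.model(input_ids)
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=1)
        return input_ids


generator = torch.jit.script(GreedyGenerator(traced_model, eos_token_id=32000))
```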
Sample inputs and outputs: (omitted)