aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Compiled model output for e5 model doesn't match model output #750

Open · aabbi opened this issue 1 year ago

aabbi commented 1 year ago

We are trying to run the e5 model on an inf2 instance. The model compiles fine and analyze reports no unsupported operators, but when we try it out on an example, the output of the Neuron-compiled model differs from the CPU version at the 4th decimal place and above (they match at 3 decimal places and below). We also tried this on a set of 300 examples and saw a difference in 10% of the output. Unfortunately, this results in different output in the application.

Here's the script we are using:

import torch
import torch_neuronx
from transformers import AutoTokenizer, AutoModel
import numpy as np

model_path = 'intfloat/e5-large-v2'
batch_size = 1
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, torchscript=True)
model.eval()

# Set up some example inputs                                                                                                                                                                                                                                                              
sequence_0 = "The company HuggingFace is based in New York City"
inputs = tokenizer([sequence_0], return_tensors="pt", max_length=512, padding="max_length", truncation=True)
attention_mask = inputs["attention_mask"]
input_ids = inputs["input_ids"]

# Run the original PyTorch model on CPU                                                                                                                                                                                                                                              
output_cpu = model(*(input_ids, attention_mask))
print("CPU RUN COMPLETED")

# Compile the model for Neuron                                                                                                                                                                                                                                                            
print("input_ids", input_ids.size())
print("attention_mask", attention_mask.size())
kwargs = {'compiler_args': ['--auto-cast', 'none']}
model_neuron = torch_neuronx.trace(model, (input_ids, attention_mask), **kwargs)
print("TRACING COMPLETED")

# Save the TorchScript for inference deployment
traced_filename = 'e5_traced_model.pt'
torch.jit.save(model_neuron, traced_filename)
print("SAVING COMPLETED")

# Load the TorchScript compiled model                                                                                                                                                                                                                                                     
model_neuron_loaded = torch.jit.load(traced_filename)

# Run inference using the Neuron model                                                                                                                                                                                                                                                    
output_neuron = model_neuron_loaded(*(input_ids, attention_mask))

orig_logits = output_cpu[0][0][:10]
target_logits = output_neuron[0][0][:10]

# Compare the results
print(f"CPU last_hidden_state:    {orig_logits}")
print(f"Neuron last_hidden_state: {target_logits}")

print(np.testing.assert_almost_equal(orig_logits.detach().numpy(), target_logits, decimal=4))

And here's the output:

CPU last_hidden_state:    tensor([[-0.0602, -1.0032,  0.2657,  ..., -1.1824,  0.7361,  0.6564],
        [ 0.0425, -1.3461,  0.6137,  ..., -1.0303,  0.2837,  0.3446],
        [ 0.6920, -1.2250,  0.5260,  ..., -1.2215,  0.3612,  0.3568],
        ...,
        [ 0.0852, -1.2544,  0.4905,  ..., -1.1209,  0.1062,  0.2485],
        [ 0.0747, -1.2812,  0.3300,  ..., -1.3801,  0.2274,  0.3231],
        [ 0.1351, -1.1897,  0.4969,  ..., -1.3630,  0.5207,  0.5292]],
       grad_fn=<SliceBackward0>)
Neuron last_hidden_state: tensor([[-0.0603, -1.0033,  0.2657,  ..., -1.1827,  0.7360,  0.6564],
        [ 0.0424, -1.3462,  0.6138,  ..., -1.0303,  0.2837,  0.3446],
        [ 0.6920, -1.2250,  0.5259,  ..., -1.2215,  0.3611,  0.3567],
        ...,
        [ 0.0852, -1.2545,  0.4905,  ..., -1.1209,  0.1062,  0.2485],
        [ 0.0746, -1.2812,  0.3300,  ..., -1.3801,  0.2274,  0.3231],
        [ 0.1350, -1.1898,  0.4970,  ..., -1.3632,  0.5207,  0.5293]])
Traceback (most recent call last):
  File "simple_script.py", line 47, in <module>
    print(np.testing.assert_almost_equal(orig_logits.detach().numpy(), target_logits, decimal=4))
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 583, in assert_almost_equal
    return assert_array_almost_equal(actual, desired, decimal, err_msg)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1046, in assert_array_almost_equal
    assert_array_compare(compare, x, y, err_msg=err_msg, verbose=verbose,
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 4 decimals

Mismatched elements: 891 / 10240 (8.7%)
Max absolute difference: 0.00044715
Max relative difference: 0.46428102
 x: array([[-0.0602, -1.0032,  0.2657, ..., -1.1824,  0.7361,  0.6564],
       [ 0.0425, -1.3461,  0.6137, ..., -1.0303,  0.2837,  0.3446],
       [ 0.692 , -1.225 ,  0.526 , ..., -1.2215,  0.3612,  0.3568],...
 y: array([[-0.0603, -1.0033,  0.2657, ..., -1.1827,  0.736 ,  0.6564],
       [ 0.0424, -1.3462,  0.6138, ..., -1.0303,  0.2837,  0.3446],
       [ 0.692 , -1.225 ,  0.5259, ..., -1.2215,  0.3611,  0.3567],...

torch/neuron-sdk related package versions:

libneuronxla==0.5.476
neuronx-cc==2.10.0.34+6c8792c6f
neuronx-hwm==2.10.0.5+7b1976adf
torch-neuronx==1.13.1.1.11.0
torch-xla==1.13.1+torchneuronb
torch==1.13.1
torchvision==0.14.1
aws-neuronx-runtime-discovery==2.9

We were expecting the model to be supported, but we're not exactly sure if it is. (I believe the original authors report a link to the code here.) If so, any pointers to what could be going wrong or what else we could try?

aws-taylor commented 1 year ago

Hello @aabbi,

Numeric issues such as this are often related to differences in data types. You can learn more about the data types supported on Neuron cores here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/data-types.html. The Neuron compiler also has options that let you trade off performance against accuracy; you may find some of the options described in https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.html#neuronx-cc-training-mixed-precision helpful.

I'd encourage you to experiment with the --auto-cast and --auto-cast-type compiler arguments and see if there's a sweet spot between accuracy and performance for your application.
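
As a rough sketch of that experiment (reusing the variables from the reproduction script above, and only flag values described in the pages linked above), the sweep could look something like this:

# Sketch: try a few casting configurations and compare each against the CPU output.
candidate_args = [
    ['--auto-cast', 'none'],                                  # full FP32 compute
    ['--auto-cast', 'matmult', '--auto-cast-type', 'bf16'],   # cast matrix multiplies only
    ['--auto-cast', 'all', '--auto-cast-type', 'bf16'],       # most aggressive casting
]

for compiler_args in candidate_args:
    traced = torch_neuronx.trace(model, (input_ids, attention_mask),
                                 compiler_args=compiler_args)
    out = traced(input_ids, attention_mask)
    max_abs_diff = (output_cpu[0] - out[0]).abs().max().item()
    print(compiler_args, "max abs diff vs CPU:", max_abs_diff)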

Regards, Taylor

aabbi commented 1 year ago

Hello @aws-taylor, thanks for getting back to us. If you notice above, we already set --auto-cast to none via this line: kwargs = {'compiler_args': ['--auto-cast', 'none']}. The differences reported above are for that option. (When it's set to something other than none, there are much larger differences, as expected.) Since auto-cast was set to none, we wouldn't have expected differences at the 4th decimal place for as large a proportion of the cases.

mrnikwaws commented 1 year ago

Hi @aabbi,

I'm checking on the accuracy question internally (i.e. how close we expect the numbers to be with auto-cast set to none), but I wanted to confirm whether such a small absolute difference (4.5e-4) actually makes a difference for your application. The high relative difference is likely due to a small absolute value.

Often we care more about overall accuracy on some corpus of input/output pairs (e.g. how well the output tokens match the expected outputs after decoding). Unless you are using the logits directly, the exact values may not matter.
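
For example, one way to quantify the drift on the reproduction above (a sketch only, using the tensors already produced by your script) would be:

import numpy as np

# Sketch: summarize the drift instead of asserting exact equality.
a = output_cpu[0].detach().numpy()
b = output_neuron[0].detach().numpy()
diff = np.abs(a - b)
print("max abs diff: ", diff.max())
print("mean abs diff:", diff.mean())
# The relative difference blows up wherever the reference value itself is near zero,
# which is how a 4.6e-1 relative difference can coexist with a 4.5e-4 absolute one.
print("max rel diff: ", (diff / (np.abs(a) + 1e-12)).max())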

aabbi commented 11 months ago

Hi @mrnikwaws, in this case that small absolute difference did make a difference in the application, which is basically information retrieval. The model in question is used to generate text embeddings. We use it to generate and store embeddings for various pieces of text. On seeing a query, we use the model to generate the embedding for the query, then produce a ranked list of nearest documents by comparing cosine distances between these vectors. That computed distance is what ultimately gets compared.
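
For reference, the ranking step looks roughly like this (a sketch only; embed stands in for whichever model, traced or untraced, produces a pooled sentence embedding):

import torch
import torch.nn.functional as F

# Sketch of the retrieval flow described above; names are illustrative.
def rank_documents(query_text, doc_embeddings, embed):
    query_emb = F.normalize(embed(query_text), dim=-1)   # (1, d) query embedding
    doc_embs = F.normalize(doc_embeddings, dim=-1)       # (n, d) stored document embeddings
    scores = doc_embs @ query_emb.squeeze(0)             # cosine similarity per document
    return torch.argsort(scores, descending=True)        # document indices, best match first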

We had used the untraced model to generate and store the embeddings, but then tried using the traced model at query time and noticed that the rankings of the nearest documents can be drastically different from the rankings generated with the untraced model.

It's possible that if we used the traced model during both phases there would be less difference in the ranks, but we had to give up at this point and it didn't seem worth digging further. It's also possible that the model needs to be customized and can't be traced correctly as is, without any modifications.

aws-taylor commented 5 months ago

Hello @aabbi,

Do you have the name of the operation/operator from which you're seeing precision mismatches, or a reproduction? We can investigate options for teasing more precision out of that operator.

mrnikwaws commented 3 months ago

Hi @aabbi,

In your case I would strongly recommend that you encode your reference and retrieval embeddings in the same way (Neuron + Neuron or CPU + CPU).

Even when forcing FP32 compute (effectively what --auto-cast none does), we can see small differences due to how the numbers are computed. Here is a Stack Overflow question on CPU vs. GPU: https://stackoverflow.com/questions/13937328/division-of-floating-point-numbers-on-gpu-different-from-that-on-cpu - it is about GPUs, but the same logic applies.

When a model is lowered to Neuron (compiled), we carry out scheduling and fusing of the operators in the model. This can make a difference in the floating point outputs (usually by small absolute values). For tasks that involve ranking (think top-k scores or token decoding), this usually makes little to no difference.
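
As a toy illustration (plain NumPy, nothing Neuron-specific), just changing the accumulation order of the same FP32 values already shifts the result slightly:

import numpy as np

# Toy example: identical FP32 inputs, two accumulation orders, slightly different sums.
x = np.random.RandomState(0).randn(10000).astype(np.float32)

s_sequential = np.float32(0.0)
for v in x:
    s_sequential += v                              # accumulate one value at a time

s_blocked = x.reshape(100, 100).sum(axis=1).sum()  # blocked accumulation, as a fused kernel might do

print(s_sequential, s_blocked, float(s_sequential) - float(s_blocked))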

In your case you are comparing scores from two subtly different computation paths while depending on high precision, hence the recommendation to keep both sides on the same path.
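
Concretely, the recommendation amounts to something like the following (a sketch only; it reuses the tokenizer and the traced model saved by your script, and the traced model only accepts the shapes it was traced with, i.e. batch size 1 and sequence length 512):

import torch

# Sketch: one encoder used for BOTH the stored document embeddings and the query
# embeddings, so every score comes from the same computation path.
model_neuron = torch.jit.load('e5_traced_model.pt')

def encode(text):
    batch = tokenizer([text], return_tensors="pt", max_length=512,
                      padding="max_length", truncation=True)
    hidden = model_neuron(batch["input_ids"], batch["attention_mask"])[0]
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling over real tokens
    return torch.nn.functional.normalize(pooled, dim=-1)

doc_emb = encode("passage text goes here")    # index time
query_emb = encode("query text goes here")    # query time, same traced model
score = (doc_emb * query_emb).sum()           # cosine similarity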