aabbi opened this issue 1 year ago
Hello @aabbi,
Numeric issues such as this are often related to differences in data types. You can learn more about the data types supported on Neuron cores here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/data-types.html. The Neuron compiler also has options that let you trade off performance against accuracy. You may find some of the options described in https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.html#neuronx-cc-training-mixed-precision helpful.
I'd encourage you to experiment with the `--auto-cast` and `--auto-cast-type` compiler arguments and see if there's a sweet spot between accuracy and performance for your application.
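For example, a minimal sketch of passing these flags through `torch_neuronx.trace` (the model, inputs, and the specific flag values shown here are illustrative assumptions, not a recommendation for this workload):

```python
import torch
import torch_neuronx

# Placeholder model and input; substitute your own.
model = torch.nn.Linear(16, 16).eval()
example = torch.rand(1, 16)

# Allow matmuls to be auto-cast to bf16 to trade some accuracy for performance.
# See the mixed-precision documentation linked above for the available values.
traced = torch_neuronx.trace(
    model,
    example,
    compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
)
```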
Regards, Taylor
Hello @aws-taylor,
Thanks for getting back to us.
If you notice above, we set `--auto-cast` to `none` via this line:
`kwargs = {'compiler_args': ['--auto-cast', 'none']}`
The differences reported above are for that option. (When set to something other than `none`, there are much larger differences, as expected.)
Since `--auto-cast` was set to `none`, we wouldn't have expected differences at the 4th decimal place for as large a proportion of the cases.
Hi @aabbi,
I'm checking on the accuracy question internally (i.e. how close we expect the numbers to be with `--auto-cast` set to `none`), but I wanted to confirm whether such a small absolute difference (4.5e-4) actually makes a difference for your application. The high relative difference is likely due to a small absolute value.
Often we care more about overall accuracy on some corpus of input/output pairs (e.g. how well the output tokens match expected outputs after decoding). Unless you are using the logits directly, the exact values may not matter.
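To illustrate the point with a synthetic sketch (random logits and an assumed ~1e-4 perturbation, not actual Neuron output): small value differences typically leave the decoded result unchanged.

```python
import torch

torch.manual_seed(0)
cpu_logits = torch.randn(1, 100)                          # reference output
neuron_logits = cpu_logits + 1e-4 * torch.randn(1, 100)   # slightly perturbed copy

# The raw values differ beyond a tight tolerance...
print(torch.allclose(cpu_logits, neuron_logits, atol=1e-5))   # usually False
# ...but the quantity most applications consume (e.g. the argmax token) typically agrees.
print(torch.equal(cpu_logits.argmax(dim=-1), neuron_logits.argmax(dim=-1)))
```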
Hi @mrnikwaws,
In this case that small absolute difference did make a difference in the application, which is basically information retrieval. The model in question is used to generate text embeddings. We use it to generate and store embeddings for various pieces of text; on seeing a query, we generate the embedding for the query and then get a ranked list of nearest documents by comparing cosine distances between these vectors. This computed distance is what is ultimately compared.
We had used the untraced model to generate and store embeddings, but then tried using the traced model at query time, and noticed that the rankings of the nearest documents can be drastically different from the rankings generated with the untraced model.
It's possible that if we used the traced model during both phases there would be less difference in the ranks, but we had to give up at this point and it didn't seem worth digging further. It's also possible that the model needs to be customized and can't be traced as is, or can't be traced correctly without modifications.
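For what it's worth, here is a small synthetic sketch (random vectors; the embedding dimension and drift magnitude are assumptions, not measurements from this issue) of how near-tied cosine scores can reorder when the query embedding shifts by roughly 1e-4:

```python
import numpy as np

def cosine_rank(query_vec, doc_matrix):
    # Rank documents by cosine similarity to the query, highest first.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return np.argsort(-(d @ q))

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 384)).astype(np.float32)        # stored ("untraced") embeddings
query_cpu = rng.normal(size=384).astype(np.float32)           # query embedding, CPU path
query_traced = query_cpu + rng.normal(scale=5e-4, size=384).astype(np.float32)  # simulated drift

print(cosine_rank(query_cpu, docs)[:10])
print(cosine_rank(query_traced, docs)[:10])   # near-tied documents may swap positions
```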
Hello @aabbi,
Do you have the name of the operation/operator from which you're seeing precision mismatches, or a reproduction? We can investigate options for teasing more precision out of that operator.
Hi @aabbi,
In your case I would strongly recommend that you encode your reference and retrieval embeddings in the same way (neuron + neuron or cpu + cpu).
Even when forcing FP32 compute (effectively what `--auto-cast none` does), we can see small differences due to how the numbers are computed. Here is a Stack Overflow question about floating-point division on GPU vs. CPU: https://stackoverflow.com/questions/13937328/division-of-floating-point-numbers-on-gpu-different-from-that-on-cpu. The same logic applies here.
When a model is lowered to Neuron (compiled), we carry out scheduling and fusing of operators in the model. This can change floating-point outputs, usually by small absolute amounts. For tasks that involve ranking (think top-k scores, token-decoding tasks) this usually makes little to no difference.
In your case you are comparing scores from two subtly different computation paths while depending on high precision, hence the recommendation.
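As a standalone illustration (plain NumPy, not Neuron-specific): even when every operation is performed in FP32, changing the order in which a reduction is accumulated typically changes the low-order bits of the result.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000).astype(np.float32)

s_loop = np.float32(0.0)
for v in x:
    s_loop = s_loop + v    # strict left-to-right FP32 accumulation

s_np = np.sum(x)           # NumPy's blocked/pairwise reduction

print(s_loop, s_np, abs(s_loop - s_np))   # usually a small, nonzero difference
```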
We are trying to run the e5 model on an inf2 instance. The model compiles fine and analyze reports no unsupported operators, but when trying it out on an example we see differing output from the CPU version vs. the Neuron-compiled version when compared at the 4th decimal place and above (decimal places 3 and below match). We also tried this on a set of 300 examples and saw a difference in 10% of the outputs. Unfortunately, this is resulting in different output in the application.
Here's the script we are using
and here's the output
torch/neuron-sdk related package versions
We were expecting the model to be supported, but we're not exactly sure that it is. (I believe the original authors link to the code here.) If it is, any pointers to what could be going wrong or what else we could try?
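For reference, the kind of comparison described above can be sketched as follows (placeholder arrays stand in for the CPU and Neuron embeddings; the shapes and drift magnitude are assumptions, and the actual script referenced above is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder outputs; in practice these would come from the untraced model on CPU
# and from the torch_neuronx-traced model on the inf2 instance.
cpu_emb = rng.random((300, 1024)).astype(np.float32)
neuron_emb = cpu_emb + rng.normal(scale=1e-4, size=cpu_emb.shape).astype(np.float32)

# Agreement at 3 decimal places but not at 4, similar to what is described above:
print(np.allclose(cpu_emb, neuron_emb, atol=1e-3))   # True for this drift level
print(np.allclose(cpu_emb, neuron_emb, atol=1e-4))   # typically False
```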