aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Different results compared to Pytorch model #376

Closed bedapisl closed 2 years ago

bedapisl commented 2 years ago

Hello, I converted my model to Neuron, but it now gives different results than the original PyTorch model. The output probabilities differ, and the final predictions change in about 10% of cases. Example code:

    model = ...           # wrapper object; model.model is the PyTorch module shown below
    example_input = ...   # tuple of (input_ids, attention_mask) tensors

    print(example_input)

    # Compile the underlying module for Inferentia.
    neuron_model = torch.neuron.trace(model.model, example_input, dynamic_batch_size=True)

    # Run the same inputs through the compiled model and the original PyTorch model.
    pred1 = neuron_model(example_input[0], example_input[1])
    pred2 = model.model(example_input[0], example_input[1])

    print(pred1)
    print(pred2)

which gives the following output:

(tensor([[  101, 69863, 21799, 13050, 10126, 11856,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0]]), tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]]))
INFO:Neuron:There are 2 ops of 1 different types in the TorchScript that are not compiled by neuron-cc: aten::embedding, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-pytorch.md)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 261, fused = 254, percent fused = 97.32%
WARNING:tensorflow:From /home/pislb/.local/lib/python3.6/site-packages/torch_neuron/ops/aten.py:1330: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
INFO:Neuron:Compiling function _NeuronGraph$267 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/pislb/.local/bin/neuron-cc compile /tmp/tmp86q9o3u0/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmp86q9o3u0/graph_def.neff --io-config {"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 128], "int64"]}, "outputs": ["Softmax_6:0", "Add_60:0"]} --verbose 35'
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
....Analyzing dependencies of sg00/Block1
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Warning 2: scheduling level for block downgraded to 1.
.12/13/2021 01:31:43 PM INFO [Stargazer]: Generating Arch 'Inferentia-1.0'
12/13/2021 01:31:43 PM INFO [Stargazer]: INFO: Pre SG DRAM bytes loaded or saved 139513370
12/13/2021 01:31:43 PM INFO [Stargazer]: INFO: Pre SG average DMA size 4937 bytes
12/13/2021 01:31:43 PM INFO [Stargazer]: Num Loads in Func = 1108
12/13/2021 01:31:43 PM INFO [Stargazer]: Num Saves in Func = 104
12/13/2021 01:31:43 PM INFO [Stargazer]: Num Input Loads in Func= 234
12/13/2021 01:31:43 PM INFO [Stargazer]: Num Output Saves in Func= 2
12/13/2021 01:31:43 PM INFO [Stargazer]: Num Spill Loads in Func= 874
12/13/2021 01:31:43 PM INFO [Stargazer]: Num Spill Saves in Func= 102
12/13/2021 01:31:46 PM INFO [Stargazer]: Wavegraph code generation for Inferentia:
12/13/2021 01:31:46 PM INFO [Stargazer]:     Engine              File
12/13/2021 01:31:46 PM INFO [Stargazer]:     ------              ----
12/13/2021 01:31:46 PM INFO [Stargazer]:     PE-Array            pe.bin
12/13/2021 01:31:46 PM INFO [Stargazer]:     Pool-Eng            pool.bin
12/13/2021 01:31:46 PM INFO [Stargazer]:     Act-Eng             act.bin
12/13/2021 01:31:46 PM INFO [Stargazer]: 
12/13/2021 01:31:47 PM INFO [Stargazer]: Fixing data race is 0
12/13/2021 01:31:47 PM INFO [Stargazer]: Data race checker engines
12/13/2021 01:31:47 PM INFO [Stargazer]: [Sailfish] Data race analysis initially
12/13/2021 01:31:49 PM INFO [Stargazer]: [Sailfish] Data race analysis found no races, run time: 0:00:02
12/13/2021 01:31:49 PM INFO [Stargazer]: [Sailfish] Remove redundant edges
12/13/2021 01:31:49 PM INFO [Stargazer]: Data race checker engines
12/13/2021 01:31:50 PM INFO [Stargazer]: Transitive reduction start 
12/13/2021 01:31:50 PM INFO [Stargazer]: Transitive reduction removed 34 redundant edges, time: 0:00:00
12/13/2021 01:31:50 PM INFO [Stargazer]: Sync Critical Load Chains Start
12/13/2021 01:31:50 PM DEBUG [Stargazer]: SyncCritLoads buildLoadGraph Start...
12/13/2021 01:31:50 PM DEBUG [Stargazer]: SyncCritLoads buildLoadGraph Done.
12/13/2021 01:31:50 PM DEBUG [Stargazer]: Load Graph NumRoots; 1
12/13/2021 01:31:50 PM INFO [Stargazer]: Sync Critical Load Chains added 19 new Load-2-Load syncs
12/13/2021 01:31:50 PM INFO [Stargazer]: Sync Critical Load Chains Done.0:00:00
12/13/2021 01:31:53 PM WARNING [Stargazer]: SBUF DMA write size != 0 mod 4: SBUF address=0x98296, size=2
12/13/2021 01:31:53 PM INFO [Stargazer]: Out wavegraph bin file is wavegraph-bin.json
12/13/2021 01:31:53 PM INFO [Stargazer]: Writing NN JSON to file 'wavegraph-bin.json'
.12/13/2021 01:32:01 PM INFO [Stargazer]: Virtual memory peak = 6868604 K bytes
12/13/2021 01:32:01 PM INFO [Stargazer]: PASSED - Total time: 0:00:17
......
Compiler status PASS
INFO:Neuron:Number of arithmetic operators (post-compilation) before = 261, compiled = 254, percent compiled = 97.32%
INFO:Neuron:The neuron partitioner created 1 sub-graphs
INFO:Neuron:Neuron successfully compiled 1 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 100.0%
INFO:Neuron:Compiled these operators (and operator counts) to Neuron:
INFO:Neuron: => aten::Int: 39
INFO:Neuron: => aten::add: 12
INFO:Neuron: => aten::contiguous: 6
INFO:Neuron: => aten::div: 7
INFO:Neuron: => aten::dropout: 13
INFO:Neuron: => aten::eq: 6
INFO:Neuron: => aten::expand: 1
INFO:Neuron: => aten::expand_as: 6
INFO:Neuron: => aten::gelu: 6
INFO:Neuron: => aten::layer_norm: 13
INFO:Neuron: => aten::linear: 37
INFO:Neuron: => aten::masked_fill: 6
INFO:Neuron: => aten::matmul: 12
INFO:Neuron: => aten::mul: 1
INFO:Neuron: => aten::size: 15
INFO:Neuron: => aten::softmax: 7
INFO:Neuron: => aten::sum: 2
INFO:Neuron: => aten::to: 4
INFO:Neuron: => aten::transpose: 30
INFO:Neuron: => aten::unsqueeze: 1
INFO:Neuron: => aten::view: 30
INFO:Neuron:Not compiled operators (and operator counts) to Neuron:
INFO:Neuron: => aten::Int: 1 [supported]
INFO:Neuron: => aten::add: 1 [supported]
INFO:Neuron: => aten::embedding: 2 [not supported]
INFO:Neuron: => aten::size: 1 [supported]
INFO:Neuron: => aten::slice: 2 [supported]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
(tensor([[0.3261, 0.3678, 0.3075]]), tensor([[ 0.0140,  0.1357, -0.0447]]))
(tensor([[0.1643, 0.6104, 0.2253]], grad_fn=<SoftmaxBackward>), tensor([[-0.5786,  0.7338, -0.2629]], grad_fn=<AddmmBackward>))

Note how different the model outputs on the last two lines are. Neuron gives probabilities [0.3261, 0.3678, 0.3075], but PyTorch gives [0.1643, 0.6104, 0.2253]. The model is DistilBERT with an added linear layer:

import torch
from transformers import AutoModel


class ClassifierNetwork(torch.nn.Module):

    def __init__(self, model_name):
        super(ClassifierNetwork, self).__init__()

        if model_name == 'distilbert-base-multilingual-cased':
            hidden_dim = 768
        elif model_name == 'nboost/pt-tinybert-msmarco':
            hidden_dim = 312
        elif model_name == 'bert-base-multilingual-uncased':
            hidden_dim = 768

        self.transformer = AutoModel.from_pretrained(model_name)
        self.output_layer = torch.nn.Linear(hidden_dim, 3)
        self.model_name = model_name

    def forward(self, input_ids, attention_mask):
        if self.model_name == 'distilbert-base-multilingual-cased':
            all_outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask, return_dict=False)[0]
        else:
            raise ValueError(f'Unsupported model name {self.model_name}.')

        mask = torch.unsqueeze(attention_mask, 2).expand(all_outputs.shape)
        masked_all_outputs = (mask * all_outputs).float()

        average_output = torch.sum(masked_all_outputs, 1).float() / torch.sum(mask, 1).float()

        average_output = average_output.float()

        logits = self.output_layer(average_output)

        probabilities = torch.nn.functional.softmax(logits, dim=1)
        return probabilities, logits
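
A minimal sketch of quantifying the mismatch over more inputs, assuming the same `model` and `neuron_model` objects as above, where each example is a `(input_ids, attention_mask)` pair with batch size 1:

    import torch

    def compare_outputs(torch_model, neuron_model, examples):
        """Report the largest probability difference and the prediction mismatch rate."""
        max_abs_diff = 0.0
        mismatches = 0
        with torch.no_grad():
            for input_ids, attention_mask in examples:
                neuron_probs, _ = neuron_model(input_ids, attention_mask)
                torch_probs, _ = torch_model(input_ids, attention_mask)
                # Largest absolute per-class probability difference seen so far.
                max_abs_diff = max(max_abs_diff, (neuron_probs - torch_probs).abs().max().item())
                # Count inputs (batch size 1 each) where the argmax prediction changes.
                mismatches += int(neuron_probs.argmax(dim=1).item() != torch_probs.argmax(dim=1).item())
        print(f"max |probability difference| = {max_abs_diff:.4f}")
        print(f"prediction mismatch rate = {mismatches / len(examples):.1%}")

    # e.g. compare_outputs(model.model, neuron_model, examples)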
aws-taylor commented 2 years ago

Hello @bedapisl,

We are investigating this issue. In the meantime, I would point you to the --fast-math flag (documented here). This flag allows you to control the tradeoff between performance and accuracy for fp32 operators.
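
A minimal sketch of passing the flag through the `compiler_args` argument of the torch-neuron trace call shown earlier; the exact values accepted by --fast-math are described in the linked documentation, and `'none'` is assumed here to favor fp32 accuracy over performance:

    import torch
    import torch_neuron  # registers the torch.neuron namespace

    # Forward neuron-cc flags through tracing.
    neuron_model = torch.neuron.trace(
        model.model,
        example_input,
        dynamic_batch_size=True,
        compiler_args=['--fast-math', 'none'],
    )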

We will update this issue as we learn more.

-Taylor

shebbur-aws commented 2 years ago

Hello @bedapisl,

We were able to reproduce and root-cause the issue. We are currently working on a fix and will update once it is fixed and tested.

Thanks, Shruthi.

shebbur-aws commented 2 years ago

Hello @bedapisl,

We have implemented a fix for the aten::mul operator translation that will resolve the numerical mismatch issues you are seeing. The fix is expected in one of our upcoming releases.

Thanks, Shruthi.
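
Until the fix ships, a possible interim workaround is to keep the elementwise multiply in float throughout, assuming the mismatch is tied to the mixed int64/float32 `mask * all_outputs` product (an assumption, not confirmed above). A sketch of a drop-in replacement for `ClassifierNetwork.forward`:

    def forward(self, input_ids, attention_mask):
        all_outputs = self.transformer(
            input_ids=input_ids, attention_mask=attention_mask, return_dict=False
        )[0]

        # Cast the int64 attention mask to float32 before the elementwise multiply so
        # aten::mul only sees float operands. Whether this sidesteps the affected
        # translation path is an assumption, not confirmed in this thread.
        mask = attention_mask.unsqueeze(2).expand(all_outputs.shape).float()
        masked_all_outputs = mask * all_outputs

        average_output = masked_all_outputs.sum(dim=1) / mask.sum(dim=1)

        logits = self.output_layer(average_output)
        probabilities = torch.nn.functional.softmax(logits, dim=1)
        return probabilities, logits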

aws-joshim commented 2 years ago

@bedapisl we will update this ticket with the details of the release that includes the fix.

awsrjh commented 2 years ago

Hi

This has been resolved in the recent 1.17.0 release.

Please find more information here:

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#neuron-1-17-0-01-20-2022