aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Training accuracy issue when using `bf16` for multi-class classification with > 10 labels on Trainium #667

Open philschmid opened 1 year ago

philschmid commented 1 year ago

Hello,

We created an example showing how to fine-tune BERT on the Banking77 dataset, which has 77 labels. It works fine and achieves an f1 score of 0.84 (still 9% lower than on one GPU), but when we activate bf16 the f1 score drops to 0.02 and the output is complete garbage. Likewise, the training loss does not decrease.

How to reproduce:

  1. Start a Trainium instance with either the Hugging Face AMI or the base AMI and install optimum-neuron.
  2. Copy the notebook.
  3. Run through the notebook without changes.
  4. Change the torchrun command to add the bf16 parameter.
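For step 4, the change amounts to appending a flag to the notebook's existing torchrun invocation. A sketch only, assuming the training script forwards a `--bf16` flag to `TrainingArguments` (the exact flag name depends on the script):

```bash
# Hypothetical: same command as the notebook, with bf16 enabled via a flag.
torchrun --nproc_per_node=2 train.py \
  --model_id bert-base-uncased \
  --bf16 true
```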
awsilya commented 1 year ago

Hi @philschmid, thank you for letting us know. We will investigate.

awsilya commented 1 year ago

Hi @philschmid . We are debugging your issue. In the meantime could you try a couple of alternatives:

```python
if training_args is not None:
    if training_args.bf16:
        torch.cuda.is_bf16_supported = lambda: True
        os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "1"
#       training_args.bf16 = False
#       os.environ["XLA_USE_BF16"] = "1"
```
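For background on why the stochastic-rounding flag can matter: bf16 stores only 7 explicit mantissa bits, so small optimizer updates can be truncated away entirely, while stochastic rounding keeps updates correct in expectation. A minimal pure-Python illustration that emulates bf16 by truncating float32 bits (not AWS or Neuron code, just the numerics):

```python
import random
import struct

def to_bf16(x: float) -> float:
    """Truncate a float32 value to bfloat16 (round-toward-zero)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def to_bf16_stochastic(x: float, rng: random.Random) -> float:
    """Round to bfloat16 stochastically: round up with probability
    proportional to the discarded low bits, so rounding is unbiased."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if rng.randrange(1 << 16) < (bits & 0xFFFF):
        bits += 1 << 16  # bump to the next representable bf16 value
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# A small update vanishes entirely under truncation:
print(to_bf16(1.0 + 0.001))  # -> 1.0

# Accumulating many such updates with truncation stalls at 1.0 ...
acc = 1.0
for _ in range(1000):
    acc = to_bf16(acc + 0.001)
print(acc)  # -> 1.0, the updates were all lost

# ... while stochastic rounding keeps the running sum on track:
rng = random.Random(0)
acc_sr = 1.0
for _ in range(1000):
    acc_sr = to_bf16_stochastic(acc_sr + 0.001, rng)
print(acc_sr)  # close to the true sum of 2.0
```

This is the failure mode that grows with the number of classes: with 77 labels the per-class gradient signal is small, so more of it falls below bf16's resolution.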

For example:

```shell
XLA_USE_BF16=true torchrun --nproc_per_node=2 train.py \
  --model_id bert-base-uncased \
  --dataset_path lm_dataset \
  --lr 2e-5 \
  --per_device_train_batch_size 8 \
  --epochs 3
```

```
...........
{'eval_loss': 1.2047382593154907, 'eval_f1': 0.8204884578328744, 'eval_runtime': 11.9553, 'eval_samples_per_second': 257.626, 'eval_steps_per_second': 16.143, 'epoch': 3.0}
```
philschmid commented 1 year ago

Thank you, we will try that, but 0.8204884578328744 is 10% worse than what GPUs get with BF16.