aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

When running the BERT pretraining tutorial, seeing error `RuntimeError: unable to open file <> in read-only mode: No such file or directory` #888

Open jeffhataws opened 1 month ago

jeffhataws commented 1 month ago

When running the BERT pretraining tutorial (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/bert.html#hf-bert-pretraining-tutorial), you may see the following error:


Traceback (most recent call last):
  File "dp_bert_large_hf_pretrain_hdf5.py", line 625, in <module>
    _mp_fn(0, args)
  File "dp_bert_large_hf_pretrain_hdf5.py", line 584, in _mp_fn
    train_bert_hdf5(flags)
  File "dp_bert_large_hf_pretrain_hdf5.py", line 269, in train_bert_hdf5
    model = get_model(flags)
  File "dp_bert_large_hf_pretrain_hdf5.py", line 224, in get_model
    base_model = BertForPreTraining.from_pretrained('bert-large-uncased')
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2301, in from_pretrained
    state_dict = load_state_dict(resolved_archive_file)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/modeling_utils.py", line 402, in load_state_dict
    with safe_open(checkpoint_file, framework="pt") as f:
RuntimeError: unable to open file </home/ubuntu/hf_cache/compute1-dy-kaena-training-0-1/hub/models--bert-large-uncased/snapshots/6da4b6a26a1877e173fca3225479512db81a5e5b/model.safetensors> in read-only mode: No such file or directory (2)

The workaround is to pin the huggingface-hub version to 0.22:

pip install huggingface-hub==0.22
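
To confirm that the pin took effect in the active virtual environment before rerunning the tutorial, a small check along these lines may help. This is only a sketch: the helper names (`major_minor`, `matches_pin`, `installed_hub_version`) are made up for illustration, and it assumes Python 3.8+ so that `importlib.metadata` is available.

```python
import re
from importlib import metadata


def major_minor(version):
    """Return (major, minor) parsed from a version string, or None if unparseable."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def matches_pin(version, pin=(0, 22)):
    """True when the version's major.minor equals the pinned release series."""
    return major_minor(version) == pin


def installed_hub_version():
    """Look up the installed huggingface-hub version; None if it is not installed."""
    try:
        return metadata.version("huggingface-hub")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_hub_version()
    if found is None:
        print("huggingface-hub is not installed in this environment")
    elif matches_pin(found):
        print(f"huggingface-hub {found} matches the 0.22 pin")
    else:
        print(f"huggingface-hub {found} does not match the 0.22 pin")
```

Run it with the same interpreter that launches `dp_bert_large_hf_pretrain_hdf5.py`, so that the check inspects the same site-packages directory shown in the traceback.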