robwhelan opened this issue 3 years ago
This is my command to start the training job:

estimator = PyTorch(
    entry_point="train_deploy.py",
    source_dir="code_chesterton",
    role=role,
    framework_version="1.5",
    py_version="py3",
    instance_count=2,  # this script only supports distributed training on GPU instances
    instance_type="ml.p3.8xlarge",
    debugger_hook_config=False,
)
estimator.fit({"training": inputs_train, "validation": inputs_valid})
In the test script, the following tokenizer call, when invoked while mapping the dataset, changes the type of os.environ from os._Environ to a plain dict:

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

This makes the subsequent get() call fail, because dict.get() does not accept the 'default' keyword argument that os._Environ.get() does.
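Here is a minimal, self-contained sketch of the failure mode (the environment-variable name is only illustrative):

import os

# os.environ is an os._Environ, a MutableMapping whose get() is
# implemented in Python and accepts "default" as a keyword argument.
print(os.environ.get("SMDEBUG_LOG_LEVEL", default="off"))

# A plain dict's get() is a C builtin that takes no keyword arguments,
# so the identical call fails once os.environ is replaced by a dict.
try:
    dict(os.environ).get("SMDEBUG_LOG_LEVEL", default="off")
except TypeError as err:
    print(err)  # get() takes no keyword arguments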
IMO we should file an issue with the transformers package.
We have filed an issue here: https://github.com/huggingface/datasets/issues/2115
I have run into this issue recently. I use the HuggingFace container because I found it is supported on SageMaker. The command is as follows (I referred to this doc about the versions of the HuggingFace container):
estimator = HuggingFace(
    entry_point='train.py',
    role=role,
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    py_version='py36'
)
Later I found this issue is solved in the newest version of the container (thanks to the contributors). After upgrading to sagemaker==2.62.0, we can use:
estimator = HuggingFace(
    entry_point='train.py',
    role=role,
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    transformers_version='4.11.0',
    pytorch_version='1.9.0',
    py_version='py38'
)
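The SDK upgrade itself is just a pip install; here is a quick runtime check (the version pin is taken from the comment above):

# pip install --upgrade "sagemaker>=2.62.0"
import sagemaker

print(sagemaker.__version__)  # expect 2.62.0 or later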
I have been fine-tuning distilbert from the HuggingFace Transformers project. When calling trainer.train(), somewhere smdebug tries to call os.environ.get() and I get the above error. There are no other messages. It affects this line:

/smdebug/core/logger.py", line 51, in get_logger

whether or not I set debugger_hook_config=False.
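In case it helps others: one workaround I have seen suggested (an assumption on my part, not verified on every container version) is to disable smdebug entirely through the USE_SMDEBUG environment variable, since debugger_hook_config=False only turns off the hook configuration and does not stop the container from importing smdebug:

from sagemaker.pytorch import PyTorch

# Sketch reusing the estimator from the first post; the environment
# parameter and the USE_SMDEBUG flag are assumptions to verify against
# your SDK and container versions.
estimator = PyTorch(
    entry_point="train_deploy.py",
    source_dir="code_chesterton",
    role=role,
    framework_version="1.5",
    py_version="py3",
    instance_count=2,
    instance_type="ml.p3.8xlarge",
    debugger_hook_config=False,
    environment={"USE_SMDEBUG": "0"},  # ask the container to skip smdebug hooks
)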