Running a ScriptMode job on SageMaker modifies the order of the provided arguments.
For example, say my script train.py takes two arguments, --config and --batch_size. If the hyperparameters argument on the Estimator class is set to {'config': exp_config_path, 'batch_size': 10}, then I would expect the SageMaker training toolkit to invoke the script as follows:
python train.py --config exp_config_path --batch_size 10
However, the toolkit sorts the hyperparameters alphanumerically before invoking the script, which results in the following invocation:
python train.py --batch_size 10 --config exp_config_path
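To make the difference concrete, here's a small stand-alone sketch of the two behaviours. It does not call the training toolkit and does not model how hyperparameters reach the container; `to_cmd_args` is a hypothetical helper I made up that only mimics "sort the keys, then build the argument list" versus "keep the order of the dict".

```python
def to_cmd_args(hyperparameters, sort_keys):
    """Hypothetical helper: flatten a dict into ['--key', 'value', ...].

    This is an illustration only, not the toolkit's implementation.
    """
    # sorted() on a dict yields its keys in alphanumeric order;
    # list() keeps insertion order (guaranteed since Python 3.7).
    keys = sorted(hyperparameters) if sort_keys else list(hyperparameters)
    args = []
    for key in keys:
        args += [f"--{key}", str(hyperparameters[key])]
    return args


hyperparameters = {"config": "exp_config_path", "batch_size": 10}

sorted_args = to_cmd_args(hyperparameters, sort_keys=True)
ordered_args = to_cmd_args(hyperparameters, sort_keys=False)

print("python train.py " + " ".join(sorted_args))   # what I observe
print("python train.py " + " ".join(ordered_args))  # what I would expect
```

With sorting, `--batch_size` always ends up before `--config`, regardless of how the dict was written.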
This reordering is a problem because the order of the arguments sometimes matters. For example, when using jsonargparse, the two orderings can produce different argument settings. If the order in the script invocation is config > batch_size, then batch_size is first loaded from the config file and subsequently overridden by the command line argument --batch_size 10. On the other hand, if the order is batch_size > config, then the command line value is set first and the batch_size value from the config file overrides it.
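Here's a rough illustration of that order dependence. It uses the standard library's argparse with a hand-written action as a stand-in, not jsonargparse itself; the `LoadConfig` class, the JSON config format, and the value 32 are all assumptions made up for this sketch.

```python
import argparse
import json
import os
import tempfile


class LoadConfig(argparse.Action):
    """Hypothetical stand-in for a config-file option: copies every key
    from a JSON file onto the namespace at the moment the flag is parsed."""

    def __call__(self, parser, namespace, values, option_string=None):
        with open(values) as fh:
            for key, value in json.load(fh).items():
                setattr(namespace, key, value)


def parse(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", action=LoadConfig)
    parser.add_argument("--batch_size", type=int, default=1)
    return parser.parse_args(argv)


# Write a throwaway config file that sets batch_size to 32.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    json.dump({"batch_size": 32}, fh)
    config_path = fh.name

# argparse applies options left to right, so the later one wins.
config_first = parse(["--config", config_path, "--batch_size", "10"])
batch_size_first = parse(["--batch_size", "10", "--config", config_path])
os.remove(config_path)

print(config_first.batch_size)      # command line overrides the file
print(batch_size_first.batch_size)  # file overrides the command line
```

The same two arguments give two different batch sizes depending only on their order, which is why the sorting changes the behaviour of my training script.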
The sorting happens because of a single line: https://github.com/aws/sagemaker-training-toolkit/blob/628166c157751ae2a46fddc11a7a8cac765fb22c/src/sagemaker_training/mapping.py#L78
I don't really see a reason why the hyperparameters should be sorted, so I think it would be safe to remove this sorting. This would be easy to fix by removing the sorting in the line linked above.
Reproducing the bug
Here's a sketch to reproduce the bug:
Running this code will result in the invocation with reordered hyperparameters, which can be seen from the logs:
Instead I would expect the training toolkit to invoke the following command:
python train.py --config exp_config_path --batch_size 10