bnsblue opened 4 years ago
I created an HPO job with the following CRD:

```yaml
apiVersion: sagemaker.aws.amazon.com/v1
kind: HyperparameterTuningJob
metadata:
  name: sm-custom-hpo
spec:
  region: us-east-1
  tags:
    - key: test-key
      value: test-value
  hyperParameterTuningJobConfig:
    strategy: Bayesian
    hyperParameterTuningJobObjective:
      type: Minimize
      metricName: validation:error
    resourceLimits:
      maxNumberOfTrainingJobs: 10
      maxParallelTrainingJobs: 5
    parameterRanges:
      integerParameterRanges:
        - name: num_round
          minValue: '10'
          maxValue: '20'
          scalingType: Linear
      continuousParameterRanges: []
      categoricalParameterRanges: []
    trainingJobEarlyStoppingType: Auto
  trainingJobDefinition:
    staticHyperParameters:
      - name: __FLYTE_ENTRYPOINT_SELECTOR__
        value: "SAGEMAKER"
      - name: base_score
        value: '0.5'
      - name: booster
        value: gbtree
      - name: csv_weights
        value: '0'
      - name: dsplit
        value: row
      - name: grow_policy
        value: depthwise
      - name: lambda_bias
        value: '0.0'
      - name: max_bin
        value: '256'
      - name: max_leaves
        value: '0'
      - name: normalize_type
        value: tree
      - name: objective
        value: reg:linear
      - name: one_drop
        value: '0'
      - name: prob_buffer_row
        value: '1.0'
      - name: process_type
        value: default
      - name: rate_drop
        value: '0.0'
      - name: refresh_leaf
        value: '1'
      - name: sample_type
        value: uniform
      - name: scale_pos_weight
        value: '1.0'
      - name: silent
        value: '0'
      - name: sketch_eps
        value: '0.03'
      - name: skip_drop
        value: '0.0'
      - name: tree_method
        value: auto
      - name: tweedie_variance_power
        value: '1.5'
      - name: updater
        value: grow_colmaker,prune
    algorithmSpecification:
      trainingImage: <image>
      trainingInputMode: File
      metricDefinitions:
        - name: validation:error
          regex: 'validation error'
    roleArn: <roleARN>
    inputDataConfig:
      - channelName: train
        dataSource:
          s3DataSource:
            s3DataType: S3Prefix
            s3Uri: <s3_path>
            s3DataDistributionType: FullyReplicated
        contentType: text/csv
        compressionType: None
        recordWrapperType: None
        inputMode: File
      - channelName: validation
        dataSource:
          s3DataSource:
            s3DataType: S3Prefix
            s3Uri: <s3_path>
            s3DataDistributionType: FullyReplicated
        contentType: text/csv
        compressionType: None
        recordWrapperType: None
        inputMode: File
    outputDataConfig:
      s3OutputPath: s3://my-bucket/xgboost
    resourceConfig:
      instanceType: ml.m4.xlarge
      instanceCount: 1
      volumeSizeInGB: 25
    stoppingCondition:
      maxRuntimeInSeconds: 3600
    enableNetworkIsolation: true
    enableInterContainerTrafficEncryption: false
```
Here is the log of one of the underlying training jobs: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Fsagemaker$252FTrainingJobs/log-events/9c6aeeed28af4dd48d0ff14af57ec168-005-1db03577$252Falgo-1-1597092293

In it you can easily see that SageMaker still passes hyperparameters to the underlying training job in the same way:

```
SM_USER_ARGS=["--__FLYTE_ENTRYPOINT_SELECTOR__","SAGEMAKER","--base_score","0.5","--booster","gbtree","--csv_weights","0","--dsplit","row","--grow_policy","depthwise","--lambda_bias","0.0","--max_bin","256","--max_leaves","0","--normalize_type","tree","--num_round","14","--objective","reg:linear","--one_drop","0","--prob_buffer_row","1.0","--process_type","default","--rate_drop","0.0","--refresh_leaf","1","--sample_type","uniform","--scale_pos_weight","1.0","--silent","0","--sketch_eps","0.03","--skip_drop","0.0","--tree_method","auto","--tweedie_variance_power","1.5","--updater","grow_colmaker,prune"]
...
SM_HP___FLYTE_ENTRYPOINT_SELECTOR__=SAGEMAKER
...
Invoking script with the following command:

/usr/bin/python3 flyte_entrypoint_selector.py --__FLYTE_ENTRYPOINT_SELECTOR__ SAGEMAKER --base_score 0.5 --booster gbtree --csv_weights 0 --dsplit row --grow_policy depthwise --lambda_bias 0.0 --max_bin 256 --max_leaves 0 --normalize_type tree --num_round 14 --objective reg:linear --one_drop 0 --prob_buffer_row 1.0 --process_type default --rate_drop 0.0 --refresh_leaf 1 --sample_type uniform --scale_pos_weight 1.0 --silent 0 --sketch_eps 0.03 --skip_drop 0.0 --tree_method auto --tweedie_variance_power 1.5 --updater grow_colmaker,prune
```
This confirms that the training job underlying an HPO job uses the same interface as a standalone training job: SageMaker writes a map of the hyperparameter names and values to `/opt/ml/input/config/hyperparameters.json` inside your container, and the wrapper script parses that file and passes the hyperparameters to the user script as command-line arguments.
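As a rough sketch of that mechanism (this is illustrative, not SageMaker's actual wrapper code; the file path is the standard mount point, but the function name and logic here are mine):

```python
import json

# Standard path where SageMaker mounts the hyperparameter map in the container.
HYPERPARAMETERS_PATH = "/opt/ml/input/config/hyperparameters.json"

def build_user_args(path=HYPERPARAMETERS_PATH):
    """Turn the hyperparameter map into '--name value' pairs, roughly
    mirroring the SM_USER_ARGS line in the log above."""
    with open(path) as f:
        hyperparameters = json.load(f)  # e.g. {"booster": "gbtree", "num_round": "14"}
    args = []
    for name, value in sorted(hyperparameters.items()):
        args.extend([f"--{name}", str(value)])
    return args
```

For example, a file containing `{"booster": "gbtree", "num_round": "14"}` would yield `["--booster", "gbtree", "--num_round", "14"]`.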
I also found a potential problem while running this experiment: in an HPO job with custom training jobs, if the metrics are not well defined, a training job can hang and never finish.

‼️ This may lead to operational difficulty and extra cost if not handled carefully.
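One thing worth double-checking when the objective metric never seems to be reported: SageMaker extracts the metric value by applying the `regex` from `metricDefinitions` to the training log, and that regex needs a capture group around the numeric value; the `'validation error'` regex in the CRD above has none. A minimal check of such a regex (the log line format here is a hypothetical example of what a training script might print):

```python
import re

# A metric regex with a capture group around the numeric value;
# 'validation error' alone would match but capture no value.
METRIC_REGEX = r"validation error: ([0-9.]+)"

# Hypothetical line the training script might print to stdout.
log_line = "validation error: 0.123"

match = re.search(METRIC_REGEX, log_line)
value = float(match.group(1)) if match else None  # → 0.123
```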
- `HyperparameterTuningJobTask` should copy the `timeout` value from its underlying `CustomTrainingJobTask`
- `outputs` of the underlying `CustomTrainingJobTask` should be embedded into the output interface of `HyperparameterTuningJobTask`
- `inputs.pb`