flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

Identify the action items for adapting custom training in HPO #467

Open bnsblue opened 4 years ago

bnsblue commented 4 years ago

I created an HPOJob with the following CRD

apiVersion: sagemaker.aws.amazon.com/v1
kind: HyperparameterTuningJob
metadata:
  name: sm-custom-hpo
spec:
  region: us-east-1
  tags:
    - key: test-key
      value: test-value
  hyperParameterTuningJobConfig:
    strategy: Bayesian
    hyperParameterTuningJobObjective:
      type: Minimize
      metricName: validation:error
    resourceLimits:
      maxNumberOfTrainingJobs: 10
      maxParallelTrainingJobs: 5
    parameterRanges:
      integerParameterRanges:
      - name: num_round
        minValue: '10'
        maxValue: '20'
        scalingType: Linear
      continuousParameterRanges: []
      categoricalParameterRanges: []
    trainingJobEarlyStoppingType: Auto
  trainingJobDefinition:
    staticHyperParameters:
      - name: __FLYTE_ENTRYPOINT_SELECTOR__
        value: "SAGEMAKER"
      - name: base_score
        value: '0.5'
      - name: booster
        value: gbtree
      - name: csv_weights
        value: '0'
      - name: dsplit
        value: row
      - name: grow_policy
        value: depthwise
      - name: lambda_bias
        value: '0.0'
      - name: max_bin
        value: '256'
      - name: max_leaves
        value: '0'
      - name: normalize_type
        value: tree
      - name: objective
        value: reg:linear
      - name: one_drop
        value: '0'
      - name: prob_buffer_row
        value: '1.0'
      - name: process_type
        value: default
      - name: rate_drop
        value: '0.0'
      - name: refresh_leaf
        value: '1'
      - name: sample_type
        value: uniform
      - name: scale_pos_weight
        value: '1.0'
      - name: silent
        value: '0'
      - name: sketch_eps
        value: '0.03'
      - name: skip_drop
        value: '0.0'
      - name: tree_method
        value: auto
      - name: tweedie_variance_power
        value: '1.5'
      - name: updater
        value: grow_colmaker,prune
    algorithmSpecification:
      trainingImage: <image>
      trainingInputMode: File
      metricDefinitions:
      - name: validation:error
        regex: 'validation error'
    roleArn: <roleARN>
    inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: <s3_path>
          s3DataDistributionType: FullyReplicated
      contentType: text/csv
      compressionType: None
      recordWrapperType: None
      inputMode: File
    - channelName: validation
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: <s3_path>
          s3DataDistributionType: FullyReplicated
      contentType: text/csv
      compressionType: None
      recordWrapperType: None
      inputMode: File
    outputDataConfig:
      s3OutputPath: s3://my-bucket/xgboost
    resourceConfig:
      instanceType: ml.m4.xlarge
      instanceCount: 1
      volumeSizeInGB: 25
    stoppingCondition:
      maxRuntimeInSeconds: 3600
    enableNetworkIsolation: true
    enableInterContainerTrafficEncryption: false

and this is the log of one of the underlying trainingjob https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Fsagemaker$252FTrainingJobs/log-events/9c6aeeed28af4dd48d0ff14af57ec168-005-1db03577$252Falgo-1-1597092293

From the logs it is easy to see that SageMaker passes hyperparameters to the underlying training job the same way it does for a standalone training job:

SM_USER_ARGS=["--__FLYTE_ENTRYPOINT_SELECTOR__","SAGEMAKER","--base_score","0.5","--booster","gbtree","--csv_weights","0","--dsplit","row","--grow_policy","depthwise","--lambda_bias","0.0","--max_bin","256","--max_leaves","0","--normalize_type","tree","--num_round","14","--objective","reg:linear","--one_drop","0","--prob_buffer_row","1.0","--process_type","default","--rate_drop","0.0","--refresh_leaf","1","--sample_type","uniform","--scale_pos_weight","1.0","--silent","0","--sketch_eps","0.03","--skip_drop","0.0","--tree_method","auto","--tweedie_variance_power","1.5","--updater","grow_colmaker,prune"]
...
SM_HP___FLYTE_ENTRYPOINT_SELECTOR__=SAGEMAKER
...

Invoking script with the following command:
/usr/bin/python3 flyte_entrypoint_selector.py --__FLYTE_ENTRYPOINT_SELECTOR__ SAGEMAKER --base_score 0.5 --booster gbtree --csv_weights 0 --dsplit row --grow_policy depthwise --lambda_bias 0.0 --max_bin 256 --max_leaves 0 --normalize_type tree --num_round 14 --objective reg:linear --one_drop 0 --prob_buffer_row 1.0 --process_type default --rate_drop 0.0 --refresh_leaf 1 --sample_type uniform --scale_pos_weight 1.0 --silent 0 --sketch_eps 0.03 --skip_drop 0.0 --tree_method auto --tweedie_variance_power 1.5 --updater grow_colmaker,prune

This confirms that the training jobs underlying an HPO job use the same interface as a standalone training job.
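Given that interface, a user script only needs a standard argument parser. A minimal sketch (the hyperparameter names are taken from the log above; `parse_known_args` is used so that injected flags such as `--__FLYTE_ENTRYPOINT_SELECTOR__` do not abort the script):

```python
import argparse


def parse_hyperparameters(argv):
    """Parse SageMaker-style '--key value' hyperparameter flags.

    Returns (known, unknown): known flags as a Namespace, and any
    unrecognized flags (e.g. the injected entrypoint selector) as a list,
    so the script does not crash on flags it did not declare.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_round", type=int, default=10)
    parser.add_argument("--objective", default="reg:linear")
    return parser.parse_known_args(argv)


# Example: a subset of the SM_USER_ARGS shown in the log above.
args, extra = parse_hyperparameters(
    ["--__FLYTE_ENTRYPOINT_SELECTOR__", "SAGEMAKER",
     "--num_round", "14", "--objective", "reg:linear"]
)
```

Here `args.num_round` is `14` and the selector flag ends up in `extra`, untouched.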

OK, now we have to figure out how to pass the hyperparameters to the user's Python function. Are they treated as inputs? If so, do we have to download and manipulate the inputs?

What SageMaker does is write a summarized map of hyperparameter names and values to /opt/ml/input/config/hyperparameters.json inside the container; its wrapper script then parses that file and passes the hyperparameters to the user script as command-line arguments.
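That translation step can be sketched as follows (a minimal illustration, not the actual SageMaker wrapper; the path is the documented location of the merged static + tuned hyperparameters):

```python
import json

# Documented location where the SageMaker container writes the merged
# (static + tuned) hyperparameters as a flat JSON string-to-string map.
HYPERPARAMETERS_PATH = "/opt/ml/input/config/hyperparameters.json"


def hyperparameters_to_argv(path=HYPERPARAMETERS_PATH):
    """Flatten the hyperparameters JSON map into '--key value' CLI
    arguments, mirroring what the wrapper does before invoking the
    user script. Sorted for a deterministic argument order."""
    with open(path) as f:
        hyperparameters = json.load(f)
    argv = []
    for key, value in sorted(hyperparameters.items()):
        argv += ["--" + key, str(value)]
    return argv
```

The resulting list can be appended directly to the interpreter invocation shown in the log above.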

bnsblue commented 4 years ago

I found a potential problem while experimenting: in HPO with custom training jobs, if the metrics are not well defined, a training job can hang and never terminate.
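One likely culprit is the metric definition itself: to my understanding, SageMaker requires the metric regex to contain a capture group that extracts the numeric value from the training logs, and a pattern without one (like the bare `'validation error'` in the CRD above) produces no metric for the tuner to act on. A corrected definition would look something like this (the exact log format, and hence the pattern, is an assumption):

```yaml
metricDefinitions:
- name: validation:error
  # The parenthesized capture group extracts the numeric value;
  # without it, SageMaker cannot record the objective metric.
  regex: 'validation-error:([0-9\.]+)'
```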

‼️ This may lead to potential operational difficulty and extra cost if not handled carefully.

github-actions[bot] commented 1 year ago

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions[bot] commented 1 year ago

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions[bot] commented 1 month ago

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. Thank you for your contribution and understanding! 🙏