Netflix / metaflow-tools

:rocket: Deployment tools/scripts for Metaflow!
http://www.metaflow.org
Apache License 2.0
52 stars 47 forks source link

Updating Sagemaker Notebooks and small changes to the compute resources #12

Closed tobias-gp closed 3 years ago

tobias-gp commented 4 years ago

Changes to metaflow-cfn-template.yml:

  1. Removed all SageMaker notebook dependencies
  2. Explicitely created the ECS instance role
  3. Changed the compute environment to not hold any machines in standby for cost-reasons
  4. Adding P-Instances to enable GPU compute
  5. Autoformatter removed some trailing spaces :)

New file metaflow-notebook-cfn-template.yml:

  1. Changed the permissions to also execute Metaflow tasks (not just use the client)
  2. Lifecycle config changes: Write config file to $METAFLOW_HOME, shut down notebook after idle time, and automatically pull the Metaflow Tutorials to the Jupyter Notebook home in /home/ec2-user/SageMaker
jsaedtler commented 4 years ago

Looks good to me.

Do you have any kind of testing for your templates at netflix @savingoyal ?

savingoyal commented 4 years ago

@tobias-gp Would it be helpful to make SageMaker setup optional in the first template? That way, consumers will only have to worry about a single template. Other changes look good to me.

@jsaedtler We have been relying on @queueburt's mad cfn skills for testing our templates so far :)

tobias-gp commented 4 years ago

Hi Savin,

Thanks for your feedback! Just to be clear: do you mean (1) a full merge of the two templates and making the setup optional or (2) retaining the two templates with Metaflow stack (+ notebook stack), and notebook stack. This would of course induce some redundancy.

The reason why we separated the two was because of the 1:N relationship of stacks to notebooks. I am fine with any solution, but in that case (1) seems more logical.

tobias-gp commented 4 years ago

Ping :)

savingoyal commented 4 years ago

Oops. Sorry, I thought I replied earlier, but clearly not.

  1. Yes, ideally it would be better to just have a single cloud formation template (with optional Sagemaker setup). We can have separate notebooks only template in a different folder in addition.
  2. We can leave the Default for MinVCPUBatch to 8 and pin the MinValue to 0. The reason, why we chose not to set the default to 0 was because then it invariably leads to higher job start-up times leading to degraded user experience. By default, we can leave it to a non-zero value and our cost-conscious users can override it to 0 (which is currently not the case at the tip of the master.)
savingoyal commented 4 years ago

Also, we are making a few more changes to the permissions in #13 to support SFN.

tobias-gp commented 4 years ago

@savingoyal : Thanks for your feedback.

  1. I have just removed the NB template, which we could do in a separate PR
  2. MinVCPUBatch is now set to 8 with MinValue set to 0.
  3. I have also added a section in Readme.md to discuss potential cost savings

I wanted to also include an automatic extraction of the API key, but ran into a dependency problem with the IAM policy. The reason is that we are operating on a conditional resource, and the creation of the policy itself cannot be conditional.

If you have a better idea, let me know :)

        - PolicyName: GetApiKey
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              Sid: "AllowGetApiKey"
              Effect: Allow
              Action: 
                - "apigateway:GET"
              Resource: !Join ["", ["arn:aws:apigateway:", !Ref "AWS::Region", "::", "/apikeys/", !Ref ApiKey ]]

And the lifecycle config would be:

              # retrieve the API key, if APIBasicAuth is set
              if [[ "${APIBasicAuth}" == "true" ]]; then 
                service_auth_key=`aws apigateway get-api-key --api-key ${ApiKey} --include-value | jq -r '.value'`
                echo "\"METAFLOW_SERVICE_AUTH_KEY\": \"$service_auth_key\"," >> $metaflow_config_file
              fi
savingoyal commented 4 years ago

@queueburt Can you assist on this PR?

tobias-gp commented 4 years ago

Ping? :)

queueburt commented 4 years ago

Ping? :)

Hey Tobias, I have a PR I just finished that includes necessary bits for Step Functions here (#13), but I also incorporated some of your feedback along the way regarding non-singular ECS Task roles and optionality of Sagemaker Notebooks. I'll glance over this PR early next week and try to work out what gaps we need to reconcile between the two, but could I trouble you to do the same for mine? I'd love to combine these both together into one clean result.

tobias-gp commented 4 years ago

@queueburt : Sorry for the slow reply. I agree that bringing in the structural changes first makes sense!