aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0
2.1k stars 1.14k forks source link

`kms_key` in `sagemaker.processing.Processor` should default to `output_kms_key` #4874

Open admivsn opened 1 month ago

admivsn commented 1 month ago

Describe the bug The kms_key used to encrypt either the user code file or local inputs when uploading to S3 should default to output_kms_key.

This would align the behaviour of with sagemaker.estimator.Estimator where output_kms_key is used to encrypt the tar'd user training code when uploading to S3.

Also, since output_kms_key is resolved from the config it means that kms_key can inherit this default from the config.

To reproduce A clear, step-by-step set of instructions to reproduce the bug. The provided code need to be complete and runnable, if additional data is needed, please include them in the issue.

Expected behavior The kms_key should default to output_kms_key. This can be implemented in either:

Screenshots or logs If applicable, add screenshots or logs to help explain your problem.

System information A description of your system. Please provide:

Additional context Add any other context about the problem here.

cylaceste commented 1 month ago

This is well-thought out! My thoughts are: output_kms_key being applied to the input code doesn't quite make sense from a naming perspective.

And in the estimator case, it's only when the buckets are the same (output and code) that it's applied.

kms_key = self.output_kms_key if code_bucket == output_bucket else None

Probably because it's generally safe to assume if the bucket is the same and the output has some encryption, the same encryption can be applied to the code, whereas not applying it to the code might not work for uploading due to bucket policies. Maybe this logic was added as a fix to get around some common user errors caused by bucket encryption policies.

Actually it might make sense for the estimator to encrypt the code with a kms_keythat is not the output_kms_key.