aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

[Bug Report] Model Evaluation in abalone HPO example, for 2 best models #3222

Open edesz opened 2 years ago

edesz commented 2 years ago

Link to the notebook

HPO Notebook: https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-pipelines/tabular/tuning-step/sagemaker-pipelines-tuning-step.ipynb

This is a really useful notebook (especially for beginners). Thanks for putting this together!

Describe the bug

I am trying to evaluate multiple models in the model evaluation step of the abalone pipeline in this notebook.

In this notebook, the top 2 models are created in the step before model evaluation (see best_model = Model(...) and second_best_model = Model(...)).

I want to evaluate both of these models. Currently, the notebook only evaluates the best model. There is a comment in the notebook

# This can be extended to evaluate multiple models from the HPO step

I am trying to modify the model evaluation step of the pipeline to do this.
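
For reference, the two model objects are built from the tuning step's top-k artifacts roughly like the sketch below (paraphrased from memory, not the notebook's exact code; image_uri, role, sagemaker_session, and model_bucket_key are assumed to be defined in earlier cells of the notebook):

from sagemaker.model import Model

# Paraphrased sketch: wrap the artifacts of the two best training jobs
# from the tuning step in Model objects.
best_model = Model(
    image_uri=image_uri,  # assumed defined earlier in the notebook
    model_data=step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=model_bucket_key),
    sagemaker_session=sagemaker_session,  # assumed defined earlier in the notebook
    role=role,  # assumed defined earlier in the notebook
)
second_best_model = Model(
    image_uri=image_uri,
    model_data=step_tuning.get_top_model_s3_uri(top_k=1, s3_bucket=model_bucket_key),
    sagemaker_session=sagemaker_session,
    role=role,
)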

To reproduce

In order to evaluate 2 models, I changed the following block of code

step_eval = ProcessingStep(
    name="EvaluateTopModel",
    processor=script_eval,
    inputs=[
        ProcessingInput(
            source=step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=model_bucket_key),
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="evaluate.py",
    property_files=[evaluation_report],
    cache_config=cache_config,
)

I replaced the above block of code by the following

inputs=[
        ProcessingInput(
            source=step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=model_bucket_key),
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=step_tuning.get_top_model_s3_uri(top_k=1, s3_bucket=model_bucket_key),
            destination="/opt/ml/processing/model",
        ),

since k = 1 will get the second-best model. In evaluate.py, I replaced

model_path = "/opt/ml/processing/model/model.tar.gz"
with tarfile.open(model_path) as tar:
    tar.extractall(path=".")
logger.debug("Loading xgboost model.")
model = pickle.load(open("xgboost-model", "rb"))

by

model1_path = "/opt/ml/processing/model/model.tar.gz"
with tarfile.open(model1_path) as tar:
    tar.extractall(path="./model1")

model2_path = "/opt/ml/processing/model/model.tar.gz"
with tarfile.open(model2_path) as tar:
    tar.extractall(path="./model2")

logger.debug("Loading xgboost model.")
model1 = pickle.load(open("model1/xgboost-model", "rb"))
model2 = pickle.load(open("model2/xgboost-model", "rb"))
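
After loading both models, the rest of evaluate.py would need to score each of them against the same test set and write both sets of metrics into one report. This is the rough sketch I have in mind (the test-set layout follows the original evaluate.py, with the label in the first column; the report keys are my own naming, so anything downstream that reads the report, such as the condition step, would need to be adjusted to match):

import json
import pathlib

import pandas as pd
import xgboost
from sklearn.metrics import mean_squared_error

# Score both models on the same held-out test set (label is the first column).
test_df = pd.read_csv("/opt/ml/processing/test/test.csv", header=None)
y_test = test_df.iloc[:, 0].to_numpy()
X_test = xgboost.DMatrix(test_df.iloc[:, 1:].values)

report_dict = {"regression_metrics": {}}
for name, model in [("best_model", model1), ("second_best_model", model2)]:
    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    report_dict["regression_metrics"][f"{name}_mse"] = {"value": float(mse)}

# Write a single evaluation report containing both models' metrics.
output_dir = "/opt/ml/processing/evaluation"
pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)
with open(f"{output_dir}/evaluation.json", "w") as f:
    f.write(json.dumps(report_dict))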

Logs

I first tried to see if I could simply change k=0 to k=1 (with no other changes anywhere) and get the pipeline to run successfully. When I run the pipeline by

>>> execution = pipeline.start()
>>> execution.list_steps()

I get the following error in the Model Evaluation step:

{'StepName': 'EvaluateTopModel',
  'StartTime': datetime.datetime(...),
  'EndTime': datetime.datetime(...),
  'StepStatus': 'Failed',
  'AttemptCount': 0,
  'FailureReason': 'ClientError: Cannot access S3 key.',
  'Metadata': ...}

This suggests to me that it cannot access the s3 key for k=1. This is confusing since 2 models are clearly being created in the previous steps of the pipeline (step_create_first and step_create_second).

Second, when I run the pipeline with the exact modifications I have shown in the To reproduce section above, I get an error about not being able to handle duplicate keys. Presumably, it is having trouble with destination="/opt/ml/processing/model" being the same in both step inputs (k=0 and k=1).

Question

I seem to be having some trouble adding a second model to the model evaluation step of this pipeline. How can I modify this pipeline to evaluate the top 2 models?
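
In case it helps, this is the direction I was thinking of: give each model a distinct destination directory (the paths below are made up) and update evaluate.py to read model.tar.gz from each one. I have not verified this sketch:

inputs=[
    ProcessingInput(
        source=step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=model_bucket_key),
        destination="/opt/ml/processing/model1",  # hypothetical: one directory per model
    ),
    ProcessingInput(
        source=step_tuning.get_top_model_s3_uri(top_k=1, s3_bucket=model_bucket_key),
        destination="/opt/ml/processing/model2",  # hypothetical: one directory per model
    ),
    ProcessingInput(
        source=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
        destination="/opt/ml/processing/test",
    ),
],

With that, model1_path and model2_path in evaluate.py would become "/opt/ml/processing/model1/model.tar.gz" and "/opt/ml/processing/model2/model.tar.gz" respectively.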

friendlywookiee commented 1 year ago

I ran into the same issue. However, I do not think this is related to evaluating several models. I get the following FailureReason:

'FailureReason': 'ClientError: Cannot access S3 key: bucket-name/key-of-json-file'

What the error shows is neither an s3 URI, which would start with s3://, nor an s3 key, which would not contain the bucket name. This is strange, from my point of view.

Another interesting fact is that the processing job itself succeeded. However, the pipeline step corresponding to that processing job failed.

friendlywookiee commented 1 year ago

For me, it turned out to be a permission error. The pipeline role (not the job role) could not read the s3 key, so I added a policy that allows s3:GetObject on the resource bucket-name/* (bucket-name alone does not work because it does not cover the objects inside the bucket).
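
A statement along those lines would look roughly like this (bucket-name is a placeholder; this shows the shape of the permission, not the exact policy I used):

{
    "Sid": "AllowPipelineRoleToReadObjects",
    "Effect": "Allow",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::bucket-name/*"
}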

stegarth commented 11 months ago

@friendlywookiee I'm encountering the same error. Did the permissions update resolve the issue for you? If so, what were the specific changes to the policy you made?

Thanks for your help!

friendlywookiee commented 11 months ago

@stegarth

I added this policy. For obvious reasons it's not the optimal approach regarding least-privilege.

{
    "Sid": "VisualEditor1",
    "Effect": "Allow",
    "Action": "s3:*",
    "Resource": "arn:aws:s3:::*/*"
}