aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

`RegisterModel` fails with Framework = 'None' when provided with Huggingface Estimator and Model but not ECR image URI directly #4225

Closed Adamwgoh closed 10 months ago

Adamwgoh commented 11 months ago

Describe the bug

RegisterModel fails when image_uri is not provided, even though the HuggingFaceModel specifies every version needed to look up the Hugging Face inference container in AWS ECR.

To reproduce

    model = HuggingFaceModel(
        model_data="s3://finbert-tone/finbert.tar.gz",
        entry_point="./hf_scripts/run_glue.py",
        role=role,
        transformers_version='4.26.0',
        pytorch_version='1.13.1',
        py_version='py39',
    )

    hf_est = HuggingFace(
        model_data="s3://finbert-tone/finbert-tone/modeloutput",
        entry_point='run_glue.py',
        source_dir='./hf_scripts',
        checkpoint_s3_uri="s3://finbert-tone/finbert-tone/modeloutput",
        instance_type='ml.p3.2xlarge',
        instance_count=1,
        role=role,
        transformers_version='4.26.0',
        pytorch_version='1.13.1',
        # image_uri=image_uri,
        py_version='py39',
        sagemaker_session=local_sess,
        hyperparameters=hyperparameters,
    )

    step_register = RegisterModel(
        name="Register-Finbert-Tone-Model",
        estimator=hf_est,
        model=model,
        model_data="s3://my-huggingface-model/model.tar.gz",
        content_types=["text/csv"],
        response_types=["text/csv"],
        inference_instances=["ml.t2.medium", "ml.m5.large"],
        transform_instances=["ml.m5.large"],
        model_package_group_name=model_package_group_name,
        approval_status="PendingManualApproval",
        framework="PYTORCH",
        framework_version="1.13.1",
        entry_point="./hf_scripts/run_glue.py",
        source_dir="./hf_scripts",
    )

    pipeline = Pipeline(
        name=pipeline_name,
        steps=[step_register],
        sagemaker_session=local_sess,
    )
    upsert_response = pipeline.upsert(role_arn=servicerole, description="test")
    print(f"upsert response: {upsert_response}")
    execution = pipeline.start()

Working through the pipeline steps above, I realized that unless I provide an image_uri directly (for example, one obtained from get_huggingface_llm_image_uri), framework and framework_version are not passed through to _RegisterModelStep, which causes an error. When HuggingFaceModel() is used directly, it looks up the correct inference container from the framework, pytorch_version, and so on. However, I can't get the same behavior in this pipeline.

Is there a reason why framework is not passed along when image_uri is not set, in _RegisterModelStep under sagemaker/workflow/_utils.py?

Expected behavior

The error is:

    ValueError: Unsupported base framework: None. You may need to upgrade your SDK version (pip install -U sagemaker) for newer base frameworks. Supported base framework(s): version_aliases, pytorch1.13.1, tensorflow2.11.0.

The expected behavior is for the framework and versions to be resolved for Hugging Face and for the corresponding ECR inference container to be pulled.
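
To make the failure mode concrete, here is a minimal sketch of how a lookup keyed on a base framework string can produce an error of this shape when no framework version reaches it. Everything in it is an assumption for illustration: `TOY_IMAGES`, `base_framework_key`, and `lookup_image` are hypothetical names, the image values are placeholders, and this is not the SDK's actual implementation.

```python
# Hypothetical lookup table: keys mimic the names in the error message,
# values are placeholders rather than real image URIs.
TOY_IMAGES = {
    "pytorch1.13.1": "<placeholder PyTorch inference image>",
    "tensorflow2.11.0": "<placeholder TensorFlow inference image>",
}


def base_framework_key(pytorch_version=None, tensorflow_version=None):
    """Build the lookup key, or None when no framework version is supplied."""
    if pytorch_version is not None:
        return f"pytorch{pytorch_version}"
    if tensorflow_version is not None:
        return f"tensorflow{tensorflow_version}"
    return None


def lookup_image(base_framework):
    """Return the placeholder image for a key, or raise if the key is unknown."""
    if base_framework not in TOY_IMAGES:
        supported = ", ".join(TOY_IMAGES)
        raise ValueError(
            f"Unsupported base framework: {base_framework}. "
            f"Supported base framework(s): {supported}."
        )
    return TOY_IMAGES[base_framework]


# Versions supplied: the key resolves.
image = lookup_image(base_framework_key(pytorch_version="1.13.1"))

# Versions lost along the way: the key is None and the lookup fails.
try:
    lookup_image(base_framework_key())
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

In this toy version the error appears only when neither version argument reaches the key builder, which is the situation the report describes for the case without image_uri.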

qidewenwhen commented 11 months ago

Hi @Adamwgoh, thanks for reaching out! As indicated in this doc, RegisterModel is being deprecated and we no longer support it. Could you try ModelStep and let us know whether it works for your use case?
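
One practical difference with ModelStep is that the model needs a PipelineSession, so that register() hands back step arguments for the pipeline instead of calling the service right away. The toy sketch below illustrates that deferral only. `ToyLiveSession`, `ToyPipelineSession`, and `ToyModelStep` are hypothetical stand-ins written for this example, not SDK classes, and the real SDK's checks and return types will differ.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLiveSession:
    """Stand-in for a regular session: requests are sent immediately."""

    sent: list = field(default_factory=list)

    def register(self, **request):
        self.sent.append(request)       # pretend the API call happens now
        return "toy-model-package-arn"  # a result, not step arguments


class ToyPipelineSession:
    """Stand-in for a pipeline session: requests are captured, not sent."""

    def register(self, **request):
        return {"deferred": True, "request": request}


class ToyModelStep:
    """Stand-in for a pipeline step that only accepts deferred arguments."""

    def __init__(self, name, step_args):
        if not (isinstance(step_args, dict) and step_args.get("deferred")):
            raise ValueError("step_args must come from a pipeline session")
        self.name = name
        self.step_args = step_args


live = ToyLiveSession()
deferred_args = ToyPipelineSession().register(
    approval_status="PendingManualApproval"
)
step = ToyModelStep("MyModelStep", deferred_args)
```

Here the step can be built from `deferred_args`, while the string returned by `live.register(...)` would be rejected, which mirrors why the session type matters in the snippet that follows.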

You may need to slightly update your code to be something like:

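    from sagemaker.huggingface import HuggingFaceModel
    from sagemaker.workflow.model_step import ModelStep
    from sagemaker.workflow.pipeline_context import PipelineSession

    # ModelStep needs a PipelineSession rather than a regular Session.
    pipeline_session = PipelineSession()

    # role and model_package_group_name are defined as in the original report.
    model = HuggingFaceModel(
        model_data="s3://finbert-tone/finbert.tar.gz",
        entry_point="./hf_scripts/run_glue.py",
        role=role,
        transformers_version='4.26.0',
        pytorch_version='1.13.1',
        py_version='py39',
        sagemaker_session=pipeline_session,  # !!! the session must be a PipelineSession object
    )

    # With a PipelineSession, register() returns step arguments for ModelStep.
    step_register_args = model.register(
        content_types=["text/csv"],
        response_types=["text/csv"],
        inference_instances=["ml.t2.medium", "ml.m5.large"],
        transform_instances=["ml.m5.large"],
        model_package_group_name=model_package_group_name,
        approval_status="PendingManualApproval",
        framework="PYTORCH",
        framework_version="1.13.1",
        # any other args you need to pass to the register method
    )

    model_step = ModelStep(
        name="MyModelStep",
        step_args=step_register_args,
    )

martinRenou commented 10 months ago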

@Adamwgoh Did it fix your issue? Shall we close this?

Adamwgoh commented 10 months ago

> @Adamwgoh Did it fix your issue? Shall we close this?

Hi @martinRenou, I have yet to test it. Let me test it by next week and get back to you, if that's okay?

martinRenou commented 10 months ago

Sure! Thank you for getting back to us.

Adamwgoh commented 10 months ago

@martinRenou I tested this and the new ModelStep works. Thanks!