aws / sagemaker-experiments

Experiment tracking and metric logging for Amazon SageMaker notebooks and model training.
Apache License 2.0

Trial Component: Failed to retrieve model package details. #170

Open mouhannadali opened 2 years ago

mouhannadali commented 2 years ago

Describe the bug: After a model is trained and registered, I navigate to the model registry and select the model group name -> model version -> settings. The "Trial Component" row shows "Failed to retrieve model package details". This issue appears only for approved model versions.

To Reproduce: Train and register a model, then approve the model.

Expected behavior: A link to the corresponding "Trial Component" should be shown.

Screenshots: image

Environment: Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): Framework Version: Python Version: CPU or GPU: Python SDK Version: Are you using a custom image:

helinmik commented 2 years ago

I have the same issue when registering the model using SageMaker Pipelines.

mouhannadali commented 1 year ago

Any update here?

tmbluth commented 1 year ago

Environment: SageMaker Studio
Framework: SageMaker LinearLearner
Framework Version: latest as of 2/22/2023
Python Version: 3.7.10
CPU or GPU: CPU
Python SDK Version: 2.131.0
Are you using a custom image: No

I'm seeing the same thing. To create the model I'm using some generic code:

from sagemaker.estimator import Estimator
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

# Generic estimator; all arguments are defined earlier in the notebook
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    output_path=output_path,
    sagemaker_session=sagemaker_session,
    instance_type=instance_type,
    instance_count=instance_count,
    enable_sagemaker_metrics=True,
    volume_kms_key=use_case_kms_key,
    output_kms_key=use_case_kms_key,
    subnets=subnets,
    security_group_ids=security_group,
    enable_network_isolation=enable_network_isolation,
    encrypt_inter_container_traffic=encrypt_inter_container_traffic,
    tags=tags
)

estimator.set_hyperparameters(
    epochs=epochs,
    l1=l1,
    learning_rate=learning_rate,
    predictor_type=predictor_type
)

Experiment.load(experiment_name=experiment_name)
linear_trial = Trial.create(
    trial_name=trial_name,
    experiment_name=experiment_name,
    sagemaker_boto_client=sm_client,
    tags=tags
)

estimator.fit(
    inputs={
        'train': train_input,
        'validation': validation_input,
        'test': test_input
    },
    job_name=base_job_name + '-' + mlops_id,
    experiment_config={
        'TrialName': linear_trial.trial_name,
        'TrialComponentDisplayName': 'training',
    },
    wait=True,
    logs=False,
)

Then the path to the model is saved:

model_uri = f'{output_path}/{estimator.latest_training_job.job_name}/output/model.tar.gz'

Then to register this model I run this code:

response = sm_client.create_model_package(
    ModelPackageGroupName=model_package_group_name,
    ModelPackageDescription='Model registration testing',
    ModelApprovalStatus='PendingManualApproval',
    InferenceSpecification={
        'Containers': [
            {
                'Image': image_uri,
                'ModelDataUrl': model_uri,
                'NearestModelName': model_name
            },
        ],
        'SupportedTransformInstanceTypes': [inference_instance_type],
        'SupportedContentTypes': ['text/csv'],
        'SupportedResponseMIMETypes': ['text/csv']
    },
    CustomerMetadataProperties={
        'train': training_path,
        'validation': validation_path,
        'test': testing_path,
        'experiment_name': experiment_name
    },
)

The model uploads to the registry successfully and its version increments, but I get the same error as the others.

Is this a bug or are we misusing the Model Registry?
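One way to narrow this down is to check whether the training job itself has a trial component attached, independently of the Model Registry view. The following is a minimal sketch, not a confirmed fix: it assumes a boto3 SageMaker client like the `sm_client` used above, and the helper name `find_trial_components_for_job` is hypothetical. It only uses the `describe_training_job` and `list_trial_components` calls.

```python
def find_trial_components_for_job(sm_client, training_job_name):
    """Return the names of trial components whose source is the given training job.

    `sm_client` is assumed to be a boto3 SageMaker client
    (for example boto3.client("sagemaker")). Hypothetical helper for debugging.
    """
    # Resolve the training job name to its ARN.
    job = sm_client.describe_training_job(TrainingJobName=training_job_name)
    job_arn = job["TrainingJobArn"]

    names = []
    kwargs = {"SourceArn": job_arn}
    while True:
        # List trial components that were created from this training job.
        page = sm_client.list_trial_components(**kwargs)
        for summary in page.get("TrialComponentSummaries", []):
            names.append(summary["TrialComponentName"])
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token


# Example call (requires AWS credentials, so it is left commented out):
# find_trial_components_for_job(sm_client, estimator.latest_training_job.job_name)
```

If this returns an empty list, the training job was never associated with a trial component. If it returns a name, the component exists and the missing link is more likely on the registration side.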

tmbluth commented 1 year ago

After playing around quite a bit, I found that registering the model through boto3, as in my comment above, did not automatically link the TrialComponent. When I registered the model through the SageMaker SDK instead, the TrialComponent link appeared:

model_package = linear_learner.register(
    model_package_group_name=model_package_group_name,
    model_name='linear-learner',
    image_uri=image_uri,
    transform_instances=[instance_type],
    content_types=['text/csv'],
    response_types=['text/csv'],
    approval_status='PendingManualApproval',
    customer_metadata_properties={
        'train': training_path,
        'validation': validation_path,
        'test': testing_path,
        'experiment_name': experiment_name
    }
)
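To see what actually differs between a version registered through boto3 and one registered through the SDK, the two `describe_model_package` responses can be compared field by field. This is only a sketch for investigation: the helper name `diff_model_packages` and the default set of ignored keys are my own assumptions, and it makes no claim about which field is responsible for the TrialComponent link.

```python
def diff_model_packages(
    sm_client,
    package_arn_a,
    package_arn_b,
    ignore=(
        "ResponseMetadata",
        "ModelPackageArn",
        "ModelPackageVersion",
        "CreationTime",
        "LastModifiedTime",
    ),
):
    """Return {field: (value_a, value_b)} for top-level fields that differ.

    `sm_client` is assumed to be a boto3 SageMaker client. Fields listed in
    `ignore` always differ between versions, so they are skipped.
    """
    a = sm_client.describe_model_package(ModelPackageName=package_arn_a)
    b = sm_client.describe_model_package(ModelPackageName=package_arn_b)

    keys = (set(a) | set(b)) - set(ignore)
    # A field missing from one response shows up as None on that side.
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}


# Example call (requires AWS credentials, so it is left commented out):
# diff_model_packages(sm_client, boto3_registered_arn, sdk_registered_arn)
```

Here `boto3_registered_arn` and `sdk_registered_arn` are placeholder names for the ARNs of the two model package versions being compared.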