aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

Can't use `record_set()` to create data for RCF "test" channel #2925

Closed phschimm closed 9 months ago

phschimm commented 2 years ago

Describe the bug The method sagemaker.RandomCutForest.record_set() can't be used to create a RecordSet for the "test" channel of the RCF algorithm: the resulting RecordSet always requests the "ShardedByS3Key" S3 data distribution, which the RCF container rejects for the test channel.

To reproduce Configure a RandomCutForest estimator and try fitting it to data ingested via record_set(..., channel='test'):

from sagemaker import RandomCutForest

rcf = RandomCutForest(
    role=execution_role,
    instance_count=1,
    instance_type='ml.m5.large',
    data_location=f's3://{bucket}/{prefix}/',
    output_path=f's3://{bucket}/{prefix}/output',
    num_samples_per_tree=512,
    num_trees=50,
    base_job_name=base_job_name,
    eval_metrics=['accuracy', 'precision_recall_fscore']
)

test_set = rcf.record_set(
    features,
    labels=labels,
    channel='test' # breaking
)

rcf.fit(test_set)

Expected behavior A RecordSet returned by record_set(..., channel='test') should have "S3DataDistributionType": "FullyReplicated".

Screenshots or logs


Docker entrypoint called with argument(s): train
Running default environment configuration script
[02/09/2022 18:27:59 INFO 140001573062464] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-conf.json: {'num_samples_per_tree': 256, 'num_trees': 100, 'force_dense': 'true', 'eval_metrics': ['accuracy', 'precision_recall_fscore'], 'epochs': 1, 'mini_batch_size': 1000, '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': '', '_ftp_port': 8999}
[02/09/2022 18:27:59 INFO 140001573062464] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {'num_trees': '563', 'num_samples_per_tree': '125', 'feature_dim': '71', '_tuning_objective_metric': 'test:f1', 'eval_metrics': '["accuracy", "precision_recall_fscore"]', 'mini_batch_size': '1000'}
[02/09/2022 18:27:59 INFO 140001573062464] Final configuration: {'num_samples_per_tree': '125', 'num_trees': '563', 'force_dense': 'true', 'eval_metrics': '["accuracy", "precision_recall_fscore"]', 'epochs': 1, 'mini_batch_size': '1000', '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': 'test:f1', '_ftp_port': 8999, 'feature_dim': '71'}
[02/09/2022 18:27:59 ERROR 140001573062464] Customer Error: Unable to initialize the algorithm. Failed to validate input data configuration. (caused by ValidationError)
Caused by: 'ShardedByS3Key' is not one of ['FullyReplicated']
Failed validating 'enum' in schema['properties']['test']['properties']['S3DistributionType']:
    {'enum': ['FullyReplicated'], 'type': 'string'}
On instance['test']['S3DistributionType']:
    'ShardedByS3Key'


Additional context The S3 data distribution type is hardcoded to "ShardedByS3Key" in the RecordSet class that record_set() uses:

https://github.com/aws/sagemaker-python-sdk/blob/2ebba8a454de03a2bc49267c91dbacddd6183585/src/sagemaker/amazon/amazon_estimator.py#L340

@mufaddal-rohawala @jeniyat or anyone else: In the meantime, is there any other way to create a RecordSet for RCF from NumPy data?
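One possible workaround (a sketch, not an official API) is to subclass RecordSet and override the method that hardcodes the distribution type. The stub classes below stand in for sagemaker.amazon.amazon_estimator.RecordSet and sagemaker.inputs.TrainingInput so the pattern can be shown self-contained; against the real SDK you would subclass the real RecordSet instead, and the exact method name and constructor signature should be checked against the SDK source linked above.

```python
class TrainingInput:
    # Stand-in for sagemaker.inputs.TrainingInput: builds the channel's
    # S3DataSource configuration.
    def __init__(self, s3_data, distribution, s3_data_type):
        self.config = {
            "S3DataSource": {
                "S3Uri": s3_data,
                "S3DataDistributionType": distribution,
                "S3DataType": s3_data_type,
            }
        }


class RecordSet:
    # Stand-in mirroring the SDK class. The SDK hardcodes the sharded
    # distribution when it turns a RecordSet into a training input.
    def __init__(self, s3_data, num_records, feature_dim,
                 channel="train", s3_data_type="ManifestFile"):
        self.s3_data = s3_data
        self.num_records = num_records
        self.feature_dim = feature_dim
        self.channel = channel
        self.s3_data_type = s3_data_type

    def records_s3_input(self):
        # This is the hardcoded value the RCF container rejects for "test".
        return TrainingInput(self.s3_data, "ShardedByS3Key", self.s3_data_type)


class FullyReplicatedRecordSet(RecordSet):
    """Hypothetical subclass overriding the hardcoded distribution so the
    'test' channel passes the RCF container's input validation."""

    def records_s3_input(self):
        return TrainingInput(self.s3_data, "FullyReplicated", self.s3_data_type)


# Re-wrap a RecordSet returned by record_set(..., channel='test') in the
# subclass (bucket/prefix values here are placeholders):
test_set = FullyReplicatedRecordSet("s3://bucket/prefix/test.pbr",
                                    num_records=1000, feature_dim=71,
                                    channel="test")
print(test_set.records_s3_input().config["S3DataSource"]["S3DataDistributionType"])
```

With the real SDK, the idea would be to upload the data once via record_set(), copy its s3_data, num_records, and feature_dim into the subclass, and pass the result to fit() alongside the train channel.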

phschimm commented 2 years ago

I've posted a more detailed investigation about this problem on StackOverflow: https://stackoverflow.com/questions/71053554/why-can-random-cut-forests-record-set-method-for-data-conversion-upload-not

Can someone identify which SDK version was used in that post?

If I had that information, I could downgrade my notebook instance, execute my experiments, and get the quality metrics I need.

natbukowski commented 11 months ago

Hello, I am also experiencing this issue and wanted to know if there is any workaround for this problem?