aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0
2.11k stars 1.14k forks

sagemaker.model_monitor.DefaultModelMonitor suggest_baseline is not able to read Japanese text #4822

Closed johansew closed 3 months ago

johansew commented 3 months ago

Describe the bug When creating statistics and constraints with DefaultModelMonitor.suggest_baseline for a UTF-8 encoded CSV containing Japanese text, the column names and categorical values all appear as ????? in the output JSON, making it unusable.

To reproduce Create a UTF-8 encoded CSV dataset with Japanese column names and Japanese categorical values, then run suggest_baseline on it.
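The issue does not include the dataset itself, so the following is only a minimal sketch of how such a file could be produced. The column names and values are hypothetical, chosen for illustration; the file name matches the one passed to suggest_baseline in the report. Reading the header back confirms the file is valid UTF-8 before it is handed to the monitor.

```python
import csv
from pathlib import Path

# Hypothetical column names and values, not taken from the original report.
HEADER = ["年齢", "都道府県", "購入回数"]
ROWS = [
    [34, "東京都", 5],
    [51, "大阪府", 2],
    [27, "北海道", 9],
]


def write_baseline_csv(path: str) -> Path:
    """Write a small UTF-8 CSV with Japanese headers and categorical values."""
    out = Path(path)
    with out.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(ROWS)
    return out


def read_header(path: str) -> list[str]:
    """Read the header row back, decoding the file as UTF-8."""
    with open(path, encoding="utf-8", newline="") as f:
        return next(csv.reader(f))


if __name__ == "__main__":
    csv_path = write_baseline_csv("baselining_data_set.csv")
    print(read_header(str(csv_path)))
```

If the header prints correctly here, the source file is not where the characters are being lost.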

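from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# role (an IAM role ARN) and output_s3_uri are assumed to be defined beforehand.
my_default_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)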

my_default_monitor.suggest_baseline(
    baseline_dataset="baselining_data_set.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=output_s3_uri,
)

Check the generated statistics.json and constraints.json. Both show ?????? in place of the Japanese text:

{
  "version" : 0.0,
  "features" : [ {
    "name" : "????",
    "inferred_type" : "Integral",
    "completeness" : 1.0,
    "num_constraints" : {
      "is_non_negative" : true
    }
  }, {
    "name" : "???????",
    "inferred_type" : "Integral",
    "completeness" : 1.0,
    "num_constraints" : {
      "is_non_negative" : true
    }
  }, {
    "name" : "???????",
    "inferred_type" : "Integral",
    "completeness" : 1.0,
    "num_constraints" : {
      "is_non_negative" : true
    }
  }, {
    "name" : "????",
    "inferred_type" : "Integral",
    "completeness" : 1.0,
    "num_constraints" : {
      "is_non_negative" : true
    }
  }
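To check a downloaded output file for this symptom without reading it by eye, something like the sketch below could be used. It assumes only the structure visible in the excerpt above (a top-level "features" list whose entries have a "name" key); the function names are hypothetical and not part of the SageMaker SDK.

```python
import json


def masked_feature_names(document: dict) -> list[str]:
    """Return feature names that consist solely of '?' characters."""
    names = [feature.get("name", "") for feature in document.get("features", [])]
    return [name for name in names if name and set(name) == {"?"}]


def check_file(path: str) -> list[str]:
    """Load a local copy of statistics.json or constraints.json and check it."""
    with open(path, encoding="utf-8") as f:
        return masked_feature_names(json.load(f))
```

A non-empty result from check_file("constraints.json") would indicate that the names were replaced before the JSON was written.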

Expected behavior Japanese text is shown correctly in statistics.json and constraints.json.


johansew commented 3 months ago

The issue persists even when using MonitoringDatasetFormat.parquet format.

johansew commented 3 months ago

The issue is caused by processing inside the SageMaker Clarify container and is not related to the SageMaker Python SDK.