Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License

MLtable save method creates invalid YAML schema #29480

Open leeharper2425 opened 1 year ago

leeharper2425 commented 1 year ago

Describe the bug I am creating a component that converts a CSV uri_file input to an mltable output for use with the AutoML designer component. The component succeeds and creates the MLTable file. However, the file is written with invalid entries in the read_delimited transformation schema:

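    paths:
    - file: ../INPUT_input_data/output_data_train
    transformations:
    - read_delimited:
        delimiter: ','
        empty_as_string: false
        encoding: utf8
        header: all_files_same_headers
        include_path_column: false
        infer_column_types: true
        partition_size: 20971520
        path_column: Path
        support_multi_line: false
    type: mltable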

The partition_size field is invalid, so this MLTable fails downstream validation. The error message is:

    Encountered user error while fetching data from Dataset. Error: UserErrorException:
        Message: MLTable yaml schema is invalid:
    Error Code: Validation
    Validation Error Code: Invalid MLTable
    Validation Target: MLTableToDataflow
    Error Message: Failed to convert a MLTable to dataflow
    read_delimited transformation does not support some of the provided properties: partition_size.
    | session_id=9201b4db-6fa0-4745-aef9-dd23aae6a07f
        InnerException None
        ErrorResponse
    {
        "error": {
            "code": "UserError",
            "message": "MLTable yaml schema is invalid: \nError Code: Validation\nValidation Error Code: Invalid MLTable\nValidation Target: MLTableToDataflow\nError Message: Failed to convert a MLTable to dataflow\nread_delimited transformation does not support some of the provided properties: partition_size.\n| session_id=9201b4db-6fa0-4745-aef9-dd23aae6a07f"
        }
    }

To Reproduce Steps to reproduce the behavior:

  1. Create a CSV file

  2. Turn it into the uri_file input of an AML designer component

  3. Load it through mltable in a script:

    import mltable
    
    # Read file into the component
    path = [{"file": f"{args.input_data}"}]
    tbl = mltable.from_delimited_files(path)
    
    # Write file out of the component
    tbl.save(args.output_data)

  4. Save it immediately with no additional changes

  5. Try to load into an AutoML component.

Expected behavior I expect the saved MLTable to load without any issues. I believe this means that, per the documentation, the partition_size field should not be present in the saved file.
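
The expectation above can be checked on the parsed YAML before the file is handed to a downstream component. The following is only a rough sketch: `find_property` is a hypothetical helper (not part of the mltable package), and the dictionary is an abbreviated, hand-written parse of the MLTable file quoted in this issue.

```python
from typing import Any


def find_property(doc: dict[str, Any], prop: str) -> list[tuple[int, str]]:
    """Return (index, transformation name) for every transformation that sets `prop`."""
    hits = []
    for i, step in enumerate(doc.get("transformations") or []):
        if not isinstance(step, dict):
            continue  # a bare transformation name carries no properties
        for name, options in step.items():
            if isinstance(options, dict) and prop in options:
                hits.append((i, name))
    return hits


# Abbreviated parse of the saved MLTable file from this issue (assumed shape)
saved = {
    "paths": [{"file": "../INPUT_input_data/output_data_train"}],
    "transformations": [
        {
            "read_delimited": {
                "delimiter": ",",
                "encoding": "utf8",
                "partition_size": 20971520,
            }
        }
    ],
    "type": "mltable",
}

print(find_property(saved, "partition_size"))  # [(0, 'read_delimited')]
```

An empty list would mean the field is absent, which is the state expected after saving.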


mccoyp commented 1 year ago

Hi @leeharper2425, thank you for opening an issue! I'll tag the appropriate team so we can respond as soon as possible. cc: @azureml-github

ghost commented 1 year ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github, @Azure/azure-ml-sdk.

Issue Details
- **Package Name**: mltable
- **Package Version**: 1.2.0
- **Operating System**: Ubuntu 20.04
- **Python Version**: 3.10.9

**Describe the bug**
I am creating a component that converts a CSV uri_file type to an mltable file type for use with the AutoML designer component. My MLTable component is succeeding, and creating the MLtable file. However, it is being created with invalid entries in the read_delimited transformation schema:

    paths:
    - file: ../INPUT_input_data/output_data_train
    transformations:
    - read_delimited:
        delimiter: ','
        empty_as_string: false
        encoding: utf8
        header: all_files_same_headers
        include_path_column: false
        infer_column_types: true
        partition_size: 20971520
        path_column: Path
        support_multi_line: false
    type: mltable

The partition_size field is invalid, and thus this MLtable is failing downstream validation. The error message is:

Encountered user error while fetching data from Dataset. Error: UserErrorException: Message: MLTable yaml schema is invalid: Error Code: Validation Validation Error Code: Invalid MLTable Validation Target: MLTableToDataflow Error Message: Failed to convert a MLTable to dataflow read_delimited transformation does not support some of the provided properties: partition_size. | session_id=9201b4db-6fa0-4745-aef9-dd23aae6a07f InnerException None ErrorResponse { "error": { "code": "UserError", "message": "MLTable yaml schema is invalid: \nError Code: Validation\nValidation Error Code: Invalid MLTable\nValidation Target: MLTableToDataflow\nError Message: Failed to convert a MLTable to dataflow\nread_delimited transformation does not support some of the provided properties: partition_size.\n| session_id=9201b4db-6fa0-4745-aef9-dd23aae6a07f" } }

**To Reproduce**
Steps to reproduce the behavior:
1. Create a CSV file
2. Turn it into the uri_file input of an AML designer component
3. Load it through mltable in a script:
```python
import mltable

# Read file into the component
path = [{"file": f"{args.input_data}"}]
tbl = mltable.from_delimited_files(path)

# Write file out of the component
tbl.save(args.output_data)
```
4. Save it immediately with no additional changes
5. Try to load into an AutoML component.

**Expected behavior**
I expect the saved MLtable to be able to load without any issues. I believe this means that the partition_size field should not be present, per the documentation, upon saving the file.

Author: leeharper2425
Assignees: luigiw
Labels: `question`, `Machine Learning`, `Service Attention`, `customer-reported`, `needs-team-attention`
Milestone: -

natehofmann commented 1 year ago

A bug fix for this should be out in the next couple of days.

xiangyan99 commented 1 year ago

@natehofmann is the fix shipped?

natehofmann commented 1 year ago

The fix was merged, but it is a server-side change, so it probably won't be available until later this week. I will update once I know it's been shipped and released. In the meantime, I recommend the workaround below, applied after saving, to remove the partition_size argument.

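import yaml
...
save_dirc = ...
# `mltable` here is the MLTable object being saved, not the module
mltable.save(save_dirc)
with open(f'{save_dirc}/MLTable', 'r+') as f:
    x = yaml.safe_load(f)
    del x['transformations'][0]['read_delimited']['partition_size']
    # Rewind and truncate so the cleaned YAML replaces the file contents
    # instead of being appended after the original text
    f.seek(0)
    f.truncate()
    yaml.safe_dump(x, f)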
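
A slightly more defensive variant of this workaround, offered only as a sketch: it removes the key from every transformation rather than only the first, and does nothing if the key is already gone (for example once the server-side fix is live). The helper names `drop_property` and `clean_mltable_file` are hypothetical and not part of mltable; the file-rewriting step assumes PyYAML is installed.

```python
import copy
from typing import Any


def drop_property(doc: dict[str, Any], prop: str = "partition_size") -> dict[str, Any]:
    """Return a copy of a parsed MLTable document with `prop` removed
    from every transformation that sets it."""
    cleaned = copy.deepcopy(doc)
    for step in cleaned.get("transformations") or []:
        if isinstance(step, dict):
            for options in step.values():
                if isinstance(options, dict):
                    options.pop(prop, None)  # no error if the key is absent
    return cleaned


def clean_mltable_file(save_dir: str, prop: str = "partition_size") -> None:
    """Rewrite <save_dir>/MLTable in place without `prop`."""
    # PyYAML is imported here so drop_property stays dependency-free
    import yaml

    path = f"{save_dir}/MLTable"
    with open(path, "r") as f:
        doc = yaml.safe_load(f)
    # Reopening in "w" mode truncates the file before the cleaned YAML is written
    with open(path, "w") as f:
        yaml.safe_dump(drop_property(doc, prop), f)
```

Calling `clean_mltable_file(save_dirc)` right after the save step would take the place of the `with` block above. Reading and writing in two separate `open` calls avoids any file-position handling.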