Azure / azure-sdk-for-python


mltable.from_delta_lake() results in invalid table version error when specifying timestamp_as_of-parameter #38155

Open matiassiv opened 3 weeks ago

matiassiv commented 3 weeks ago

Describe the bug

Reading a delta table as of a specified timestamp with mltable.from_delta_lake() inside an Azure ML component fails with an error saying that the table version is invalid. Reading the same table works when I omit the timestamp_as_of parameter, and also when I pass a valid version_as_of parameter.

To Reproduce

Steps to reproduce the behavior:

  1. df = mltable.from_delta_lake(delta_uri, timestamp_as_of="2024-10-28T12:00:00Z").to_pandas_dataframe()

Copy of error below:

Message: rslex failed
Payload: {"pid": 14, "MLTableVersion": "1.6.1", "rslex_version": "2.22.4", "version": "5.1.6"}
Traceback (most recent call last):
  File "/azureml-envs/azureml_d11310221f6c7840c8c915e471f53b45/lib/python3.10/site-packages/mltable/mltable.py", line 1300, in to_pandas_dataframe
    return get_dataframe_reader().to_pandas_dataframe(self._dataflow)
  File "/azureml-envs/azureml_d11310221f6c7840c8c915e471f53b45/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 355, in to_pandas_dataframe
    return _execute(
  File "/azureml-envs/azureml_d11310221f6c7840c8c915e471f53b45/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 266, in _execute
    raise e
  File "/azureml-envs/azureml_d11310221f6c7840c8c915e471f53b45/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 246, in _execute
    return rslex_execute()
  File "/azureml-envs/azureml_d11310221f6c7840c8c915e471f53b45/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 177, in rslex_execute
    (batches, num_partitions, stream_columns) = executor.execute_dataflow(dataflow,
  File "/azureml-envs/azureml_d11310221f6c7840c8c915e471f53b45/lib/python3.10/site-packages/azureml/dataprep/api/_rslex_executor.py", line 26, in execute_dataflow
    (batches, num_partitions, stream_columns) = Executor().execute_dataflow(script,
azureml.dataprep.api.errorhandlers.ExecutionError: 
Error Code: ScriptExecution.Unexpected
Native Error: Dataflow visit error: ExecutionError(ExternalError { message: "Error when opening delta table using timestamp=2024-10-28T12:00:00Z: Invalid table version: 3421", source: None })
    VisitError(ExecutionError(ExternalError { message: "Error when opening delta table using timestamp=2024-10-28T12:00:00Z: Invalid table version: 3421", source: None }))
=> Failed with execution error: Error when opening delta table using timestamp=2024-10-28T12:00:00Z: Invalid table version: 3421
    ExecutionError(ExternalError { message: "Error when opening delta table using timestamp=2024-10-28T12:00:00Z: Invalid table version: 3421", source: None })
Error Message: Got unexpected error during execution: Error when opening delta table using timestamp=2024-10-28T12:00:00Z: Invalid table version: 3421. .| session_id=l_23d4b51d-4cfa-4002-9169-934806d32b26

Expected behavior

The behaviour likely depends on the state of the actual delta table. In our case, the version named in the error message above (3421) is no longer accessible in the delta table history: the earliest accessible version is 5152, well over a thousand versions later. Reading the table with mltable works when I do not specify the timestamp parameter, so I know the URI for the delta table is correct.

I am also able to read the table at the specified timestamp via Spark, so I know the timestamp itself is valid and falls within the delta table history. For the timestamp in question (2024-10-28T12:00:00Z), I expect the corresponding table version to be 6783, so it is unclear to me why mltable should touch version 3421 at all.
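To make the expectation concrete, here is a minimal sketch of how I would expect a timestamp to map to a version: pick the latest commit whose timestamp is at or before the requested one, considering only versions still present in the history. This is not how rslex or mltable implements it (I have not seen that code); resolve_version, parse_utc, and HISTORY are hypothetical names, and the commit timestamps are made up for illustration. Only the version numbers 5152 and 6783 come from our table.

```python
from bisect import bisect_right
from datetime import datetime


def parse_utc(ts: str) -> datetime:
    # datetime.fromisoformat on Python 3.10 rejects a trailing "Z",
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def resolve_version(history, timestamp_as_of: str) -> int:
    """Return the latest version committed at or before timestamp_as_of.

    history: list of (version, commit_timestamp) pairs, sorted by version,
    with non-decreasing timestamps (an assumption of this sketch).
    """
    target = parse_utc(timestamp_as_of)
    commit_times = [parse_utc(ts) for _, ts in history]
    idx = bisect_right(commit_times, target)
    if idx == 0:
        raise ValueError(
            f"{timestamp_as_of} is earlier than the earliest accessible "
            f"version {history[0][0]}"
        )
    return history[idx - 1][0]


# Hypothetical, sparse sample of the accessible history.
# The timestamps are invented; 5152 is the earliest accessible version.
HISTORY = [
    (5152, "2024-10-01T00:00:00Z"),
    (6000, "2024-10-15T00:00:00Z"),
    (6783, "2024-10-28T11:58:00Z"),
    (6784, "2024-10-28T12:05:00Z"),
]

print(resolve_version(HISTORY, "2024-10-28T12:00:00Z"))  # prints 6783
```

Under this logic a lookup can never return a version older than the earliest accessible one, which is why version 3421 in the error is surprising. Until this is fixed, a version resolved by other means (for example from the table history in Spark) can presumably be passed as version_as_of, which does work for us.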

github-actions[bot] commented 3 weeks ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

github-actions[bot] commented 3 weeks ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/azure-ml-sdk @azureml-github.