mltable.from_delta_lake() results in invalid table version error when specifying timestamp_as_of-parameter #38155
mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest (image used for Azure ML component)
Python 3.10
Describe the bug
Attempting to read a delta table as of a specified timestamp using mltable.from_delta_lake() within an Azure ML component results in an error saying that the table version is invalid. I am able to read from the delta table when not specifying the timestamp_as_of-parameter and when specifying a valid version_as_of-parameter.
Message: rslex failed
Payload: {"pid": 14, "MLTableVersion": "1.6.1", "rslex_version": "2.22.4", "version": "5.1.6"}
Traceback (most recent call last):
File "/azureml-envs/azureml_d11310221f6c7840c8c915e471f53b45/lib/python3.10/site-packages/mltable/mltable.py", line 1300, in to_pandas_dataframe
return get_dataframe_reader().to_pandas_dataframe(self._dataflow)
File "/azureml-envs/azureml_d11310221f6c7840c8c915e471f53b45/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 355, in to_pandas_dataframe
return _execute(
File "/azureml-envs/azureml_d11310221f6c7840c8c915e471f53b45/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 266, in _execute
raise e
File "/azureml-envs/azureml_d11310221f6c7840c8c915e471f53b45/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 246, in _execute
return rslex_execute()
File "/azureml-envs/azureml_d11310221f6c7840c8c915e471f53b45/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 177, in rslex_execute
(batches, num_partitions, stream_columns) = executor.execute_dataflow(dataflow,
File "/azureml-envs/azureml_d11310221f6c7840c8c915e471f53b45/lib/python3.10/site-packages/azureml/dataprep/api/_rslex_executor.py", line 26, in execute_dataflow
(batches, num_partitions, stream_columns) = Executor().execute_dataflow(script,
azureml.dataprep.api.errorhandlers.ExecutionError:
Error Code: ScriptExecution.Unexpected
Native Error: Dataflow visit error: ExecutionError(ExternalError { message: "Error when opening delta table using timestamp=2024-10-28T12:00:00Z: Invalid table version: 3421", source: None })
VisitError(ExecutionError(ExternalError { message: "Error when opening delta table using timestamp=2024-10-28T12:00:00Z: Invalid table version: 3421", source: None }))
=> Failed with execution error: Error when opening delta table using timestamp=2024-10-28T12:00:00Z: Invalid table version: 3421
ExecutionError(ExternalError { message: "Error when opening delta table using timestamp=2024-10-28T12:00:00Z: Invalid table version: 3421", source: None })
Error Message: Got unexpected error during execution: Error when opening delta table using timestamp=2024-10-28T12:00:00Z: Invalid table version: 3421. .| session_id=l_23d4b51d-4cfa-4002-9169-934806d32b26
Expected behavior
The behaviour is likely dependent on the actual delta table. In our case, the invalid table version mentioned in the error message above (version 3421) is no longer accessible in the delta table history; the earliest accessible version is 5152, many versions after the supposedly invalid one. When I do not specify the timestamp parameter, I can read from the delta table using mltable, so I know that the URI for the delta table is correct.
I am also able to read with the specified timestamp via Spark, so I know that the timestamp itself is valid and within the delta table history. For the timestamp in question (2024-10-28T12:00:00Z), I expect the corresponding table version to be 6783, so it seems odd that mltable would consider version 3421 at all.
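To make the expectation above concrete: Delta Lake time travel by timestamp is generally understood to select the latest version committed at or before the requested timestamp. The sketch below illustrates that rule only. It is not how mltable or rslex resolves versions internally. The helpers `parse_utc` and `resolve_version` and every entry in `HISTORY` (in particular all commit timestamps) are hypothetical; only the version numbers 5152 and 6783 and the requested timestamp come from this report.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp such as '2024-10-28T12:00:00Z' as UTC."""
    # datetime.fromisoformat() in Python 3.10 rejects a trailing 'Z',
    # so it is rewritten as an explicit UTC offset first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)


def resolve_version(history, timestamp_as_of: str) -> int:
    """Return the latest version committed at or before timestamp_as_of.

    history is an iterable of (version, commit_timestamp) pairs covering
    only the versions still retained in the table history.
    """
    commits = sorted((parse_utc(ts), version) for version, ts in history)
    target = parse_utc(timestamp_as_of)
    # Count the commits whose timestamp is <= target.
    idx = bisect_right(commits, (target, float("inf")))
    if idx == 0:
        raise ValueError(
            f"{timestamp_as_of} is earlier than the earliest retained commit"
        )
    return commits[idx - 1][1]


# Hypothetical history: the commit timestamps are invented for illustration.
HISTORY = [
    (5152, "2024-10-01T00:00:00Z"),  # earliest version still accessible
    (6782, "2024-10-28T11:30:00Z"),
    (6783, "2024-10-28T11:55:00Z"),
    (6784, "2024-10-28T12:10:00Z"),
]

print(resolve_version(HISTORY, "2024-10-28T12:00:00Z"))
```

With this invented history the lookup returns 6783 and never touches a version older than 5152. A version resolved outside mltable in this way (for example from Spark's `DESCRIBE HISTORY` output) could then be passed as `version_as_of`, which this report states works.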
To Reproduce
Steps to reproduce the behavior:
df = mltable.from_delta_lake(delta_uri, timestamp_as_of="2024-10-28T12:00:00Z").to_pandas_dataframe()
Copy of the error: see the traceback above.