delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Not able to access Azure Delta Lake #600

Closed ganesh-gawande closed 1 year ago

ganesh-gawande commented 2 years ago

Discussed in https://github.com/delta-io/delta-rs/discussions/599

Originally posted by **ganesh-gawande** May 9, 2022

Hi, I am following the documentation at https://github.com/delta-io/delta-rs/blob/main/docs/ADLSGen2-HOWTO.md. I have tried many variations of the path, but I am not able to access the Delta table. I receive one of the following errors: **Not a Delta table: No snapshot or version 0 found** or **Invalid object URI**. Here are the paths I have tried in my code, none of which work:

```
delta = DeltaTable("adls2://{ContainerName}@{StorageAccountName}.dfs.core.windows.net")
delta = DeltaTable("adls2://{StorageAccountName}/{ContainerName}/{Folder1}/{Folder2}/{FileName}.parquet")
delta = DeltaTable("adls2://{StorageAccountName}/{ContainerName}/{DeltaTableNameFromDatabricks}")
delta = DeltaTable("adls2://{StorageAccountName}/{ContainerName}/")
delta = DeltaTable("adls2://{ContainerName}@{StorageAccountName}.dfs.core.windows.net/{ContainerName}/{DeltaTableNameFromDatabricks}")
delta = DeltaTable("abfss://{ContainerName}@{StorageAccountName}.dfs.core.windows.net/{ContainerName}/{DeltaTableNameFromDatabricks}")
delta = DeltaTable("abfss://{ContainerName}@{StorageAccountName}.dfs.core.windows.net/{ContainerName}/")
delta = DeltaTable("abfss://{ContainerName}@{StorageAccountName}.dfs.core.windows.net/")
```
roeap commented 2 years ago

Hi @ganesh-gawande, thanks for this report. Could you share the basic layout of your storage account? Specifically, where is the _delta_log folder located?

ganesh-gawande commented 2 years ago

Hi @roeap - Storage account -> Container -> Delta log folder

roeap commented 2 years ago

Not sure I follow. Does this mean the Delta table is located at the root of the container within the storage account? In your examples you mention DeltaTableNameFromDatabricks - does that refer only to the name given in the metadata?

Also, would it be possible to share the (possibly redacted) contents of the 00000000000000000000.json file from the delta log?

ganesh-gawande commented 2 years ago

Yes. The Delta table is located at the root of the container in the storage account. This means that when I open the container in Storage Explorer, I can see the _delta_log folder.

More info: I have partitioned the Delta table by id and date. At the same level as the _delta_log folder there is a folder per id; within each id folder there is a date folder, and within that the .parquet files. So the folder hierarchy is roughly Container -> {id} -> {date} -> *.parquet.

I don't see a JSON file with the name you mentioned. I can see multiple JSON files with different numbers, e.g. 00000000000000086096.json, 00000000000000086097.json, 00000000000000086098.json, etc.

roeap commented 2 years ago

Thanks! In that case "adls2://{ContainerName}@{StorageAccountName}.dfs.core.windows.net" should be the way to go. The fact that we are getting the error you mentioned should already mean that we can access the storage, but it seems we are not finding the entry point to the logs. AFAIK the log files should never be deleted, and the all-zeros file I mentioned denotes the initial commit. I have to look a bit into our codebase to remind myself how delta-rs starts parsing the log.

Given the number of commits, there should also be a file called _last_checkpoint, which should point to a parquet file containing all commit info up to that checkpoint and allows us to avoid parsing the log from the very beginning. If that file does not exist and we cannot find the aforementioned file, the error message you mentioned is shown.

So my question is: does that _last_checkpoint file exist, and if so, does it point to an existing parquet file in the log folder? In any case, the 00000000000000000000.json file not existing might already be considered a corrupt delta log, even though it should work as long as checkpoint files exist with all relevant information.

@houqp - is that correct?

ganesh-gawande commented 2 years ago

Thanks for the insights.

This is still not working: adls2://{ContainerName}@{StorageAccountName}.dfs.core.windows.net

I have checked the log folders of my other Delta Lake storage accounts as well, but did not find a 00000000000000000000.json file.

Having said that, in all the Delta Lake log folders I can locate a checkpoint file such as 00000000000000085996.checkpoint.parquet, which contains multiple records with commit info.

Let me know once you have taken a look at the code to check how delta-rs starts parsing the logs.

roeap commented 2 years ago

Just making sure - does the _last_checkpoint file exist?

The logic is roughly like this: check whether we find the _last_checkpoint file; if yes, start from that checkpoint; if not, start from the all-zeros file. If that is not found either, you see the message you posted in the beginning.
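
For readers following along, here is a rough Python sketch of that resolution order (illustrative only, not the actual Rust implementation; the file names follow the Delta log layout discussed in this thread):

```python
import json
import os

def find_log_start(delta_log_dir):
    """Return the version to start reading the log from, per the logic described above."""
    last_checkpoint = os.path.join(delta_log_dir, "_last_checkpoint")
    if os.path.exists(last_checkpoint):
        # _last_checkpoint is a small JSON file pointing at the latest checkpoint version.
        with open(last_checkpoint) as f:
            return json.load(f)["version"]
    if os.path.exists(os.path.join(delta_log_dir, f"{0:020d}.json")):
        # No checkpoint pointer: fall back to the initial commit file (all zeros).
        return 0
    raise RuntimeError("Not a Delta table: No snapshot or version 0 found")
```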

ganesh-gawande commented 2 years ago

There is no file named _last_checkpoint.

The file names look like 00000000000000085996.checkpoint.parquet.

That is the only file with "checkpoint" in its name.

More information: I am writing this data to this Delta Lake from a Databricks notebook in delta format.

roeap commented 2 years ago

Hmm, strange... this seems like corruption in the delta log to me. When Databricks creates a checkpoint it should also create a _last_checkpoint file. The Rust implementation relies on either identifying the latest checkpoint via that file or starting from the beginning.

One way to load the table could be to use the load_version function, i.e. table.load_version(85996). Looking quickly at the Delta specification, it seems to me this scenario (i.e. parts of the log missing) is not something a reader needs to support, but it would be resilient to it if the _last_checkpoint file exists.

If you use the load_version command mentioned above, we search for the closest checkpoint with a lower or equal version, and that "should" work. So it should work with any version higher than that checkpoint's version. The reasoning for all that logic is tables exactly like yours, where listing a directory with tens of thousands of files becomes prohibitively expensive...

I'd be interested to know if databricks is able to load that table without specifying a specific version.
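
To make that suggestion concrete, a hedged sketch (the placeholders and the version number come from this thread; Azure credentials are assumed to be provided via environment variables):

```python
from deltalake import DeltaTable

path = "adls2://{StorageAccountName}/{ContainerName}/"

# If the table object can be constructed, jump to a known version at or above the checkpoint.
table = DeltaTable(path)
table.load_version(85996)

# Depending on the installed release, a version can also be requested at construction time:
# table = DeltaTable(path, version=85996)
```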

ganesh-gawande commented 2 years ago

I am able to query the Delta table correctly from Databricks notebooks. I assume that also works off the delta log and the latest checkpoint, so if those delta logs were corrupted it should also have caused problems when the same Delta table is queried from a Databricks notebook. Is this understanding correct?

I have the following settings in the Databricks cluster Spark configuration. Could these have anything to do with changing the names of the delta log files or checkpoint files?

spark.databricks.delta.symlinkFormatManifest.fileSystemCheck.enabled false
spark.databricks.delta.preview.enabled true
spark.databricks.delta.schema.autoMerge.enabled true

ganesh-gawande commented 2 years ago

In addition, to test for corruption of the delta logs, I created a new Azure storage account and created a Delta table in it.

Now I can see the JSON file 00000000000000000000.json in the _delta_log folder. [screenshot]

I have updated the configuration in my Python code where I am using the delta-rs package: delta = DeltaTable("adls2://sample@sampledeltalakestorage.dfs.core.windows.net")

In the above code, sample is the container name and sampledeltalakestorage is my storage account name.

I am still getting the same error - Failed to read delta log object: Invalid object URI

wjones127 commented 2 years ago

In any case, the 00000000000000000000.json file not existing might already be considered a corrupt delta log, even though it should work as long as checkpoint files exist with all relevant information.

@roeap Actually, the first delta entry is not guaranteed to exist. See my update in https://github.com/delta-io/delta/pull/913

Not sure if we are testing that in this repo though.

roeap commented 2 years ago

@ganesh-gawande - sorry for sending you on a wild goose chase ... I recreated the behaviour locally. Seems like we have a bug where we cannot read a table from the root of a container. I opened #602 to track this and should soon be able to get to this.

Actually, the first delta entry is not guaranteed to exist

@wjones127 would we then expect that the _last_checkpoint file exists to discover a checkpoint, or do we have to rely on lexicographical sort to find the file? We should definitely have a test for this, not sure that we do. If we can do without the last checkpoint file we should have a closer look at the code path - not sure that we could handle that.

wjones127 commented 2 years ago

would we then expect that the _last_checkpoint file exists to discover a checkpoint

It's vague in the protocol, but I don't think it necessarily exists. (And it's also not guaranteed to point to the most recent checkpoint.) We probably shouldn't rely on it 😢

roeap commented 2 years ago

@ganesh-gawande - so the path you should be using is adls2://{StorageAccountName}/{ContainerName}/. After #603 is merged adls2://{StorageAccountName}/{ContainerName} should also work.

However, I also tried loading a delta log with the initial commit files removed, which only works if there is a _last_checkpoint file present. When that file is missing we see the exact error message you encountered.

@wjones127 @houqp - I do remember the protocol explicitly mentioning lexicographical sort for working with the log. Should we implement that logic, or first make sure that Delta readers actually need to support finding checkpoints without that file - or are we already sure :).

I guess the core logic from loading a specific version can already largely be reused. Likely we would also want to mirror the logic in our writers to create a checkpoint every ten commits.
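
Purely as an illustration (not delta-rs code), discovering the newest checkpoint without a _last_checkpoint file could rely on the zero-padded names sorting lexicographically, roughly like this:

```python
import os
import re

def latest_checkpoint_version(delta_log_dir):
    """Return the highest checkpoint version found by listing the log directory, or None."""
    # Matches both single-part and multi-part checkpoint file names.
    pattern = re.compile(r"^(\d{20})\.checkpoint(\.\d+\.\d+)?\.parquet$")
    versions = sorted(
        int(m.group(1))
        for name in os.listdir(delta_log_dir)
        if (m := pattern.match(name))
    )
    return versions[-1] if versions else None
```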

ganesh-gawande commented 2 years ago

@roeap - As per your suggestion, I tried the path adls2://{StorageAccountName}/{ContainerName}/ and it is not working.

Do I need to update the version using pip, or is there something else I am missing? Or do I need to wait for both #602 and #603 to be merged?

roeap commented 2 years ago

I've kind of lost track... just making sure, so to clarify:

Did you also try the test table you created which DOES have 00000000000000000000.json?

ganesh-gawande commented 2 years ago

Yes. As in the message with the screenshot above, I created a new storage account and a new Delta table, which has the 00000000000000000000.json file as well.

I am still not able to access it. I am getting the error: Not a Delta table: No snapshot or version 0 found, perhaps adls2://sampledeltalakestorage/sample is an empty dir?

roeap commented 2 years ago

Also if you add a trailing "/"? This is actually the bug that gets fixed in #603, i.e. if the trailing slash is missing it also fails for me...

ganesh-gawande commented 2 years ago

Yes. Whether I add a trailing / or not, I still get the same error.

roeap commented 2 years ago

At which point do you see the error? I just tried with the released Python package: I can load the table and get metadata, history, etc., but I am also seeing an error when trying to materialize the dataset.

could you try something like

table = DeltaTable("adls2://{StorageAccountName}/{ContainerName}/")
table.pyarrow_schema()

That would help narrow down the source of the error.

ganesh-gawande commented 2 years ago

Can you please confirm which version you are using? Or point me to the URL to download the latest version so that I can check again?

roeap commented 2 years ago

I used the released version 0.5.7.

ganesh-gawande commented 2 years ago

When I try to install or upgrade to the new version, I get the following errors. See the first line and the last line: it starts downloading 0.5.7, but at the last line it installs 0.5.6. Any thoughts on this?

I have also upgraded pip, to 21.3.1.

Collecting deltalake
  Using cached deltalake-0.5.7.tar.gz (4.3 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... error
  ERROR: Command errored out with exit status 1:
   command: 'C:\Users\g.gawande\AppData\Local\Programs\Python\Python36\python.exe' 'C:\Users\g.gawande\AppData\Local\Programs\Python\Python36\lib\site-packages\pip_vendor\pep517\in_process_in_process.py' prepare_metadata_for_build_wheel 'C:\Users\G4071~1.GAW\AppData\Local\Temp\tmprw4bi353'
       cwd: C:\Users\G4071~1.GAW\AppData\Local\Temp\pip-install-0jujfpje\deltalake_4eba21a2f36d4e65952f26952f2d7b48
  Complete output (6 lines):

  Cargo, the Rust package manager, is not installed or is not on PATH. This package requires Rust and Cargo to compile extensions. Install it through the system's package manager or via https://rustup.rs/

  Checking for Rust toolchain....

  WARNING: Discarding https://files.pythonhosted.org/packages/4f/6c/fe7dafb8e4fed25e97652c1ab1bbd73ae4fb1f32881abc73e9dcaabe1167/deltalake-0.5.7.tar.gz#sha256=b14f7417f72fa363519e7080ed9c99f4fc31f93a8af8428fae2370f090297bc6 (from https://pypi.org/simple/deltalake/) (requires-python:>=3.6). Command errored out with exit status 1: 'C:\Users\g.gawande\AppData\Local\Programs\Python\Python36\python.exe' 'C:\Users\g.gawande\AppData\Local\Programs\Python\Python36\lib\site-packages\pip_vendor\pep517\in_process_in_process.py' prepare_metadata_for_build_wheel 'C:\Users\G4071~1.GAW\AppData\Local\Temp\tmprw4bi353' Check the logs for full command output.
Using cached deltalake-0.5.6-cp36-abi3-win_amd64.whl (6.4 MB)
Requirement already satisfied: pyarrow>=4 in c:\users\g.gawande\appdata\local\programs\python\python36\lib\site-packages (from deltalake) (6.0.1)
Requirement already satisfied: dataclasses in c:\users\g.gawande\appdata\local\programs\python\python36\lib\site-packages (from deltalake) (0.8)
Requirement already satisfied: numpy<1.20.0 in c:\users\g.gawande\appdata\local\programs\python\python36\lib\site-packages (from deltalake) (1.19.5)
Installing collected packages: deltalake
Successfully installed deltalake-0.5.6

roeap commented 2 years ago

Seems like your system is choosing a source distribution for 0.5.7, while using a pre-compiled wheel for 0.5.6. As you do not seem to have cargo (i.e. the Rust toolchain) installed on your system, it fails to build the package locally. As to why that is the case I am not sure, as I am not too familiar with how Python (or pip) chooses which install method / artifact to use.

roeap commented 2 years ago

@ganesh-gawande - I was able to dig into loading the table from Python. It turns out #602 also caused this issue, since the trailing slash gets truncated when we initialize the file system internally.

I was able to load the table using the following workaround.

from deltalake import DeltaTable
from deltalake.fs import DeltaStorageHandler
import pyarrow.fs as pa_fs

path = "adls2://{StorageAccountName}/{ContainerName}/"

table = DeltaTable(path)
filesystem = pa_fs.PyFileSystem(DeltaStorageHandler(path))

ds = table.to_pyarrow_dataset(filesystem=filesystem)
roeap commented 2 years ago

@ganesh-gawande is this still relevant, or did you manage to resolve this?

ganesh-gawande commented 2 years ago

@roeap

I was able to upgrade the release version to 0.5.7. (For that I needed to upgrade my Python version to 3.8.10, then upgrade pip, and then I was able to upgrade to 0.5.7.)

After that, I am trying to use the code you gave in https://github.com/delta-io/delta-rs/issues/600#issuecomment-1125647640

I have created a new Delta Lake at the root of the container, which has a _delta_log folder and, within that, the 00000000000000000000.json file.

I am still getting the issue - Not a Delta table: No snapshot or version 0 found. Please find the attached screenshot for the same.

[screenshot]

ganesh-gawande commented 2 years ago

@roeap - Please let me know if you need more information on this. I can share the code file and the associated storage account keys with you offline if you want to check it.

ganesh-gawande commented 2 years ago

@roeap - Any luck on this issue?

roeap commented 2 years ago

@ganesh-gawande - I may have stumbled across a bug in the Azure SDKs. I need to investigate a bit more, but hopefully I can supply a fix soon.

roeap commented 2 years ago

@ganesh-gawande - a lot of fixes are included now on main - could you confirm that you can read your table?

In case you don't want to build from main - you have to wait a bit more until 0.5.8 is released (#640)

ganesh-gawande commented 2 years ago

@roeap Thank you very much for the fix. I will try to check out main and build it. Do you have any documentation/steps on how to build this repo?

Meanwhile, any idea when 0.5.8 will be released? Any timelines?

roeap commented 2 years ago

Should be very soon - the release PR just needs to be updated and merged, but all pending work that we wanted to include is on main.

roeap commented 2 years ago

@ganesh-gawande - the new Python bindings are released. Could you check whether that mitigates your error?

ganesh-gawande commented 2 years ago

@roeap - Unfortunately I am getting the same error as earlier. I have removed the earlier deltalake package version and installed the new 0.5.8. Sharing all the details again here for your quick reference.

Here is my Azure storage structure, with the storage account name and container name:

[screenshot: storage structure]

Here are the contents of the _delta_log folder:

[screenshot: _delta_log folder]

Here is the code snippet I am using, as you shared:

from deltalake import DeltaTable
from deltalake.fs import DeltaStorageHandler
import pyarrow.fs as pa_fs
import os

path = "adls2://sampledeltalakestorage/sample/"
table = DeltaTable(path)

I am getting following error -

File "C:\Users\g.gawande\AppData\Local\Programs\Python\Python38\lib\site-packages\deltalake\table.py", line 90, in __init__
    self._table = RawDeltaTable(
deltalake.PyDeltaTableError: Not a Delta table: No snapshot or version 0 found, perhaps adls2://sampledeltalakestorage/sample is an empty dir?
s-suryakiran-sureshkumar commented 2 years ago

@roeap - Unfortunately I am getting the same error as earlier. I have removed the earlier deltalake package version and installed the new 0.5.8. Sharing all the details again here for your quick reference. [...]

I am also getting the same issue.

Here is the code snippet that I used:

from deltalake import DeltaTable
import os

path = "adls2://{storage_account_name}/{container_name}/{delta_table}/"
table = DeltaTable(path)

I am getting following error -

    self._table = RawDeltaTable(
deltalake.PyDeltaTableError: Not a Delta table: No snapshot or version 0 found

Is there any solution for the issue?

roeap commented 2 years ago

Hmm, this is a bit puzzling. Could you remind me which authorization mechanism you are using, and also validate that you can read and list on that account?

During tests, as well as at my work, we are successfully working with tables stored in Azure, and there really is not much more one can do other than provide proper authorization. Given your code snippet (and probably discussed above somewhere :)) I assume you are using environment variables, right?

There were some issues in Azure list operations that have been fixed recently, but reading a table with the initial version file present should not require any list operation... In any case, if you can build off main and see if that works, that is always helpful :).
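
For reference, a minimal sketch of the environment-variable approach assumed above; the variable names mirror the storage option keys used later in this thread, and the account, key, and container are placeholders:

```python
import os
from deltalake import DeltaTable

# Credentials must be set before the table is created.
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "sampledeltalakestorage"
os.environ["AZURE_STORAGE_ACCOUNT_KEY"] = "<account-key>"

dt = DeltaTable("adls2://sampledeltalakestorage/sample/")
print(dt.version())   # exercises log parsing
print(dt.files())     # exercises reading the file list from the log
```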

michaelenew commented 2 years ago

@roeap I am hitting this same error message today trying to write to S3 and local with deltalake.writer.write_deltalake. The error message appears when passing a DeltaTable object, but not when passing a string. Hopefully this helps with a repro case here.

# Error: deltalake.PyDeltaTableError: Not a Delta table: No snapshot or version 0 found...
import pyarrow
import deltalake
deltalake.writer.write_deltalake(
    deltalake.DeltaTable('/path/to/my/table'),
    pyarrow.RecordBatch.from_pylist([{f'col{i}': i for i in range(5)}]),
    mode = 'append'
)
# Runs as expected
import pyarrow
import deltalake
deltalake.writer.write_deltalake(
    '/path/to/my/table',
    pyarrow.RecordBatch.from_pylist([{f'col{i}': i for i in range(5)}]),
    mode = 'append'
)
roeap commented 2 years ago

@michaelenew - thanks for the report! Currently writing to remote stores is not supported using write_deltalake. However we are actively working on integrating a new object store implementation also with the intent to enable this.

cdena commented 2 years ago

Update: I think something has changed in 0.6.0 and the docs aren't published yet. I re-installed with target version 0.5.8 and adls2 worked for reading the table. I'll wait for the 0.6.0 docs to see what may have changed and try with the newer version.

@roeap I am running the latest release 0.6.0 and also running into the azure error "Not a Delta table: No snapshot or version 0 found, perhaps adls2://accountname/sandbox/taxi_data is an empty dir?"

I have pulled the delta table directory down locally and run the DeltaTable call and it works fine.

I have double checked that the ENV entries are coming through. If I remove them then the error output complains about missing auth. I can also use azure.storage.blob with the key and account name and list every file. Below is a summary.

#works
path = r'C:/temp/taxi_data'
delta = DeltaTable(path)
dataFrames = delta.to_pyarrow_table().to_pandas()
print(dataFrames)

#Doesn't Work - gives  error:  Not a Delta table: No snapshot or version 0 found
delta = DeltaTable('adls2://accountname/sandbox/taxi_data/')
dataFrames = delta.to_pyarrow_table().to_pandas()

#Below sort of works - DeltaTable runs, but the subsequent call to to_pyarrow_table fails with:
#  Exception has occurred: PyDeltaTableError
#  Object at location abfss:/container@account.dfs.core.windows.net/taxi_data/part-00000-19f3445e-57f8-49ea-8de1-1c496515a52d-c000.snappy.parquet not found: response error "<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.

delta = DeltaTable('abfss://container@account.dfs.core.windows.net/taxi_data/')   # <---- runs here
dataFrames = delta.to_pyarrow_table().to_pandas()  # <------- fails on this line
roeap commented 2 years ago

@ganesh-gawande @cdena - we have released 0.6.1, could you check if this works for you now? Also, we now recommend different table identifiers - https://delta-io.github.io/delta-rs/python/usage.html#loading-a-delta-table
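
As a hedged example of the newer-style identifiers linked above (az:// scheme with explicit storage_options; the account, container, and table names reuse examples from this thread, and the key is a placeholder):

```python
from deltalake import DeltaTable

storage_options = {
    "AZURE_STORAGE_ACCOUNT_NAME": "accountname",
    "AZURE_STORAGE_ACCOUNT_KEY": "<account-key>",
}

dt = DeltaTable("az://sandbox/taxi_data", storage_options=storage_options)
df = dt.to_pyarrow_table().to_pandas()
```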

iamkk11 commented 2 years ago

@michaelenew - thanks for the report! Currently writing to remote stores is not supported using write_deltalake. However we are actively working on integrating a new object store implementation also with the intent to enable this.

When will writing to remote stores be supported? @roeap

roeap commented 2 years ago

Current main already supports writing to remote stores, at least experimentally. We still have to iron out some bugs and do more integration testing, but the next release will have some initial support. I assume the next release will happen within the next few weeks.

iamkk11 commented 2 years ago

Current main already supports writing to remote stores, at least experimentally. We still have to iron out some bugs and do more integration testing, but the next release will have some initial support. I assume the next release will happen within the next few weeks.

may I please have an example of how this works? How and where do I pass in the auth parameters to remote azure blob storage?

roeap commented 2 years ago

In the docs there is an example, as well as a link to the available Azure options.

The same storage options can also be passed to the write_deltalake function. If you pass the table itself, the table's storage is used, and it is not required to pass the options separately.
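
A short sketch of both variants described above, with placeholder names and credentials (it assumes a release where write_deltalake supports remote URIs and accepts storage_options, as discussed in this thread):

```python
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

storage_options = {
    "AZURE_STORAGE_ACCOUNT_NAME": "<account-name>",
    "AZURE_STORAGE_ACCOUNT_KEY": "<account-key>",
}

path = "az://{ContainerName}/{TableName}"
df = pa.table({"id": [1, 2, 3]})

# Variant 1: pass the options directly to the writer.
write_deltalake(path, df, mode="append", storage_options=storage_options)

# Variant 2: pass a DeltaTable created with the same options; its storage is reused.
dt = DeltaTable(path, storage_options=storage_options)
write_deltalake(dt, df, mode="append")
```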

iamkk11 commented 2 years ago

@roeap I am doing the below:

path = "az://helios-poc/bronze/mulesoft_dev"
dt = DeltaTable(path, storage_options=storage_options)
write_deltalake(dt, df, mode='append')

And getting the below error:

deltalake.PyDeltaTableError: Failed to read delta log object: Generic MicrosoftAzure error: Account must be specified

roeap commented 2 years ago

You are missing a setting for AZURE_STORAGE_ACCOUNT_NAME :).

By the way, Azure unfortunately is somewhat "special" here. Usually I would recommend using one of the pyarrow.fs filesystems, as they require less transmission of data across language boundaries. For Azure, though, no native Arrow implementation exists, due to C++ version incompatibilities.

In case you have the capacity, I would be really interested to see the difference in performance between the way you specified it (which uses the Rust filesystem wrapped in several translation layers) and using adlfs wrapped in a SubTreeFileSystem, which still needs to cross language boundaries, but fewer of them...

Eventually, though, we will identify the bottlenecks and bring up performance.
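
For anyone who wants to try the adlfs route mentioned above, a rough sketch (account and key are placeholders, the table path is the one from this thread; it assumes adlfs is installed and that the installed deltalake release accepts a filesystem argument on to_pyarrow_table):

```python
import adlfs
import pyarrow.fs as pa_fs
from deltalake import DeltaTable

account_name, account_key = "<account-name>", "<account-key>"

# fsspec-based Azure filesystem, scoped to the table root with SubTreeFileSystem so that
# the relative file paths stored in the Delta log resolve correctly.
abfs = adlfs.AzureBlobFileSystem(account_name=account_name, account_key=account_key)
filesystem = pa_fs.SubTreeFileSystem(
    "helios-poc/bronze/mulesoft_dev",  # container/path to the table
    pa_fs.PyFileSystem(pa_fs.FSSpecHandler(abfs)),
)

# The delta log itself is still read through the Rust storage backend.
dt = DeltaTable(
    "az://helios-poc/bronze/mulesoft_dev",
    storage_options={
        "AZURE_STORAGE_ACCOUNT_NAME": account_name,
        "AZURE_STORAGE_ACCOUNT_KEY": account_key,
    },
)
df = dt.to_pyarrow_table(filesystem=filesystem).to_pandas()
```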

iamkk11 commented 2 years ago

I do not think AZURE_STORAGE_ACCOUNT_NAME is the issue because I have specified it and I can even get the delta table version. @roeap

storage_options = {}
storage_options['AZURE_STORAGE_ACCOUNT_KEY'] = "XXX"
storage_options['AZURE_STORAGE_ACCOUNT_NAME'] = "XX"
storage_options['AZURE_STORAGE_CONNECTION_STRING'] = "XX"
storage_options['AZURE_STORAGE_CLIENT_ID'] = "XX"
storage_options['AZURE_STORAGE_CLIENT_SECRET'] = "XX"
storage_options['AZURE_STORAGE_TENANT_ID'] = "XXX"

roeap commented 2 years ago

Just to clarify: are you using a build of main, or a released version? The released versions right now do not yet really support writing to remote storages.

If you are using main, this may be a bug, but you may be able to circumvent it by setting AZURE_STORAGE_ACCOUNT_NAME in the environment. The error itself originates in the file system builder, and it tells us that this specific config is missing.