delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
2.21k stars 395 forks source link

feat(python): add DeltaTable.is_deltatable static method (#2662) #2715

Closed omkar-foss closed 2 months ago

omkar-foss commented 2 months ago

closes https://github.com/delta-io/delta-rs/issues/2662.

Description

This adds a static method is_deltatable(path, opts) to the DeltaTable Python class, which returns True if able to locate a delta table at specified path and False otherwise.

It does so by reusing the Rust internal is_delta_table_location() via the DeltaTableBuilder.

Additionally, this also adds documentation of the usage with examples for the DeltaTable.is_deltatable() method.

ion-elgreco commented 2 months ago

@omkar-foss thanks for the PR, I do believe we should do a bit differently. Now we are loading the state of the table, which could be potentially very large. We could also simply look if _delta_log exists or not, and whether that is then also not empty.

Or we do some other kind of lazy loading where we simply read the first log entry

omkar-foss commented 2 months ago

@ion-elgreco Yes I completely agree. I wasn't aware that the DeltaTable() constructor loads the entire table state Delta table eagerly (thanks for letting know!). If that's the case then indeed, checking the _delta_log and it's emptiness will be a lighter and a constant-time operation than checking the table state which might grow large and expensive.

As a matter of fact, the IsDeltaTable in Delta Lake OSS (Spark version) also checks for _delta_log and it's emptiness (see here), so I suppose it should be perfectly appropriate to apply the same logic here as well. It'll provide consistency of functionality as well for Delta Lake users.

I'll update the PR accordingly, thanks for your quick review 😁👍🏽

ion-elgreco commented 2 months ago

@omkar-foss there might be some internal code on the rust side that does something like this already, I am not certain, but worth it to check

omkar-foss commented 2 months ago

Hey @ion-elgreco, I've updated the PR as discussed.

I've attempted to reuse the Rust side internal is_delta_table_location() (code here) which checks for the _delta_log and returns a boolean result, and it's working well.

Kindly check out the PR when you get some time and let me know if any improvement suggestions or changes required. Thanks!

ion-elgreco commented 2 months ago

@omkar-foss Nice work! Thanks for the contribution :)