delta-io / delta-kernel-rs

A native Delta implementation for integration with any query engine
Apache License 2.0
153 stars 42 forks source link

Improve performance when querying tables with many transactions in Azure #534

Open kyrre opened 2 days ago

kyrre commented 2 days ago

Please describe why this is necessary.

Querying tables in Azure Data Lake Storage that have a lot of transactions takes forever.

This happens because the object_store crate is not able to simply list the blobs that were created after the latest checkpoint. This appears to be a limitation in the Azure API.

This does not happen when using Apache Spark (Databricks). In this case the operation finishes almost instantly, but it's not clear what they do differently.

Describe the functionality you are proposing.

There has to be some trick that Apache Spark is using. Perhaps it's possible to use lastModified to do the filter?

Additional context

This will happen whenever the table is a streaming destination, so it's very unfortunate.

zachschuermann commented 1 day ago

Hi @kyrre! Thanks for opening this issue! One thing that could help us move forward on this front is to have a reproducible example (e.g. a table in Azure that we can observe takes 5s to read with delta-spark and 30s to read with kernel). Do you have a chance to look into that?

Also, yea, I recall there being some limitations with ADLS listing would be useful to document them here (and any of the limitations exposed via object_store)

scovich commented 1 day ago

Yes, IIRC ADLS listing API is very limited compared to S3 or GCS. In particular, you can't specify a lower bound on the listing, so it just lists the whole directory every time. Given that this doesn't seem to be a problem in spark, we should check what the ADLS hadoop client is doing. If it has some clever workaround, we should file an issue upstream for object_store to incorporate that same trick.

Unfortunately, I don't know that there's a lot we could do from the kernel side, if object_store doesn't make this efficient -- kernel is not generally in the business of solving cloud store API issues.