kyrre opened this issue 2 days ago
Hi @kyrre! Thanks for opening this issue! One thing that could help us move forward on this front is to have a reproducible example (e.g. a table in Azure that we can observe takes 5s to read with delta-spark and 30s to read with kernel). Do you have a chance to look into that?
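For reference, a minimal sketch of what a reproduction could look like, timing just the `_delta_log` listing that is the suspected bottleneck. It assumes a recent `object_store` (0.9+); the account, key, container, and table path are all placeholders:

```rust
use futures::StreamExt;
use object_store::{azure::MicrosoftAzureBuilder, path::Path, ObjectStore};

#[tokio::main]
async fn main() -> object_store::Result<()> {
    // Placeholder credentials -- substitute a real account/container.
    let store = MicrosoftAzureBuilder::new()
        .with_account("myaccount")
        .with_access_key("...")
        .with_container_name("mycontainer")
        .build()?;

    // List every object under the table's _delta_log and time it.
    let prefix = Path::from("path/to/table/_delta_log");
    let start = std::time::Instant::now();
    let mut stream = store.list(Some(&prefix));
    let mut count = 0usize;
    while let Some(meta) = stream.next().await {
        meta?; // surface any listing error
        count += 1;
    }
    println!("listed {count} objects in {:?}", start.elapsed());
    Ok(())
}
```

If the wall-clock time here tracks the slow kernel read, that would confirm listing is the culprit.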
Also, yeah, I recall there being some limitations with ADLS listing; it would be useful to document them here (along with any of the limitations exposed via object_store).
Yes, IIRC the ADLS listing API is very limited compared to S3 or GCS. In particular, you can't specify a lower bound on the listing, so it just lists the whole directory every time. Given that this doesn't seem to be a problem in Spark, we should check what the ADLS Hadoop client is doing. If it has some clever workaround, we should file an issue upstream for object_store to incorporate that same trick.
Unfortunately, I don't know that there's a lot we could do from the kernel side if object_store doesn't make this efficient -- kernel is not generally in the business of solving cloud store API issues.
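To make the lower-bound point concrete, here is a sketch of the relevant `object_store` call, assuming its current trait API; the log-prefix and checkpoint paths are illustrative. On S3, `list_with_offset` maps to `ListObjectsV2`'s `start-after` parameter; the Azure listing API has no equivalent, so (IIUC) the default trait implementation lists the whole prefix and filters client-side:

```rust
use futures::TryStreamExt;
use object_store::{path::Path, ObjectMeta, ObjectStore};

/// Fetch log entries strictly after the latest checkpoint file.
async fn commits_after_checkpoint(
    store: &dyn ObjectStore,
    log_prefix: &Path, // e.g. "table/_delta_log"
    checkpoint: &Path, // e.g. ".../00000000000000000100.checkpoint.parquet"
) -> object_store::Result<Vec<ObjectMeta>> {
    // S3/GCS implementations can push this offset down to the server;
    // the Azure implementation falls back to `location > offset`
    // filtering applied after listing the entire prefix.
    store
        .list_with_offset(Some(log_prefix), checkpoint)
        .try_collect()
        .await
}
```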
Please describe why this is necessary.
Querying tables in Azure Data Lake Storage that have a large number of transactions is extremely slow.
This happens because the object_store crate is not able to list only the blobs that were created after the latest checkpoint, which appears to be a limitation of the Azure API. This does not happen when using Apache Spark (Databricks): there the operation finishes almost instantly, but it's not clear what it does differently.
Describe the functionality you are proposing.
There has to be some trick that Apache Spark is using. Perhaps it's possible to use lastModified to do the filtering?
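A sketch of what that filter could look like, using the `last_modified` field on `object_store`'s `ObjectMeta` (the checkpoint timestamp is an assumed input). Note this would still list the entire directory; it would only avoid reading the older commit files:

```rust
use chrono::{DateTime, Utc};
use futures::TryStreamExt;
use object_store::{path::Path, ObjectMeta, ObjectStore};

/// Keep only log files modified after the checkpoint's timestamp.
async fn commits_modified_after(
    store: &dyn ObjectStore,
    log_prefix: &Path,
    checkpoint_time: DateTime<Utc>,
) -> object_store::Result<Vec<ObjectMeta>> {
    store
        .list(Some(log_prefix))
        // The listing still enumerates every blob under the prefix;
        // last_modified only lets us discard entries without reading them.
        .try_filter(move |meta| {
            futures::future::ready(meta.last_modified > checkpoint_time)
        })
        .try_collect()
        .await
}
```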
Additional context
This will happen whenever the table is a streaming destination, since those accumulate many small commits, so it's very unfortunate.