delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.63k stars 1.71k forks

Obligation to hardcode full path in local Delta Lake #555

Open gabmartini opened 4 years ago

gabmartini commented 4 years ago

Hello! I found out that you cannot query a local Delta Lake using SQL directly on the files unless you use the full path (from root to the Delta Lake directory). I know that Delta Lake is not meant to be used on standalone clusters or non-distributed file systems, but for testing (and perhaps, in the future, for reaching a wider audience) it's not a bad idea to create a Delta Lake in a local directory to try it out. This comes up when you work in Visual Studio Code or another IDE that changes the working directory to the folder you have opened and lets you use relative paths.

If you use a relative path such as data/delta-test and try to query that Delta Lake in a PySpark session, you get: `pyspark.sql.utils.AnalysisException: Unsupported data source type for direct query on files: delta;;` But if you use /home/gabmartini/data/delta-test, it works.
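A possible workaround, as a minimal sketch: resolve the relative path to an absolute one before building the query. The helper name `delta_path_query` is hypothetical, and the commented-out `spark.sql` call assumes a PySpark session named `spark` that is already configured for Delta Lake.

```python
import os


def delta_path_query(path: str) -> str:
    """Build a direct-on-files query string from a possibly relative path.

    Hypothetical helper: it only resolves the path against the current
    working directory and wraps it in backticks after the delta prefix.
    """
    absolute = os.path.abspath(path)
    return f"SELECT * FROM delta.`{absolute}`"


query = delta_path_query("data/delta-test")
print(query)

# Assumption: `spark` is an existing SparkSession with Delta Lake enabled.
# spark.sql(query).show()
```

The printed query depends on the directory the session was started from, which is the same directory the IDE would use to resolve relative paths.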

Very frustrating for beginners, to be honest. Perhaps a very explicit remark in the documentation would help clear that up.

Thanks and keep up the good work!

mgill25 commented 3 years ago

Can confirm, I encountered this issue. It wasn't fun to debug; I kept thinking the error had something to do with my local spark shell configuration. Quite annoying when trying to tweak various configurations via ALTER TABLE commands.

felipepessoto commented 1 year ago

Do we have any update on this?

scottsand-db commented 1 year ago

This seems like a good starter task if anyone wants to pick it up!

felipepessoto commented 1 year ago

Do we know of any particular reason for the existing restriction?

Wouldn't relative paths cause issues for queries like: SELECT * FROM delta.myrelativepath?

It could be either a Delta table at myrelativepath or a table called "myrelativepath" in the delta database.

scottsand-db commented 1 year ago

Linking this to https://github.com/delta-io/delta/issues/1572

ryan-johnson-databricks commented 1 year ago

From what I understand, we use absolute paths because that's the only way to disambiguate the SQL grammar for commands like VACUUM, which explicitly recognize a path as different from an identifier while lexing (one has a leading / and the other doesn't). Otherwise, as @scottsand-db pointed out, delta.foo could be a catalog table or a path, and it's sketchy semantics at best to check both (e.g. which one prevails if both are present?).
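To illustrate the leading-slash rule described above, here is a toy sketch. It is not Delta's actual grammar or lexer; the function name `classify_target` and the two labels are assumptions made only for this example.

```python
def classify_target(token: str) -> str:
    """Toy rule: a leading '/' marks a path; anything else is an identifier."""
    return "path" if token.startswith("/") else "identifier"


for token in ["/home/gabmartini/data/delta-test", "data/delta-test", "foo"]:
    # A relative path has no leading '/', so under this rule it cannot be
    # told apart from a catalog identifier.
    print(token, "->", classify_target(token))
```

Under such a rule a relative path lands in the same bucket as a table name, which is the ambiguity being discussed.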