delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and with APIs for multiple languages
https://delta.io
Apache License 2.0

_delta_log permission issue #471

Open Lukas012 opened 4 years ago

Lukas012 commented 4 years ago

Environment: Spark Standalone in a distributed cluster. The Spark worker nodes run as userid "X". The Spark driver runs as userid "Peter" and starts a Spark job which creates a Delta Lake table.

Problem: The part-000.....snappy.parquet files are written by user "X" (which is correct from my point of view). The folder _delta_log, however, belongs to user "Peter", as do the files created inside it.

When the Spark driver starts another Spark job as userid "Tom", it fails with "Permission denied", because Tom has no access to _delta_log since the folder belongs to Peter.

Expected behavior: All files and folders are created by userid "X".

Lukas012 commented 4 years ago

Moreover, Delta Lake also ignores "spark.hadoop.fs.permissions.umask-mode" for the _delta_log folder.

tdas commented 4 years ago

From the documentation of umask-mode, it seems that just passing it in the Hadoop configuration should work. We use the SparkSession's Hadoop configuration for all file operations. Have you tried setting this Hadoop configuration in the Spark session, and it still did not create files with the right umask?
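
For reference, a minimal sketch of two common ways to pass this setting, assuming PySpark (the value "002" is just an example, and the second route goes through PySpark's internal _jsc handle):

from pyspark.sql import SparkSession

# 1) At session creation: any "spark.hadoop.*" key is copied into the
#    Hadoop configuration used for file operations.
spark = SparkSession \
    .builder \
    .config("spark.hadoop.fs.permissions.umask-mode", "002") \
    .getOrCreate()

# 2) On an existing session, via the underlying Hadoop configuration
#    (internal API, so it may change between Spark versions).
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.permissions.umask-mode", "002")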

Lukas012 commented 4 years ago

Yes, that is exactly what I did. As I said, the problem is only with the _delta_log folder. It seems to me that Delta creates this folder as the user who executes the Spark driver job. I assume this problem only occurs in Spark's 'client' mode (which is currently the only option for Python jobs in standalone mode).

tdas commented 4 years ago

That is indeed odd. Can you elaborate on how exactly (code? conf file? command line conf?) you are setting spark.hadoop.fs.permissions.umask-mode before creating the Delta table? What umask value are you using, and what umask are you actually seeing on _delta_log?

Lukas012 commented 3 years ago

Yes. The code itself is not complicated, though; the main work is setting up a standalone cluster and submitting the job via Spark's client mode. The driver is executed by Peter; the Spark workers run as "X".

from pyspark.sql import SparkSession
import databricks.koalas as ks

# Build the session with the umask set via the Hadoop configuration.
spark = SparkSession \
    .builder \
    .master("spark://my_spark_master.my.company:7077") \
    .config("spark.hadoop.fs.permissions.umask-mode", "000") \
    .appName("My App") \
    .getOrCreate()

data = {'name': ["Hans"],
        'address': ["miller street 3"]
        }
df = ks.DataFrame(data)
df.to_delta("/my/path/to/the/table", mode='overwrite')

Result: (screenshot of the resulting file ownership omitted)
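
A hedged verification sketch of the same result in text form (assuming the table lives on a local Unix filesystem; os and pwd are Python standard library):

import os
import pwd

# Print the owner of each entry directly under the table path
# after the write above.
table_path = "/my/path/to/the/table"
for entry in sorted(os.listdir(table_path)):
    owner = pwd.getpwuid(os.stat(os.path.join(table_path, entry)).st_uid).pw_name
    print(f"{owner:10s} {entry}")
# Per this report: the part-*.snappy.parquet files are owned by "X",
# while _delta_log (and the JSON files inside it) is owned by "Peter".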

zsxwing commented 3 years ago

This is because _delta_log and the files inside it are created on the driver, while the Parquet files are created on the Spark workers. But IIUC, the correct behavior should be that all files belong to Peter, since he runs the job. Is there a reason you would prefer the opposite?

Lukas012 commented 3 years ago

AFAIK, Spark's default behavior when creating files (e.g. CSV) is that the owner of the files is the user running the worker nodes.
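
A minimal sketch of that comparison (hypothetical output paths, reusing the session and koalas DataFrame from the snippet above):

# Write the same data once as plain CSV and once as a Delta table.
df.to_spark().write.mode("overwrite").csv("/my/path/to/csv_out")
df.to_delta("/my/path/to/delta_out", mode='overwrite')
# Behavior described in this thread: the CSV part files end up owned by
# the worker user "X", while delta_out/_delta_log is owned by the driver
# user "Peter".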

ziadrida commented 3 years ago

So where did this conversation end? I am having the same issue:

Py4JJavaError: An error occurred while calling o26169.save. : java.io.FileNotFoundException: //_delta_log/00000000000000000002.json (Permission denied) at java.io.FileOutputStream.open0(Native Method)

The workers are running as root and the driver as zeppelin.

travisclagrone commented 3 years ago

This issue also manifests in Delta Lake on Azure Databricks over an Azure Data Lake Storage remote store.

brucenelson6655 commented 3 years ago

Using fs.permissions.umask-mode 002 does solve the problem for Databricks Delta on Azure ADLS Gen2. I recommend setting the Spark config at the cluster level; notebook-scoped settings seem to be unreliable.

jesseryoungVUMC commented 3 years ago

The fix that @brucenelson6655 mentions does indeed resolve this issue when using Azure ADLS Gen2. However, according to the ADLS documentation (https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-access-control#umask), the correct mask should be "007". You can test this in a notebook in Azure Databricks, but you need to Detach & Reattach before changing the configuration setting.

Either

spark.conf.set("fs.permissions.umask-mode", "007")

or in the Spark Config setting of your cluster

fs.permissions.umask-mode 007

I haven't dug into the code for Delta, but I suspect this to be an issue with the Hadoop ABFS driver and not an issue with Delta.

brucenelson6655 commented 3 years ago

Remember that umask works by bitwise-ANDing the base permissions with the bitwise NOT of the mask. So 007 gives user/group rwx and others no access, whereas 002 gives user/group rwx and others read (plus execute on directories) but not write.
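
A worked example of that arithmetic (a small Python sketch; 777 is the typical base for new directories):

# effective = base & ~umask, computed on the 9 permission bits.
base = 0o777  # typical base permissions for a new directory
for umask in (0o002, 0o007):
    effective = base & ~umask & 0o777
    print(f"umask {umask:03o} -> {effective:03o}")
# umask 002 -> 775  (user/group rwx, others r-x)
# umask 007 -> 770  (user/group rwx, others ---)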