Open · caldempsey opened this issue 8 months ago
This needs to be fixed with a refinement to the overall architecture as above. It's not a bug, just that PySpark can only run in client (driver) mode when connecting to a standalone cluster, so the notebook process itself acts as the driver.
Databricks might also have a solution for this with their latest DataLake connectors. Kind of a game-changer in the space. Something to read up on.
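For illustration, here's a minimal sketch of what that client-mode connection looks like from the notebook side. The master URL and app name are placeholders rather than values from the linked repo; the point is just that the process building the session (the notebook) is the driver.

```python
from pyspark.sql import SparkSession

# Building the session inside the notebook makes the notebook process the
# driver; a standalone master does not support cluster deploy mode for
# Python applications, so the driver cannot be shipped onto the cluster.
spark = (
    SparkSession.builder
    .appName("notebook-delta-writes")       # placeholder app name
    .master("spark://spark-master:7077")    # placeholder master URL
    .getOrCreate()
)
```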
Describe the problem
Reproduced in the notebook on https://github.com/caldempsey/docker-notebook-spark-s3/pull/6
At present we have set up a Jupyter Notebook w/ PySpark connected to a Spark cluster, where the Spark instance is intended to perform writes to a Delta table. I'm observing that the Spark instance fails to complete the writes if the Jupyter Notebook doesn't have access to the data location.
This behavior seems counterintuitive to me as I expect the Spark instance to handle data writes independently of the Jupyter Notebook's access to the data.
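As a rough sketch of the failing write (the master URL, table path, and Delta session configs below are placeholders/standard Delta Lake settings, not copied from the notebook):

```python
from pyspark.sql import SparkSession

# Standard Delta Lake session extensions; the master URL and output path
# are placeholders for whatever the compose stack actually uses.
spark = (
    SparkSession.builder
    .appName("delta-write-repro")
    .master("spark://spark-master:7077")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.range(0, 1000)

# When /data is not mounted into the notebook container, this is the call
# that reports it cannot write _delta_log, even though the workers still
# manage to write Parquet files under the table path.
df.write.format("delta").mode("overwrite").save("/data/example_table")
```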
Steps to reproduce
Via the repo provided, remove the notebook's `data` volume mount (`../../../notebook-data-lake/data:/data`), which prevents the notebook from accessing `/data` at the same target shared with the Spark Master and Workers on their local filesystem.
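Concretely, the reproduction is a one-line change to the notebook service's volumes, roughly like the fragment below (the service name and the extra mount are assumptions about the compose file, not copied from it):

```yaml
services:
  notebook:
    volumes:
      # Commenting out this mount is the reproduction step: the Spark master
      # and workers keep their own /data mounts, but the notebook (driver)
      # can no longer see /data.
      # - ../../../notebook-data-lake/data:/data
      - ./notebooks:/home/jovyan/work   # placeholder for the remaining mounts
```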
Observed results
When the notebook has access to `/data` (but is a connected application, not a member of the cluster), Delta Tables write successfully with `_delta_log`.
When the notebook does not have access to `/data`, it complains that it can't write `_delta_log`, but parquet files still get written!
Expected results
Expect the `_delta_log` to be written regardless of whether the Notebook has access to the target filesystem.
Further details
Since this error is surfacing from PySpark, I'm wondering whether the Notebook instance is somehow electing itself master via PySpark, or whether there's a bug in Delta Lake where you can't write Delta tables without the application call-site having access to the location. Neither of these sounds right, but I can't think of a third explanation.
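A quick way to check the "notebook electing itself master" theory from inside the notebook, as a sketch (the expected values are assumptions about this setup):

```python
from pyspark.sql import SparkSession

# Assumes a session has already been created in the notebook; we just pick
# it up rather than building a new one.
spark = SparkSession.getActiveSession()
sc = spark.sparkContext

# If the notebook had silently fallen back to being its own master, sc.master
# would read "local[*]" instead of the standalone spark:// URL; deployMode
# confirms the driver runs client-side, i.e. inside the notebook process.
print(sc.master)      # e.g. spark://spark-master:7077 (placeholder host)
print(sc.deployMode)  # "client"
```

If both values look right, that points back to the note at the top of this issue: with a client-mode driver in the notebook, the driver itself takes part in the Delta commit, which would explain why the table path has to be reachable from the notebook container as well as from the workers.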
Feel free to have a gander or submit a PR 🙏 !
Environment information