OpenLineage / OpenLineage

An Open Standard for lineage metadata collection
http://openlineage.io
Apache License 2.0

spark: port included in host causes crash of listener #795

Open mobuchowski opened 2 years ago

mobuchowski commented 2 years ago

Issue reported on Slack: https://openlineage.slack.com/archives/C01CK9T7HKR/p1653624210694359?thread_ts=1651498343.959749&cid=C01CK9T7HKR

Following up on this, as I encounter the same issue with the OpenLineage Databricks integration. This issue is quite severe, as it crashes the Spark context and requires a restart. I have Marquez running on AWS EKS; I’m using OpenLineage 0.8.2 on Databricks 10.4 (Spark 3.2.1), and my Spark config looks like this:

spark.openlineage.host https://internal-xxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxx.us-east-1.elb.amazonaws.com/
spark.openlineage.namespace default
spark.openlineage.version v1 <- also tried "1"

I can run some simple read and write commands and successfully find the log4j events highlighted in the docs:

INFO SparkContext;
INFO OpenLineageContext;
INFO AsyncEventQueue for each time I run the cell

After doing this a few times I get: "The spark context has stopped and the driver is restarting. Your notebook will be automatically reattached." stderr shows a bunch of things. log4j shows the same as for Kostikey: "ERROR EventEmitter: [...] Unable to serialize logical plan due to: Infinite recursion (StackOverflowError)".

I have one more piece of information which I can’t make much sense of, but hopefully someone else can: if I include the port in the host, I can very reliably crash the Spark context on the first attempt. So:

https://internal-xxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxx.us-east-1.elb.amazonaws.com/ <- crashes after a couple of attempts; sometimes it takes me a while to reproduce it while repeatedly reading/writing the same datasets
https://internal-xxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxx.us-east-1.elb.amazonaws.com:80/ <- crashes on first try

Any insights would be greatly appreciated!
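When comparing the two host values above, it can help to see exactly how they differ once parsed. The following is a minimal sketch using only the Python standard library; it does not use any OpenLineage code, and how the Spark listener itself parses `spark.openlineage.host` is not shown in this thread. The hostname `marquez.example.internal` and the helper `describe_host` are hypothetical stand-ins for the redacted load balancer address.

```python
from urllib.parse import urlsplit


def describe_host(url: str) -> dict:
    """Split a host URL into the parts that differ between the two variants."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "hostname": parts.hostname,
        "port": parts.port,  # None when no port is written in the URL
        "path": parts.path,
    }


# Hypothetical placeholder host; the real value is redacted in the report.
without_port = describe_host("https://marquez.example.internal/")
with_port = describe_host("https://marquez.example.internal:80/")

for label, info in (("without port", without_port), ("with port", with_port)):
    print(label, info)
```

The two variants differ only in the explicit port. Note that the second one pairs the https scheme with port 80, while the HTTPS default is 443; whether that plays any role in the crash is not established here.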

rejulius commented 2 years ago

I ran some more experiments, this time with a fake host and on OpenLineage 0.9.0, and was not able to reproduce the issue with regard to the port. Instead, the new experiments suggest that Spark 3.2 is involved.

On Spark 3.2.1 / Databricks 10.4 LTS: Using the (fake) host http://ac7aca38330144df9.amazonaws.com:5000 crashes the cluster when the first notebook cell is evaluated, with "The spark context has stopped and the driver is restarting." The same occurs when the port is removed.

On Spark 3.1.2 / Databricks 9.1 LTS: Using the (fake) host http://ac7aca38330144df9.amazonaws.com:5000 does not impede the cluster but, as expected, produces the following for each lineage event: "ERROR EventEmitter: Could not emit lineage w/ exception io.openlineage.client.OpenLineageClientException: java.net.UnknownHostException". The same occurs when the port is removed.
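The four runs above can be written down as a small table to make the reasoning explicit. This is only a sketch that transcribes the reported observations (OpenLineage 0.9.0, fake host); the field names and the helper `outcome_depends_on` are illustrative, not part of any OpenLineage tooling.

```python
# Observations transcribed from the experiments described above.
observations = [
    {"spark": "3.2.1", "port_in_host": True, "driver_crashed": True},
    {"spark": "3.2.1", "port_in_host": False, "driver_crashed": True},
    {"spark": "3.1.2", "port_in_host": True, "driver_crashed": False},
    {"spark": "3.1.2", "port_in_host": False, "driver_crashed": False},
]


def outcome_depends_on(factor: str) -> bool:
    """Return True if different values of `factor` see different sets of outcomes."""
    outcomes_by_value = {}
    for row in observations:
        outcomes_by_value.setdefault(row[factor], set()).add(row["driver_crashed"])
    # If every value of the factor sees the same set of outcomes,
    # these runs give no evidence that the factor matters.
    distinct = {frozenset(outcomes) for outcomes in outcomes_by_value.values()}
    return len(distinct) > 1


print("depends on Spark version:", outcome_depends_on("spark"))
print("depends on port in host:", outcome_depends_on("port_in_host"))
```

In these four runs the outcome changes with the Spark version and not with the presence of the port, which is what points away from the port hypothesis in the issue title. Four runs are a small sample, so this narrows the search without proving a root cause.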

rejulius commented 2 years ago

The "Azure Databricks - OpenLineage - Microsoft Purview" integration suffers from the same issue; theirs is tracked here: https://github.com/microsoft/Purview-ADB-Lineage-Solution-Accelerator/issues/14