NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
Apache License 2.0

[BUG] User cannot deploy Merlin image >=23.04 on Azure Databricks #1055

Open rnyak opened 12 months ago

rnyak commented 12 months ago

Bug description

The user reported this error when they tried to deploy a merlin-tensorflow image >= 23.04. They are able to deploy the merlin-tensorflow:23.02 image on Azure Databricks. One main difference between these images is the CUDA version.

Spark driver could not be reached on startup. This issue can be caused by invalid Spark configurations or malfunctioning [init scripts](https://docs.microsoft.com/azure/databricks/clusters/init-scripts#global-and-cluster-named-init-script-logs). Please refer to the Spark driver logs to troubleshoot this issue, and contact Databricks if the problem persists.

Internal error message: Spark failed to start: Could not connect to driver instance. Possible reason: network misconfiguration.

Steps/Code to reproduce bug

Expected behavior

Environment details

Additional context

An engineer from the RAPIDS team did some debugging of the Spark cluster issue that this user is facing with the merlin-tensorflow:23.04 image. They spent some time converting the instructions from https://docs.databricks.com/clusters/custom-containers.html#option-2-build-your-own-docker-base into tests that we can run with container-canary:

https://github.com/NVIDIA/container-canary/blob/main/examples/databricks.yaml

Here are some quick notes on running the test:

https://gist.github.com/jacobtomlinson/73f30f5657a370e7ed2a559b0eb7123f
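
To compare the working and failing images against that spec, something like the following could be used. This is only a rough sketch, not taken from the gist: it assumes the `canary` CLI is installed and on the PATH, that `databricks.yaml` has been saved locally at the hypothetical path `examples/databricks.yaml`, and that the image tags are prefixed with whatever registry path you pull them from. The helper names are made up for illustration.

```python
import shlex
import subprocess

# Assumption: local copy of the databricks.yaml example linked above.
SPEC = "examples/databricks.yaml"

# Image tags from this report; prepend your own registry path as needed.
IMAGES = ["merlin-tensorflow:23.02", "merlin-tensorflow:23.04"]


def build_canary_command(spec: str, image: str) -> list[str]:
    """Return the argv for validating one image against a canary spec."""
    return ["canary", "validate", "--file", spec, image]


def validate(spec: str, image: str) -> int:
    """Run canary for one image and return its exit status."""
    return subprocess.run(build_canary_command(spec, image)).returncode


# Print the commands so they can be copied into a shell.
for image in IMAGES:
    print(shlex.join(build_canary_command(SPEC, image)))
```

Calling `validate(SPEC, image)` for each tag should show which checks from the Databricks requirements pass on 23.02 but fail on 23.04.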