apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0

[Bug] When submitting a spark job in Yarn cluster mode, an error occurs that the resource file cannot be found. #6771

Open BohanZhang0222 opened 3 days ago

BohanZhang0222 commented 3 days ago

Describe the bug

I use the Kyuubi batch API v2 to submit Spark jobs in YARN cluster mode. When the node that receives the API call is not the node that submits the Spark job, the submission fails with an error saying the resource file cannot be found. My analysis is that when I call the API, Kyuubi places the uploaded resource file in a local directory, and this directory is not shared among the Kyuubi nodes. As a result, when the batch task is scheduled for submission on another node, that node cannot find the resource file.
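The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not Kyuubi code: the two temporary directories stand in for the local work directories of two Kyuubi nodes, and `upload_resource`, `submit_batch`, and `simulate` are hypothetical helpers introduced only for illustration.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def upload_resource(node_dir: Path, name: str, payload: bytes) -> Path:
    """Store an uploaded resource in the receiving node's local directory."""
    node_dir.mkdir(parents=True, exist_ok=True)
    path = node_dir / name
    path.write_bytes(payload)
    return path


def submit_batch(node_dir: Path, resource_name: str) -> str:
    """Pretend to submit a batch: succeeds only if this node can see the file."""
    # Each node looks for the resource in its own local directory,
    # which mimics separate hosts without shared storage.
    if not (node_dir / resource_name).exists():
        return "ERROR: resource file not found"
    return "SUBMITTED"


def simulate() -> Tuple[str, str]:
    with tempfile.TemporaryDirectory() as root:
        node_a = Path(root) / "node-a" / "work"  # node that receives the API call
        node_b = Path(root) / "node-b" / "work"  # node that picks up the batch
        node_b.mkdir(parents=True)
        resource = upload_resource(node_a, "app.jar", b"fake jar bytes")
        same_node = submit_batch(node_a, resource.name)
        other_node = submit_batch(node_b, resource.name)
        return same_node, other_node


if __name__ == "__main__":
    print(simulate())
```

Submission works when the same node handles both the upload and the submit step, and fails when a different node picks up the batch, which matches the behavior reported here.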

Affects Version(s)

1.9.1

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

No response

Kyuubi Server Configurations

kyuubi.batch.impl.version=2
kyuubi.batch.submitter.enabled=true

Kyuubi Engine Configurations

No response

Additional context

As a workaround, I tried pointing Kyuubi's work directory environment variable (kyuubi_work_dir) at shared storage, but this did not work: job submission then failed occasionally. The Spark submission log appears in the Kyuubi server and the corresponding batch ID can be found in the database, but the submission does not succeed and no YARN application ID is obtained. The batch status changes from PENDING to ERROR very quickly.

Calling the batch locallog interface returned no useful error content. (This happened in a production environment that has since been rolled back, so no screenshots are available.) However, the locallog output does mention the path of the detailed error log, which is a log file in the username subdirectory of the Kyuubi work directory (the shared directory configured through the environment variable).

When I opened this log file, its content described a different job. That is when I realized that sharing the work directory across nodes may have caused conflicts between jobs.

Suspecting that the uploaded resource files might conflict as well, I executed a query to check. [image]

This confirms that a shared work directory causes resource file and log conflicts across nodes. However, I cannot confirm whether this is the cause of the occasional submission failures.
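To illustrate why a shared directory can lead to such conflicts, here is a minimal sketch under stated assumptions. The issue does not say how Kyuubi names files inside the work directory, so both path layouts below (`naive_path`, `isolated_path`) are hypothetical: the first keys files only by user and file name, the second adds a unique per-batch component.

```python
import tempfile
import uuid
from pathlib import Path
from typing import Tuple


def naive_path(shared: Path, user: str, name: str) -> Path:
    """Hypothetical layout: <shared>/<user>/<name>."""
    return shared / user / name


def isolated_path(shared: Path, user: str, batch_id: str, name: str) -> Path:
    """Hypothetical layout with a per-batch directory: <shared>/<user>/<batch_id>/<name>."""
    return shared / user / batch_id / name


def write(path: Path, payload: bytes) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)


def demo() -> Tuple[bytes, bytes, bytes]:
    with tempfile.TemporaryDirectory() as root:
        shared = Path(root)

        # Two nodes write a file with the same user and name into shared storage.
        first = naive_path(shared, "alice", "app.jar")
        write(first, b"job-1")
        write(naive_path(shared, "alice", "app.jar"), b"job-2")
        clobbered = first.read_bytes()  # job 1 now sees job 2's content

        # With a unique batch ID in the path, the two writes no longer collide.
        id_1, id_2 = str(uuid.uuid4()), str(uuid.uuid4())
        path_1 = isolated_path(shared, "alice", id_1, "app.jar")
        path_2 = isolated_path(shared, "alice", id_2, "app.jar")
        write(path_1, b"job-1")
        write(path_2, b"job-2")
        return clobbered, path_1.read_bytes(), path_2.read_bytes()


if __name__ == "__main__":
    print(demo())
```

In the naive layout the first job ends up reading the second job's content, which resembles the symptom of a log file describing another job. Whether Kyuubi's actual layout behaves this way would need to be checked against the server code.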

Are you willing to submit PR?

github-actions[bot] commented 3 days ago

Hello @BohanZhang0222, Thanks for finding the time to report the issue! We really appreciate the community's efforts to improve Apache Kyuubi.