[X] I have searched in the issues and found no similar issues.
Describe the bug
I use the Kyuubi batch API v2 to submit Spark jobs to a YARN cluster.
When the Kyuubi node that receives the API call is not the node that submits the Spark job, the submission fails with an error saying the resource file cannot be found. My analysis: when I call the API, Kyuubi stores the uploaded resource file in a local directory, and this directory is not shared among the Kyuubi nodes. When the batch is then scheduled for submission on another node, that node cannot find the resource file.
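The failure mode described above can be reproduced in miniature. The following is only a sketch under stated assumptions: the two temporary directories stand in for the node-local work directories of two Kyuubi nodes, and the `uploads/<batch id>/<file name>` layout and the helper names are hypothetical, not Kyuubi's actual implementation.

```python
import tempfile
from pathlib import Path


def save_upload(node_root: Path, batch_id: str, name: str, data: bytes) -> Path:
    """Store an uploaded resource under one node's local work directory.

    Returns the path relative to the node root (hypothetical layout).
    """
    rel = Path("uploads") / batch_id / name
    target = node_root / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return rel


def resource_exists(node_root: Path, rel: Path) -> bool:
    """Check whether a node can see the resource under its own work directory."""
    return (node_root / rel).is_file()


# Two separate temporary directories simulate two nodes with unshared local disks.
with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
    node_a, node_b = Path(dir_a), Path(dir_b)

    # Node A receives the API call and stores the uploaded file locally.
    rel = save_upload(node_a, "batch-1", "app.jar", b"jar bytes")

    # Node B is later chosen to submit the batch and looks for the same file.
    found_on_a = resource_exists(node_a, rel)
    found_on_b = resource_exists(node_b, rel)
    print(found_on_a, found_on_b)
```

In this simulation the node that received the upload can find the file, while the other node cannot, which matches the "resource file cannot be found" symptom.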
The workaround I tried: Kyuubi has an environment variable, `kyuubi_work_dir`, which I changed to point to shared storage.
This did not work.
The problem is that job submissions now fail intermittently. The Spark submission log appears in the Kyuubi server log and the corresponding batch id exists in the database, but the submission does not succeed and no YARN application id is obtained. The batch state in Kyuubi changes from PENDING to ERROR very quickly.
Calling the batch locallog interface returned no useful error content. (This happened in the production environment, which has since been rolled back, so I cannot provide screenshots.) However, the locallog output mentions the path of the detailed error log, which is a log file in the username subdirectory of the Kyuubi work directory (the shared directory configured through the environment variable).
When I opened this log file, its content described a different job. At that point I realized that sharing the work directory across nodes may have caused conflicts between jobs.
I suspected that the uploaded resource files might conflict as well, so I executed the following query.
This confirms that a shared work directory causes resource file and log conflicts between nodes.
However, I cannot confirm whether this is the cause of the intermittent submission failures.
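To illustrate the kind of conflict meant here, the sketch below simulates two nodes writing to the same relative path under one shared directory. Everything in it is an assumption made for illustration: the file name `batch.log.0`, the user `alice`, and the per-node subdirectory are hypothetical, and this is not Kyuubi's actual naming scheme or a confirmed fix.

```python
import tempfile
from pathlib import Path
from typing import Optional


def log_path(shared_root: Path, user: str, log_name: str,
             node_id: Optional[str] = None) -> Path:
    """Build a log path under the shared work directory.

    If node_id is given, the path is placed in a per-node subdirectory
    (a hypothetical way to keep nodes from sharing file names).
    """
    base = shared_root / user
    if node_id is not None:
        base = base / node_id
    return base / log_name


def write_log(path: Path, text: str) -> None:
    """Create parent directories and overwrite the log file with text."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)


with tempfile.TemporaryDirectory() as shared_dir:
    shared = Path(shared_dir)

    # Both nodes independently pick the same file name in the shared directory.
    path_a = log_path(shared, "alice", "batch.log.0")
    path_b = log_path(shared, "alice", "batch.log.0")
    write_log(path_a, "job submitted by node A")
    write_log(path_b, "job submitted by node B")
    collided = path_a.read_text()  # node A now reads node B's content

    # With a per-node subdirectory the two logs no longer overwrite each other.
    iso_a = log_path(shared, "alice", "batch.log.0", node_id="node-a")
    iso_b = log_path(shared, "alice", "batch.log.0", node_id="node-b")
    write_log(iso_a, "job submitted by node A")
    write_log(iso_b, "job submitted by node B")
    isolated = (iso_a.read_text(), iso_b.read_text())
```

In the first half, the log that node A reads back describes node B's job, which is the same symptom as a log file whose content describes another job. Whether Kyuubi's real file names collide in this way on shared storage is exactly what the query above was meant to check.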
Are you willing to submit PR?
[ ] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
Affects Version(s)
1.9.1
Kyuubi Server Log Output
No response
Kyuubi Engine Log Output
No response
Kyuubi Server Configurations
Kyuubi Engine Configurations
No response