Closed timrobertson100 closed 9 months ago
Every job name contains `datasetKey_attempt`, so you can go to the YARN UI and filter by that. For example, the job from the logs: http://c5master2-vh.gbif.org:8088/cluster/app/application_1677232479944_1517365
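To illustrate the filtering described above, here is a minimal sketch. It assumes the `<datasetKey>_<attempt>` naming convention mentioned in this thread; the dataset key is the one from the linked logs, and the `yarn` CLI commands shown in comments require cluster access:

```shell
# Compose the application name used for filtering, assuming the
# "<datasetKey>_<attempt>" convention described above.
dataset_key="4a99b94e-0c47-4a0b-bb8f-68248e484806"
attempt="1"
app_name="${dataset_key}_${attempt}"
echo "$app_name"

# With cluster access, that name can then be used to locate the job and
# pull its logs (standard YARN CLI commands):
#   yarn application -list -appStates ALL | grep "$app_name"
#   yarn logs -applicationId <application_id>
```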
Since we can control the application name and match it against a job ID, we could also add a Spark listener to improve the information reported at different stages of a Spark application: https://spark.apache.org/docs/3.1.2/api/java/index.html?org/apache/spark/scheduler/SparkListener.html
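A sketch of how the two ideas combine at submission time. The listener class and pipeline class names below are assumptions for illustration, not existing GBIF classes; `spark.extraListeners` is Spark's standard mechanism for registering a `SparkListener` implementation, and `--name` sets the application name shown in the YARN UI:

```shell
# Sketch, not a definitive implementation: submit with a deterministic
# application name and a hypothetical custom listener, so failures can be
# matched back to the dataset in the YARN UI and history server.
dataset_key="4a99b94e-0c47-4a0b-bb8f-68248e484806"
attempt="1"

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name "${dataset_key}_${attempt}" \
  --conf spark.extraListeners=org.gbif.pipelines.DiagnosticSparkListener \
  --class org.gbif.pipelines.HypotheticalIndexingPipeline \
  pipelines.jar
```

This requires a running YARN cluster, so it is shown as a configuration fragment only.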
I don't see any logging issues at the moment: for Java issues, go to ELK; for YARN issues, go to the YARN UI.
java.lang.IllegalStateException: Process has been finished with exit value - 1
I would just correct the logging message to suggest checking the YARN logs for the job name.
Spark logs can be redirected to ELK; we have all the code and configs. Personally, I find the YARN UI better for reading Spark issues.
Thanks @muttcg
Error: Process failed in distributed Job. Check yarn logs
Would be a helpful change. I hadn't realized the dataset key could be used.
Deployed to PROD
During the processing of many datasets, several failed in the `EVENTS_INTERPRETED_TO_INDEX` stage. The linked logs indicate that a distributed submission failed, but don't give much further info (example below). To aid diagnostics, especially when many datasets are being processed, it would be helpful if this could log some further information where possible; e.g. a job ID that could be used in the history server, whether the job wasn't accepted by YARN, or whether it timed out.
Aside: we might consider using Airflow (as an Oozie replacement) to launch all Spark jobs, as that would provide a history of failed jobs and links through the history server etc., similar to how downloads work. The pipeline coordinator would then hold the Airflow job ID.