[BUG] Qualification estimate is not generated when `SparkListenerApplicationStart` is missing from the eventlog

kuhushukla commented 2 weeks ago

Describe the bug

Incomplete eventlogs are frequently seen in Databricks environments during qualification runs. During a debug session, @parthosa mentioned that the qual tool cannot proceed if the app start event is missing from a given set of eventlogs. That makes sense, since without it we cannot clearly estimate the duration of the run.

In Databricks notebooks, the same app context is often used for multiple runs of a notebook, so there is a high probability that the final consolidated eventlogs for a given job run will not contain the application start event, which is needed to calculate the duration of the SQLs in question. The qual tool currently estimates the application end time based on job- and SQL-level events. We might want to consider doing something similar for start times (based on the lowest job or SQL start time epoch value) to reduce the chances of having no qual results for these hot runs. This issue can be used to discuss whether a start estimate, like the end time estimate, is a good enough way to generate outputs, or whether another approach may help us work around this persistent issue.
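To make the proposal concrete, here is a minimal sketch of the fallback, assuming the job and SQL start times have already been extracted from the eventlog. The `StartEvent`, `JobStart`, `SqlStart`, and `StartTimeEstimator` names are hypothetical stand-ins for illustration; they are not the Spark listener classes or the qual tool's internal types.

```scala
// Hypothetical, simplified stand-ins for start events parsed from an eventlog.
// These are NOT the Spark or spark-rapids-tools classes.
sealed trait StartEvent { def time: Long }
final case class JobStart(jobId: Int, time: Long) extends StartEvent
final case class SqlStart(executionId: Long, time: Long) extends StartEvent

object StartTimeEstimator {
  /** Returns the reported application start time when it is present.
    * Otherwise falls back to the earliest job or SQL start time (epoch ms),
    * or None when there are no events to estimate from.
    */
  def estimateAppStart(reported: Option[Long], events: Seq[StartEvent]): Option[Long] =
    reported.orElse(events.map(_.time).reduceOption(_ min _))
}
```

With no reported start time, a job started at `50` and a SQL execution started at `20`, the sketch returns `Some(20)`. Such an estimate can only be later than or equal to the real start, because the application must already be running when its first job or SQL execution begins, so any duration derived from it would be a lower bound.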

Steps/Code to reproduce bug

Run a Databricks notebook with a live app context, then download the eventlog.

Expected behavior

The qual tool should output valid durations for runs like these, where the start time can be estimated in other ways.

Environment details (please complete the following information)

Databricks

Additional context

We frequently receive customer logs that follow this pattern.

tgravescs commented 2 weeks ago

The problem then is how you qualify the entire app when you don't have data for the entire app. We have the per-SQL qualification, which is perhaps interesting in this case? I'm not sure if it works out of the box without a start time, but that would be more doable.

The interactive notebook case, where users are interactively submitting things, doesn't seem like a good use case for the qualification tool.

amahussein commented 2 weeks ago

ApplicationStartEvent has other important information too:

case class SparkListenerApplicationStart(
    appName: String,
    appId: Option[String],
    time: Long,
    sparkUser: String,
    appAttemptId: Option[String],
    driverLogs: Option[Map[String, String]] = None,
    driverAttributes: Option[Map[String, String]] = None)

kuhushukla commented 2 weeks ago

> We have the per sql based qualification that perhaps is interesting in this case? I'm not sure if it works out of box without start time but that would be more doable.

Agree.

kuhushukla commented 2 weeks ago

> The interactive notebook case where users are interactively submitting things doesn't seem like a good usecase for the qualification tool.

The ATT case is not that of an interactive notebook AFAIK and yet it hits this.

kuhushukla commented 2 weeks ago

> ApplicationStartEvent has other imp. information too

If this information is not readily available from other events, it certainly blocks us from doing anything about this issue. I don't know if that is the case. I do see the app ID elsewhere, but I am not so sure about the others.

NVIDIA / spark-rapids-tools

[BUG] Qualification estimate is not generated when `SparkListenerApplicationStart` is missing from the eventlog #1112