NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

clip appDuration to at least Duration #1096

Closed leewyang closed 3 weeks ago

leewyang commented 3 weeks ago

This PR fixes possible negative speedup predictions by clipping appDuration to be no less than Duration. This only affects a corner case.

Changes

  1. Clip the value of appDuration to be no less than Duration (sketched below).
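
For illustration, a minimal sketch of the clip, assuming the per-app metrics live in a pandas DataFrame with appDuration and Duration columns (the actual column names and code path in qualx may differ):

```python
import pandas as pd

# Hypothetical data: appDuration is the wall-clock app duration and
# Duration is the sum of per-sqlID SQL durations (both in ms). The
# second row shows the corner case where Duration exceeds appDuration.
df = pd.DataFrame({
    "appDuration": [100_000, 40_000],
    "Duration": [60_000, 55_000],
})

# Clip appDuration to be no less than Duration, so the downstream
# (appDuration - Duration) term can never go negative.
df["appDuration"] = df["appDuration"].clip(lower=df["Duration"])
```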

Test

The following commands have been tested:

Internal Usage:

python qualx_main.py predict
python qualx_main.py evaluate
tgravescs commented 3 weeks ago

What is Duration here? SQL duration, job duration, or something else?

Internally, if we don't see the application end event, we set appDuration to the last timestamp we see, either the last job end time or the last SQL query end time.

leewyang commented 3 weeks ago

appDuration is the application duration obtained from the profiler CSV files. Duration is the sum of SQL durations.

This is admittedly a weird mix of wall-clock vs. task durations, but I'm not aware of a better way to estimate the GPU appDuration (this was the formula from Bao's original notebooks).

Note: this is intended to guard against any cases where appDuration < Duration for the CPU, which can lead to negative speedup values. We have never seen this condition in any of the training datasets, but per @amahussein, it might happen in the wild. With this code, the results are unchanged for the normal cases where appDuration > Duration; in the worst case it only clips (appDuration - Duration) to zero to avoid negative speedup values. Note that this just models the non-SQL portion of appDuration as zero (vs. negative) in those cases.
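
To make the corner case concrete, here is a small numeric sketch (plain Python with illustrative numbers; the variable names follow the formula quoted later in this thread, not the actual qualx code):

```python
# Corner case: overlapping SQL queries inflate the summed SQL Duration
# past the wall-clock app duration (appDuration < Duration).
cpu_appDuration = 40_000   # wall-clock app duration (ms)
cpuDuration = 55_000       # sum of SQL durations (ms)
gpuDuration = 11_000       # predicted GPU SQL duration (ms)

# Without the clip, the predicted GPU app duration goes negative:
gpu_appDuration = cpu_appDuration - cpuDuration + gpuDuration   # -4_000
speedup = cpu_appDuration / gpu_appDuration                     # -10.0

# With the clip, appDuration is raised to Duration, so the non-SQL
# portion is modeled as zero and the speedup stays positive:
clipped = max(cpu_appDuration, cpuDuration)                     # 55_000
gpu_appDuration = clipped - cpuDuration + gpuDuration           # 11_000
speedup = clipped / gpu_appDuration                             # 5.0
```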

tgravescs commented 3 weeks ago

Right, we actually fixed this bug in the qualification tool for its internal speedup calculations. This definitely can happen; we saw event logs where it did. It happens when you have overlapping SQL queries and jobs.

The approach we took was to apply ratios from the task times to the wall-clock times to get a better approximation of the wall clock while not exceeding the original duration of the entire application.

So, for instance, let's say 50% of the task durations were supported DF operations: if the overall wall-clock app duration was 10 minutes, then we estimated that 5 minutes of that was supported DF operations. But I'm not sure how all of that applies to the features and models. Perhaps we need to think about using non-wall-clock times in the future.
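
As a worked version of that example (illustrative numbers and names, not the qualification tool's actual code):

```python
# Fraction of task time spent in supported DataFrame operations.
supported_task_time = 300_000   # ms of task time in supported DF ops
total_task_time = 600_000       # ms of total task time
supported_ratio = supported_task_time / total_task_time   # 0.5

# Scale the wall-clock app duration by that task-time ratio to estimate
# the wall-clock time attributable to supported DF operations, capped
# at the original app duration.
app_duration = 10 * 60 * 1000                                   # 10 min (ms)
supported_wallclock = min(supported_ratio * app_duration,
                          app_duration)                         # 5 min (ms)
```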

leewyang commented 3 weeks ago

So, fundamentally, we have this formula (where the gpu numbers are predicted/estimated):

gpu_appDuration = cpu_appDuration - cpuDuration + gpuDuration

The xgboost model primarily works at the sqlID level, predicting gpuDuration per sqlID from cpuDuration along with other per-sqlID metrics, and then aggregating all the per-sqlID Durations to produce the final per-appID sums.
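
Roughly, that aggregation step looks like the following sketch (pandas, with assumed column names; the real qualx schema and prediction code differ):

```python
import pandas as pd

# Hypothetical per-sqlID results: cpuDuration is observed, gpuDuration
# is the per-sqlID model prediction (both in ms).
preds = pd.DataFrame({
    "appId": ["app-1", "app-1", "app-2"],
    "sqlID": [1, 2, 1],
    "cpuDuration": [30_000, 25_000, 50_000],
    "gpuDuration": [6_000, 5_000, 12_500],
})

# Aggregate per-sqlID durations into per-appID sums, then apply the
# app-level formula above.
per_app = preds.groupby("appId")[["cpuDuration", "gpuDuration"]].sum()
app_durations = pd.Series({"app-1": 120_000, "app-2": 90_000})
per_app["gpu_appDuration"] = (
    app_durations - per_app["cpuDuration"] + per_app["gpuDuration"]
)
```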

From what I understand, we would need to apply some ratio to both cpuDuration and gpuDuration to account for overlapping sqlIDs/jobs with something like this:

gpu_appDuration = cpu_appDuration - overlap_ratio * (cpuDuration - gpuDuration)

You mentioned using the percentage of task durations in supported DF operations to determine the ratio. Presumably, this is still just an estimate of the level of overlap/parallelism. So if the overlap_ratio is still not accurate, there would still be a (smaller) chance of producing a negative value (e.g. if cpu_appDuration is small and gpuDuration << cpuDuration).

Regardless, if the overlap_ratio is easy to obtain from the existing profiler logs, then we can modify this formula. Otherwise, we may want to just go with this patch as a short-term fix while we try to get the overlap_ratio figured out.
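
Putting the two together, a hedged sketch of what the modified formula plus this PR's clip might look like (overlap_ratio is a placeholder input, not a metric the profiler emits today):

```python
def predict_gpu_app_duration(cpu_app_duration, cpu_duration, gpu_duration,
                             overlap_ratio=1.0):
    # Short-term guard from this PR: clip appDuration to at least
    # Duration, modeling the non-SQL portion as zero rather than negative.
    cpu_app_duration = max(cpu_app_duration, cpu_duration)
    # Overlap-adjusted formula from the discussion above; overlap_ratio=1.0
    # reproduces the original formula.
    return cpu_app_duration - overlap_ratio * (cpu_duration - gpu_duration)
```

With overlap_ratio <= 1 and the clip applied first, the result stays non-negative even when gpuDuration << cpuDuration.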

tgravescs commented 3 weeks ago

Yeah, I'm definitely fine with this short term; longer term we may want to look at whether this is the right approach.