data-apis / scipy-2023-presentation

Materials for the SciPy 2023 Data APIs Consortium presentation and proceedings paper
MIT License

The timers used for GPU libraries are inaccurate #11

Closed · leofang closed this issue 1 year ago

leofang commented 1 year ago

Since we're using the CPU timer perf_counter() as the proxy (technically we should use CUDA events, but it's OK), we need to do a device-wide sync before and after the sandwiched code section; that is, a synchronization should also be inserted before line 29, which calls welch: https://github.com/data-apis/scipy-2023-presentation/blob/26db2c0d45c0715dc146f3124c05a17920b05a40/benchmarks/scipy_bench.py#L28-L33
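
For illustration, a minimal sketch of that pattern, using CuPy for concreteness (workload() is just a placeholder for the welch call, and the array size is arbitrary):

from time import perf_counter

import cupy as cp

def workload(x):
    # stand-in for the function under test (e.g. the welch call in the benchmark)
    return (x * x).mean()

x = cp.random.standard_normal(2**20)

cp.cuda.Device().synchronize()   # wait for any asynchronous input setup to finish
t1 = perf_counter()
result = workload(x)
cp.cuda.Device().synchronize()   # wait for the workload itself to finish on the GPU
t2 = perf_counter()
print(f"elapsed: {t2 - t1:.4f} s")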

leofang commented 1 year ago

The same applies to the other file: a sync should be inserted before line 43, which calls fit(): https://github.com/data-apis/scipy-2023-presentation/blob/26db2c0d45c0715dc146f3124c05a17920b05a40/benchmarks/scikit_learn_bench.py#L41-L45, in case the GPU is still busy (initializing the input data) while the CPU has already reached line 43.

leofang commented 1 year ago

Another note is that these benchmarks do not do any warm-up to stabilize the GPU state and prepare the array libraries. This is problematic because the first call is always the slowest (initializing the CUDA context, loading the CUDA libraries, JIT'ing the kernels if any, etc.).

Generally, we want something like this: https://github.com/NVIDIA/cuQuantum/blob/2cc68ede865d1f2028f28c9ad411f85748e20b58/benchmarks/cuquantum_benchmarks/_utils.py#L453-L498, in which the function is first run in a warm-up loop, and only then do we start the benchmark loop, measure the times, and report the average (and confirm the std is low).
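
For concreteness, a rough sketch of that structure (the warm-up and repeat counts are arbitrary, and workload() is a placeholder for the function under test):

import statistics
from time import perf_counter

import cupy as cp

def workload(x):
    # placeholder for the function being benchmarked
    return (x * x).mean()

sync = cp.cuda.Device().synchronize
x = cp.random.standard_normal(2**20)

for _ in range(3):        # warm-up: CUDA context init, library loading, JIT'ing, caches
    workload(x)
sync()

times = []
for _ in range(10):       # benchmark loop
    sync()
    t1 = perf_counter()
    workload(x)
    sync()
    times.append(perf_counter() - t1)

print(f"mean: {statistics.mean(times):.4f} s, std: {statistics.stdev(times):.4f} s")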

asmeurer commented 1 year ago

Regarding a warmup, yes, I noticed that the scikit-learn fit benchmark is considerably faster if you run it a second time in the same process (PyTorch CUDA goes from 2.4 seconds down to 0.3 seconds, while CuPy only goes from 1.1 seconds down to 0.97 seconds). For predict, re-running in the same process doesn't affect the timings, and I haven't tested welch yet.

The reason I am running the benchmarks in separate processes is that Tyler had mentioned this as an issue in his benchmarks: https://github.com/tylerjereddy/bench_welch_array_api/blob/main/bench.py#L35-L38. I also did it because I felt that running a "cold start" is more representative of a real-world application.

I'd love to hear your thoughts on whether this makes sense, and on what the standard practice is.

leofang commented 1 year ago

Tyler's comment is a bit confusing. I think that if the script focuses on one library at a time, it's OK to do a warm-up loop in the same process: there's no context switch, no destruction of the mempools, nothing. But we still want to launch one process per library, yes.

Cold-start data are frowned upon by CUDA ninjas, unfortunately 😅 Between CUDA initialization and launching the first kernel there are many unknowns (some of which are being or have been addressed by the CUDA team, such as lazy loading) that are irrelevant to the quality of a workload implementation, and we should exclude them so the numbers reflect the implementation itself. One can argue "hey, but CUDA takes 10 seconds to initialize on my laptop," and that is no doubt a real-world experience, but it's really irrelevant to our focus here (making an array-API-compatible implementation run on GPUs). TBH, even for CPU-only workloads a cold start is unfair, since the CPU frequency can adjust dynamically too.

asmeurer commented 1 year ago

So in my testing, I am already observing this just from running the timings in separate processes. For example, at https://github.com/data-apis/scipy-2023-presentation/blob/main/benchmarks/scipy_timings.csv you can see that the first torch_gpu, False run is slower than the rest, and the same for cupy, False.

The main thing that concerns me is the significant speedup of PyTorch on scikit-learn's LinearDiscriminantAnalysis.fit benchmark. Like I said, if you run it in a separate process it takes about 2.4 seconds, but if you re-run it in the same process, subsequent runs take 0.3 seconds. I'm not completely clear on why this is, but I suspect the torch JIT is doing something to make things faster. This doesn't happen with CuPy, and it doesn't happen with the other two benchmarks (LinearDiscriminantAnalysis.predict or welch).

asmeurer commented 1 year ago

By the way, can you confirm that the synchronize calls should both go within the perf_counter block (i.e., be included in the timings)?

leofang commented 1 year ago

So in my testing, I am already observing this just from running the timings in separate processes.

Can't speak for PyTorch, but it's kinda expected for CuPy, for two reasons:

  1. The CUDA driver would JIT & cache the kernel code on disk
  2. CuPy would do the same on its own (different JIT)

so for subsequent process launches some of the JIT overhead is avoided. This further justifies that we should really exclude such unnecessary noise.

The main thing that concerns me is the significant speedup of pytorch on scikit-learn's LinearDiscriminantAnalysis.fit benchmark.

Too good to be true? 😄 I don't know whether PyTorch optimizes for repeated workloads/graphs internally, or whether the PyTorch path somehow triggers another caching mechanism; if so, that would explain it. CuPy has no smart runtime or compute-graph system backing it.

We should fix the timer and retry.

By the way, can you confirm that the synchronize calls should both go within the perf_counter block (i.e., be included in the timings)?

It appears that I wasn't being accurate, sorry 😓 We should do

sync()         # make sure no prior GPU work (e.g. input setup) is still pending
t1 = timer()
...            # the target workload
sync()         # make sure the workload itself has finished on the GPU
t2 = timer()

so that between t1 & t2 only the target workload is being executed (and hence timed), since we're using the CPU timer as a proxy here.

You might be wondering why the benchmark helper from cuquantum-benchmarks that I linked above does it differently. It's because we don't use the CPU timer to time the GPU workload there (note there is no sync before the 2nd CPU timer call). The CPU timers are only used to measure the time spent on host, which due to asynchronicity could be drastically different from the GPU elapsed time (measured via CUDA events).
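
For reference, a minimal sketch of timing with CUDA events in CuPy (not something the benchmarks here need; the CPU timer with proper syncs is fine as a proxy):

import cupy as cp

x = cp.random.standard_normal(2**20)
start, end = cp.cuda.Event(), cp.cuda.Event()

start.record()       # enqueue a start marker on the current stream
y = (x * x).mean()   # the GPU workload being timed
end.record()         # enqueue an end marker
end.synchronize()    # block the CPU until the end marker has been reached

print(f"GPU time: {cp.cuda.get_elapsed_time(start, end):.3f} ms")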

asmeurer commented 1 year ago

Too good to be true? 😄

I honestly don't know. I don't have enough experience with GPUs to know whether this is reasonable, or to guess why it is happening. I don't really even know what sorts of things fit is doing under the hood (@thomasjpfan would know more about that).

You might be wondering why the benchmark helper from cuquantum-benchmarks that I linked above does it differently. It's because we don't use the CPU timer to time the GPU workload there (note there is no sync before the 2nd CPU timer call). The CPU timers are only used to measure the time spent on host, which due to asynchronicity could be drastically different from the GPU elapsed time (measured via CUDA events).

Is there a more accurate GPU timer that I should be using? Or is that even worth bothering with?

Note that aside from the first run, the timings appear to have pretty low variance.

asmeurer commented 1 year ago

I updated the benchmark script to run all benchmarks in the same process, sync before timing, and do a warm-up run that isn't included in the timings. The biggest difference from before is that PyTorch comes out much faster on the scikit-learn fit run, and a little faster on the SciPy welch benchmark. See https://github.com/data-apis/scipy-2023-presentation/blob/main/benchmarks/bench.py. Let me know if you have any additional concerns.

leofang commented 1 year ago

@asmeurer LGTM! Note that the NumPy results also get slightly better if you compare the perf data before/after the changes (I assume the benchmarks were done on the same machine) 🙂

leofang commented 1 year ago

btw has the plot in the paper been updated?

asmeurer commented 1 year ago

Yes, I updated the plot in the paper.

leofang commented 1 year ago

FYI, both PyTorch and CuPy have noted the importance of proper synchronization and warm-up runs.

In particular, PyTorch's Timer utility essentially does what we do here too (link).
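
For example, a minimal sketch with torch.utils.benchmark.Timer (as far as I know it handles the CUDA synchronization and a warm-up run internally; the matmul is just a placeholder workload):

import torch
from torch.utils import benchmark

x = torch.randn(4096, 4096, device="cuda")

timer = benchmark.Timer(
    stmt="x @ x",         # the workload to time (a placeholder matmul)
    globals={"x": x},
)
print(timer.timeit(100))  # runs the statement repeatedly and reports timing statistics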