cupy / cupy-performance


Fixed benchmarks #6

Closed MrRobot2211 closed 3 years ago

MrRobot2211 commented 3 years ago

Hi, I am currently working on using CuPy as a backend for QuTiP as part of my GSoC project. I am trying out different ways of benchmarking CPU and GPU times.

I found an error here: all GPU timings for an iteration are crammed together into a single row.


xp | backend | name | key | time | dev | run
-- | -- | -- | -- | -- | -- | --
cupy | cupy-cpu | time_raw | 10 | 8.210015948861837e-06 | cpu | 0
cupy | cupy-cpu | time_raw | 10 | 9.069975931197405e-06 | cpu | 1
cupy | cupy-cpu | time_raw | 10 | 8.02002614364028e-06 | cpu | 2
cupy | cupy-cpu | time_raw | 10 | 7.3499977588653564e-06 | cpu | 3
cupy | cupy-cpu | time_raw | 10 | 7.270020432770252e-06 | cpu | 4
cupy | cupy-cpu | time_raw | 10 | 7.279973942786455e-06 | cpu | 5
cupy | cupy-cpu | time_raw | 10 | 7.320020813494921e-06 | cpu | 6
cupy | cupy-cpu | time_raw | 10 | 7.190043106675148e-06 | cpu | 7
cupy | cupy-cpu | time_raw | 10 | 7.229973562061787e-06 | cpu | 8
cupy | cupy-cpu | time_raw | 10 | 7.819035090506077e-06 | cpu | 9
cupy | cupy-gpu | time_raw | 10 | [1.55520001e-05 1.44959996e-05 1.25439996e-05 ... | gpu | 0
cupy | cupy-cpu | time_raw | 100 | 7.3499977588653564e-06 | cpu | 0
cupy | cupy-cpu | time_raw | 100 | 8.159957360476255e-06 | cpu | 1
cupy | cupy-cpu | time_raw | 100 | 7.179973181337118e-06 | cpu | 2
cupy | cupy-cpu | time_raw | 100 | 7.2499969974160194e-06 | cpu | 3
cupy | cupy-cpu | time_raw | 100 | 7.0400419645011425e-06 | cpu | 4
cupy | cupy-cpu | time_raw | 100 | 6.990041583776474e-06 | cpu | 5
cupy | cupy-cpu | time_raw | 100 | 7.070018909871578e-06 | cpu | 6
cupy | cupy-cpu | time_raw | 100 | 6.820017006248236e-06 | cpu | 7
cupy | cupy-cpu | time_raw | 100 | 6.979971658438444e-06 | cpu | 8
cupy | cupy-cpu | time_raw | 100 | 7.030030246824026e-06 | cpu | 9
cupy | cupy-gpu | time_raw | 100 | [1.22880004e-05 1.22880004e-05 1.08160004e-05 ... | gpu | 0

This PR applies a quick fix by calling explode on the DataFrame.

Is this package currently being maintained, or are you using other tools?

@leofang, maybe you know something?

leofang commented 3 years ago

Hi @MrRobot2211 Thanks for pinging. Could you share a reproducer for the error that you encountered? I am having trouble understanding what's going on.

MrRobot2211 commented 3 years ago

If you run python prof.py -p -c benchmarks/bench_raw.py without the fix, the plot fails. If you look at the resulting CSV, you will see that each GPU time entry is the string form of an array of times instead of a single time: cupy | cupy-gpu | time_raw | 10 | [1.55520001e-05 1.44959996e-05 1.25439996e-05 ... | gpu | 0. Internally, this happens because the GPU timings are stored in the DataFrame as one array, and seaborn does not know what to do with it when it is passed for plotting.
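To illustrate the quick fix described above, here is a minimal sketch of what calling explode does to a frame where one cell holds a whole list of timings. The column names follow the CSV shown earlier, but the tiny DataFrame and its values are made up for illustration and are not taken from the benchmark code. It assumes pandas 1.1 or newer for the ignore_index argument.

```python
import pandas as pd

# Hypothetical miniature of the benchmark table: each CPU row holds one
# float, while the single GPU row holds a whole list of timings.
df = pd.DataFrame({
    "backend": ["cupy-cpu", "cupy-cpu", "cupy-gpu"],
    "key": [10, 10, 10],
    "time": [8.2e-06, 9.1e-06, [1.5e-05, 1.4e-05, 1.2e-05]],
    "run": [0, 1, 0],
})

# explode() gives every element of a list-like cell its own row and
# repeats the other columns; scalar cells pass through unchanged.
flat = df.explode("time", ignore_index=True)

# The exploded column still has object dtype, so convert it to float
# before handing it to a plotting library.
flat["time"] = flat["time"].astype(float)

# The repeated GPU rows all inherited run == 0, so renumber the runs
# within each backend.
flat["run"] = flat.groupby("backend").cumcount()

print(flat)
```

After this step every row carries exactly one scalar timing, which is the shape a plotting call expects.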

leofang commented 3 years ago

@MrRobot2211 Thanks, I see what you meant now. I encountered a pandas.core.base.DataError when running the benchmark, but didn't realize your output was taken from the CSV file (written with the -c flag).

@emcastillo I think a better fix is to "flatten" the benchmark result:

diff --git a/cupy_prof/measure.py b/cupy_prof/measure.py
index d58531b..131a69f 100644
--- a/cupy_prof/measure.py
+++ b/cupy_prof/measure.py
@@ -12,7 +12,11 @@ class Measure(object):

     def capture(self, name, key, times, xp_name):
         for dev in times:
-            for i, time in enumerate(times[dev]):
+            if dev == 'gpu':
+                dev_times = times[dev][0]
+            else:  # dev == 'cpu'
+                dev_times = times[dev]
+            for i, time in enumerate(dev_times):
                 self.df['xp'].append(xp_name)
                 self.df['backend'].append('{}-{}'.format(
                                           xp_name, dev))

The reason is that cupyx.time.repeat collects results for each device. If its device argument is not set, as is the case here https://github.com/cupy/cupy-performance/blob/cdcb205a07d228d6e0ac583bc1510f632dd768e9/cupy_prof/runner.py#L53-L54, the current device is used, so the GPU array in this case has shape (1, 10) instead of (10,) as on CPU. @MrRobot2211 Could you apply my patch instead? Thanks.
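The shape mismatch described above can be sketched in plain Python. The snippet below is only an illustration: the flatten_times helper, the times dictionary, and its values are hypothetical stand-ins that mimic the nesting described here (one extra leading axis for the GPU entry), not the actual objects produced by the benchmark runner.

```python
def flatten_times(times):
    """Yield (dev, run, time) tuples with one scalar timing per tuple."""
    for dev, dev_times in times.items():
        # Assumption mirroring the patch above: GPU timings carry an extra
        # leading per-device axis, so with a single device take row 0.
        if dev == "gpu":
            dev_times = dev_times[0]
        for run, t in enumerate(dev_times):
            yield dev, run, t


# Made-up timings: the CPU entry is a flat list, the GPU entry is nested.
times = {
    "cpu": [8.2e-06, 9.1e-06, 8.0e-06],
    "gpu": [[1.5e-05, 1.4e-05, 1.2e-05]],
}

# Without the extra indexing, enumerating the GPU entry would produce a
# single "run" whose time is the whole inner list.
naive_gpu_rows = list(enumerate(times["gpu"]))

rows = list(flatten_times(times))
for row in rows:
    print(row)
```

Indexing only the first row is enough when a single device is measured; timings from several devices would need a loop over that leading axis as well.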

emcastillo commented 3 years ago

Thanks! will look at it during the day :)

leofang commented 3 years ago

Thanks, @MrRobot2211 @emcastillo!

emcastillo commented 3 years ago

great catch guys! thanks a lot!