Question on benchmarking gordon workload

ArgonneCPAC / gordon

BSD 3-Clause "New" or "Revised" License

Question on benchmarking gordon workload #3

Open jczaja opened 1 year ago

jczaja commented 1 year ago

Hi,

My name is Jacek Czaja and I'm one of the Intel engineers helping JAX projects run efficiently on Intel devices. I was given a link to this repo (among others) so that its content can be enabled on JAX with Intel GPUs and optimized for performance. I was able to run the unit tests on an Intel GPU, but I'm struggling to benchmark the gordon functionality.

What do I need? I would like to benchmark the functionality of this repository on Intel hardware. To do so, I need a representative example of gordon usage (something close to a real use case that should be optimized) that runs for at least a minute, so I can look for bottlenecks and try to optimize them. Please point me to such an example.

beckermr commented 1 year ago

I am so sorry I missed this!

To make this into something useful, set the env var GORDON_NM to something big before running. The default is 500 so you'll want something a lot bigger. Once you do that, run the test suite and it should take a while.
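A minimal sketch of that suggestion in Python, for anyone scripting the run. Only the variable name GORDON_NM and its default of 500 come from this thread. The helper names, the use of pytest as the test runner, and the example value in the usage note are assumptions.

```python
import os
import subprocess
import sys


def env_with_gordon_nm(n_m, base_env=None):
    """Return a copy of the environment with GORDON_NM set to n_m."""
    # Copy so the calling process's own environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["GORDON_NM"] = str(int(n_m))
    return env


def run_test_suite(n_m, test_target="."):
    """Run the tests in a child process with an enlarged GORDON_NM.

    Assumption: the suite is collected by pytest; adjust the command
    if the repository uses a different runner.
    """
    cmd = [sys.executable, "-m", "pytest", test_target]
    return subprocess.run(cmd, env=env_with_gordon_nm(n_m)).returncode
```

Calling `run_test_suite(100_000)` from the repository root would then run the suite with a larger problem size. The value 100_000 is arbitrary and only illustrates "a lot bigger than 500".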

jczaja commented 1 year ago

@beckermr Hi, thanks for the helpful hint. I'm just resuming this task. I was told that the functionality exercised by those unit tests will be part of a bigger workload. Do you know which of those three elements (in test_gordon.py) takes the most execution time in that bigger workload? That information would help me prioritize the optimization effort.

Anyway, as soon as we have something improved, we will let you know. Thanks.

beckermr commented 1 year ago

I do not have an estimate for this.

jczaja commented 1 year ago

I have started profiling those unit tests, but the majority of the time is spent on a single CPU thread rather than on the XPU (Ponte Vecchio). So I'm looking into whether the Python code could easily be run on the XPU (via JAX) instead of using regular NumPy (CPU). PR #4 will shorten the test initialization time so I can more easily see the performance of other areas of the code. Please review.
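For readers who want to reproduce this kind of host-side measurement, here is a minimal sketch using the standard-library cProfile module. The function `slow_setup` is a hypothetical stand-in for a CPU-bound initialization step and is not code from this repository. Note that cProfile only accounts for time spent in Python on the host, so it shows where the CPU-side time goes but says nothing about device kernels.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


def slow_setup(n):
    # Hypothetical CPU-bound work standing in for test initialization.
    return sum(i * i for i in range(n))
```

Printing the report from `profile_call(slow_setup, 1_000_000)` lists the functions that dominate host time, which is one way to spot initialization code worth moving off the CPU.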

beckermr commented 1 year ago

Done and thank you!

jczaja commented 1 year ago

@beckermr What is the typical GORDON_NM value used in the target workload? I'm asking because I should be optimizing performance on something as close to your target model as possible.

beckermr commented 1 year ago

As big as we can without overflowing the device memory.
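One way to turn "as big as we can" into a concrete number on a given device is to search for the largest size that still runs. The sketch below is only an illustration: `fits` is a hypothetical callable supplied by the user (for example, one that runs the workload at size n and returns False on an out-of-memory failure), and the search assumes that if a size fails, every larger size fails too.

```python
from typing import Callable


def largest_fitting_size(fits: Callable[[int], bool],
                         start: int = 500,
                         limit: int = 1 << 40) -> int:
    """Return the largest n in [start, limit] for which fits(n) is True.

    Assumes fits is monotonic: once a size fails, all larger sizes fail.
    """
    if not fits(start):
        raise ValueError("the starting size does not fit")
    lo, hi = start, start * 2  # lo is always a size known to fit
    # Doubling phase: grow until a size fails or the limit is passed.
    while hi <= limit and fits(hi):
        lo, hi = hi, hi * 2
    # hi is now either a failing size or an exclusive bound past the limit.
    hi = min(hi, limit + 1)
    # Bisection phase: narrow the gap between fitting and failing sizes.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

The default `start` of 500 mirrors the default GORDON_NM mentioned earlier in the thread. The `limit` default is an arbitrary safety cap.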