TheFl0w commented 5 years ago

We are currently working on benchmarks to determine how fast we can process JUNGFRAU data. We have constructed two use-cases of our software that we are going to benchmark with different parameter sets. We expect processing on GPUs to be quite fast, making our program throughput-bound while processing on CPUs might actually be compute-bound. That's why we want to run the benchmarks on a set of different architectures. Runtime will be measured in total, and per kernel for CPUs. Do you think the use-cases I listed below are appropriate? Are there any specific parameters you would like to see varied in benchmarks?

Architectures

CPU single-threaded baseline: Haswell CPU with TBB, OMP: Haswell, Power9, Threadripper CUDA: P100, V100 (with 1, 2, 4 GPUs) HIP/AMD: AMD Vega64

Use-Cases

Photoncounter

algorithm:
- pedestal calibration
- convert detector data to number of photons for all frames
- summation over N frames
- download results of summation
parameter space:
- input data sets: 1) only gain stage G0 (force pedestal update every time) 2) only gain stage G1 and G2 (no pedestal updates) 3) real data set for JUNGFRAU
- summation window size: 10, 100, 200, 500

Cluster Finder Algorithm

algorithm:
- pedestal calibration
- run cluster finder on all input frames
- download clusters
parameter space:
- cluster sizes: 2, 3, 7, 11
- input data sets: 1) no cluster centers 2) 2-4% of pixels are cluster centers 3) 8% of pixels are cluster centers

Cheers, Florian

sredford commented 5 years ago

Hi Florian,

Commenting only on the photoncounter section:

can you test the difference in speed with the pedestal tracking enabled and without? In some cases it's not necessary
is the pixel mask calculation contained within the point on 'pedestal calculation'?
in my experience 2-100 frames summed would be realistic. Test 2, 10, 20 and 100? @lopez-c may have an alternative suggestion here.

Cheers, Sophie

kloppstock commented 5 years ago

Hello Sophie,

thank you for your feedback! Since we plan to benchmark our code with a data set containing only values in gain stage G1 and G2, this should be equivalent to disabling the pedestal updates. However, we will definitely consider benchmarking the real JUNGFRAU data set with disabled pedestal updates.

The pedestal calculation already includes the mask generation. For the algorithms in the main part it is then possible to choose whether to use this mask.

Cheers, Jonas

sredford commented 5 years ago

Hi Jonas,

I need to better understand your points under photon counter input datasets.

All our datasets are 'real', we do not simulate. So we have to try and find real data matching the conditions that you need.

a) For a dataset in G0 only I could supply you with a low intensity dataset where there are only single photon hits. This would mean the pedestal updating in the empty pixels, but not when the photon hits occur as that would be above threshold. If you really want to ensure that the pedestal updates in every frame, then a dark / pedestal dataset should be used, but then there will be no photon counts for the rest of the algorithm to process.

b) It would be unusual in an experimental setup to illuminate the whole detector surface with so much intensity that all pixels switch to G1 or G2 in every frame. I don't recall having any dataset like this to give you!

Instead of using a and b to test the effect of pedestal tracking, wouldn't it be better to test on the same dataset with and without tracking switched on? That way we would be sure that the only difference in performance comes from the tracking, and not from other effects like occupancy or gain switching.

c) for 'real', do you mean a dataset in which gain switching takes place? Is the first dataset we sent you sufficient?

We could talk on the phone to clarify?

Cheers, Sophie

kloppstock commented 5 years ago

Hello Sophie,

We plan to use you data set in addition to two synthetic data sets generated by us to see how our code performs in extreme cases. For the photoncounter part this would be having all values in the G0 gain stage or having none in this stage. If you have any more data sets you want to provide, we will integrate these into our benchmarks.

If you have more questions you can call Heide or comment on this issue.

Cheers, Jonas

lopez-c commented 5 years ago

Hi Florian, Jonas,

I think that with the set of use cases you suggest is fine. We could also decide whether we want to benchmark the application also with a set of masked pixels or not. Perhaps, that's also something to consider.

One thing that maybe you have already considered is the size of the blocks you want to process together (number of frames per block)

Another thing is the choice of summation factors to benchmark that feature. I second Sophie's proposal of testing the values 2,10,20 and 100. From the feedback we got, it seems that low summation factors (2-10) are preferred by the users, whereas values greater than 100 do not seem to suit any application.

Regarding the benchmark, I would like to know the scope of your benchmark. Would it also include the data transfers or only the pure processing/computation?

As for the datasets, let us know what datasets and how much data you need and we can try to get it for you. It can happen that we do not have the datasets you would need (as in the case of the only G1 & G2 datasets for Jungfrau) but we can see what is possible. For instance, I am not sure if we can get a dataset fro the cluster finder where 8% of the pixels are cluster centres, since in those applications using the cluster photon finder the photon-hit rate is very low. But maybe your idea is to simulate those conditions.

Cheers, Carlos

kloppstock commented 5 years ago

Hi Carlos,

It should not make a big difference if a pixel is masked or not because other than being stored the calculations are the same for the non-masked pixels. However, we will take this into consideration.

We will definitely benchmark the number of images we upload to the GPU. Somehow we forgot to mention this in this issue.

The benchmark will include everything needed to process a certain number of frames residing in the CPU memory. This includes the transfer times as well.

We will simulate some extreme cases like the 8% cluster centers and the data sets only in G0 or G1/G2. If you have some data sets for more edge cases, like a high cluster center count, we would like to include them in our benchmark.

Cheers, Jonas

ComputationalRadiationPhysics / jungfrau-photoncounter

Benchmarks #53

Architectures

Use-Cases

Photoncounter

Cluster Finder Algorithm