astrorama / SourceXtractorPlusPlus

SourceXtractor++, the next generation SExtractor
https://astrorama.github.io/SourceXtractorPlusPlus/
GNU Lesser General Public License v3.0

SE++ doesn't always use full CPU power #580

Open · AstroAure opened this issue 1 month ago

AstroAure commented 1 month ago

Hi,

I'm working with @mShuntov on running SE++ on large (more than 10 000 x 10 000 px) images. To make the computations faster, we're using Amazon Web Services EC2, which offers scalable cloud computing[^1]. I've been benchmarking SE++ on different image sizes and different EC2 machines to find the machine that runs SE++ fastest. However, I see that the CPU usage rarely reaches 100%, even on large images that take SE++ many hours to process. I tried to modify the thread_count parameter, but beyond a certain point it didn't seem to help.

Here are some plots summarizing my benchmark, together with my (empirical) conclusions and a more detailed analysis:

[plot] This first one shows runs on small images (0.25 arcmin² = 450 sources and 1.0 arcmin² = 1570 sources). The metric I chose is the runtime in seconds per source per band (measurement image). One can see the plateau in thread_count here: beyond some point, increasing it brings no improvement in runtime. Another surprise is that 16 bands is much faster than 2, 4 or 8 bands. I don't understand where this big gap comes from!

[plot] This second plot shows unfinished runs of SE++ on bigger images (4.0 arcmin² = 6100 sources and 16.0 arcmin² = 27500 sources). The runtime is estimated from the time it took to reach a given percentage of the number of sources. Again, we see the tendency towards an optimal thread_count. Here the c6a.4xlarge machine is the bottleneck for such a big task, and we can see that 8 bands is roughly twice as fast as 16 bands because the CPU is running at 100%. We also see the effect of disabling hyper-threading, which lowers the CPU usage without sacrificing runtime.

Could we be enlightened about the thread_count parameter and why SE++ doesn't always seem to use the CPU to its full potential?

[^1]: I've written a tutorial with various bash scripts to make AWS EC2 easier to use with VS Code and Jupyter notebooks: https://github.com/AstroAure/VSJupytEC2

mkuemmel commented 1 month ago

In our experience it's no problem at all to get several threads to 100%. Here is the throughput as a function of the number of cores: [plot F06_f1] So up to 60 cores there is no problem.

But to get this you need to give more memory to the tile manager via the parameter --tile-memory-limit. The default is 500MB, which is way too small for your large images; with that you are heavily I/O limited.

I would start with the following settings:

Please let us know your results!
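
For reference, the flags discussed in this thread could be combined as in the minimal sketch below. The flag spellings follow what is quoted here (the thread-count option is written both as thread_count and --thread-count), so verify them with `sourcextractor++ --help` on your installed version; the config file name is just a placeholder.

```python
# Minimal launch sketch: run sourcextractor++ with a larger tile-manager budget.
# Flag spellings follow this thread; verify them with `sourcextractor++ --help`.
import subprocess

cmd = [
    "sourcextractor++",
    "--config-file", "sepp.config",      # placeholder configuration file
    "--tile-memory-limit", "16384",      # MB given to the tile manager (default is 500)
    "--tile-size", "2048",               # tile side length in pixels
    "--thread-count", "16",              # number of worker threads
]
subprocess.run(cmd, check=True)
```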

AstroAure commented 1 month ago

For the previous runs, I had --tile-memory-limit set to 131072, which is well over half the RAM. Could it be an issue to set it too high? I also had --tile-size at 131072 for plenty of margin. By the way, I read that SE++ can use storage for temporary files if the available RAM is not enough. In my experience, when I try to run SE++ on big images on a machine that doesn't have enough RAM, SE++ just crashes (I see the RAM usage rising and, when it reaches >90%, SE++ stops). Is there a way to avoid that?

For --thread_count, I often get faster runs when setting it to a value different from the number of cores I have. For example, on a machine with 32 vCPUs (16 cores, 2 threads per core), 8-16 looked like the sweet spot for the 0.25 and 1.0 arcmin² images, while 128 was faster for the 4.0 arcmin² one. Do you remember (even roughly) the number of sources in the images you used for your plot?

mkuemmel commented 1 month ago

The plot was done on single-epoch fitting of about 50k sources or so.

I don't think you can have too much --tile-memory-limit.

But the unit of --tile-size is [pix]. The image chunks that are stored in RAM are tile-size x tile-size large. A value of 131072 does not make sense; for your 10k images I would use 1k or 2k.
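
To put rough numbers on that, a back-of-the-envelope sketch (assuming 32-bit float pixels, which is an assumption here, and the memory limit being expressed in MB as above):

```python
# Back-of-the-envelope tile memory: how many tiles fit into --tile-memory-limit.
tile_size = 2048                           # --tile-size in pixels, as suggested above
mb_per_tile = tile_size**2 * 4 / 1024**2   # ~16 MB per tile per image plane (float32)
tile_memory_limit = 131072                 # MB, the value quoted earlier in the thread
print(f"{mb_per_tile:.0f} MB per tile -> room for ~{tile_memory_limit / mb_per_tile:.0f} tiles")
```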

mkuemmel commented 1 month ago

Another thing we found out just recently: the processing is faster if the objects that are fed to the fitting are ordered.

Now, if you do the model fitting with detection you get the ordering for free, since the detection works in a sequence over the image.

If you do model fitting without detection, I would order the objects by ra or dec (or x or y). This reduces the I/O, since consecutive objects sent to the fitting are usually close together and frequently on the same tile, which is then already in RAM, so frequent data loading is avoided.
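
A minimal pre-sorting sketch for that case, assuming an ASSOC input catalog stored as a FITS table with `ra` and `dec` columns (the file and column names are placeholders):

```python
# Sort an ASSOC catalog by position so that consecutive objects fall on nearby
# tiles, which are then more likely to already be in the tile cache.
from astropy.table import Table

assoc = Table.read("assoc_input.fits")             # placeholder file name
assoc.sort(["dec", "ra"])                          # or sort by x/y pixel coordinates
assoc.write("assoc_sorted.fits", overwrite=True)
```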

AstroAure commented 1 month ago

I've figured out why 16 bands was so much faster than 4 or 8... Out of the 16 bands, 2 were completely blank (only 0s in both the images and the weight maps) for the cutouts I considered. It would seem that SE++ doesn't perform model fitting when one of the measurement images is completely blank: all of the sources ended with fmf_stop_reason = 7 and the parameters of my model (Sersic profile) were all left at their initial values in the output catalog. Is this normal or known? Or is something wrong with my Python configuration file?
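
For reference, a quick way to flag such blank cutouts before building the measurement configuration (the file pattern and HDU index below are only examples):

```python
# Report measurement images whose pixel data are entirely zero (or missing),
# so they can be left out of the SE++ configuration.
import glob
import numpy as np
from astropy.io import fits

for path in sorted(glob.glob("cutouts/band_*.fits")):   # placeholder pattern
    with fits.open(path) as hdul:
        data = hdul[0].data                             # adjust the HDU index if needed
    if data is None or not np.any(data):                # all zeros, or no data at all
        print(f"blank image, consider dropping it: {path}")
```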

mkuemmel commented 1 month ago

First of all, do your CPUs go to 100% now?

mShuntov commented 1 month ago

Thanks @mkuemmel for this insight! I do manage to get a better allocation of the resources by implementing this advice!

Also, on my side I am using the ASSOC mode without detection, and the ordering makes so much sense. I do see a slight speed-up from doing the ordering, so thanks for that!

AstroAure commented 1 month ago

> First of all, do your CPUs go to 100% now?

Yes, by removing these blank images, it seems to work now! And your insights on the parameters were very helpful, thank you.

However, the CPU usage dies off for the last few percent of sources (the last 5-10%). This is the CPU usage for a full SE++ run (from 13:20 to 14:00; the maximum is 100%, and 90% of measured sources was reached at around 13:40): [plot]

mkuemmel commented 1 month ago

You have to keep in mind that each core processes one source (or source group) at a time. If there is a large source or a large source group at the end of the processing that takes much longer than a typical source, it keeps a single core busy while there is nothing left to occupy the others. So this tailing off is quite typical. If you want to confirm this, you can write out an unsorted catalog with a small chunk size, and the last entries should come from large objects that take very long to finish.

We have been discussing changing the processing order and doing the tough ones first, which is possible for the no-detection processing @mShuntov is doing. The flip side would likely be heavy RAM usage at the beginning, when tackling the large and complex sources, which has disadvantages for the Euclid case.
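
A minimal way to check this, assuming the run wrote a catalog in processing order and that it includes an isophotal-area column (the file and column names below are guesses, not necessarily the SE++ defaults):

```python
# Inspect the last rows of an unsorted output catalog: if the tail of the run is
# dominated by big objects, their area values should stand out here.
from astropy.table import Table

cat = Table.read("sepp_unsorted_catalog.fits")   # placeholder output catalog name
print(cat["area"][-20:])                         # placeholder column name for isophotal area
```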

mkuemmel commented 1 month ago

> I've figured out why 16 bands was so much faster than 4 or 8... Out of the 16 bands, 2 were completely blank (only 0s in both the images and the weight maps) for the cutouts I considered. It would seem that SE++ doesn't perform model fitting when one of the measurement images is completely blank: all of the sources ended with fmf_stop_reason = 7 and the parameters of my model (Sersic profile) were all left at their initial values in the output catalog. Is this normal or known? Or is something wrong with my Python configuration file?

Which pixels are discarded or not depends a lot on your setting of the weight-threshold. So it could well be that your zero pixels go into the fit and the minimization just does not work. You can check the stop reasons in the levmar documentation.
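
For reference, the stop-reason codes as listed in the levmar documentation, collected into a small lookup table (worth re-checking against the levmar version bundled with your SE++ build); code 7 is the one matching the blank-image symptom above:

```python
# levmar termination reasons (info[6]), transcribed from the levmar documentation.
LEVMAR_STOP_REASONS = {
    1: "stopped by small gradient J^T e",
    2: "stopped by small parameter update Dp",
    3: "stopped by the maximum number of iterations",
    4: "singular matrix; restart from current p with increased mu",
    5: "no further error reduction is possible; restart with increased mu",
    6: "stopped by small ||e||_2",
    7: "stopped by invalid (NaN or Inf) function values; a user/input error",
}

print(LEVMAR_STOP_REASONS[7])
```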