AstroAure opened this issue 1 month ago
In our experience it's no problem at all to get several threads to 100%. Here is the throughput as a function of the # of cores:

So up to 60 cores there is no problem. But to get this you need to give more memory to the tile manager via the parameter `--tile-memory-limit`. The default is 500MB, which is way too small for your large images. Then you are heavily I/O limited.
I would start with the following settings:

- set `--tile-memory-limit` to half of the available RAM, the other half is for overhead;
- set `--thread_count` to the number of cores, this should be fine at least until 32 cores or similar;
- increase `--tile-size` from its default (256) to 1k or similar.

Please let us know your results!
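The rules of thumb above can be written down as a small helper. This is a hypothetical sketch, not part of SE++ itself; the parameter names are taken from this thread (check your SE++ version for the exact spelling), and the "half of RAM" split follows the advice above:

```python
# Illustrative helper: derive suggested SE++ settings from the machine specs.
# Parameter names follow this discussion; values follow the advice above.
def suggested_settings(total_ram_mb, n_cores):
    return {
        "tile-memory-limit": total_ram_mb // 2,  # half of RAM; the other half is overhead
        "thread_count": n_cores,                 # fine at least up to ~32 cores
        "tile-size": 1024,                       # up from the 256-pixel default
    }

# Example: a 64 GB, 32-core machine.
suggested_settings(total_ram_mb=65536, n_cores=32)
# → {'tile-memory-limit': 32768, 'thread_count': 32, 'tile-size': 1024}
```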
For the previous runs, I had `--tile-memory-limit` set at 131072, which is way over half the RAM. Could it be an issue to set it too high? I also had `--tile-size` at 131072 for plenty of margin. By the way, I read that SE++ can use storage for temporary files if the available RAM is not enough. From my experience, when I try to run SE++ on big images on a machine that doesn't have enough RAM, SE++ just crashes (I see the RAM usage rising, and when it reaches >90%, SE++ stops). Is there a way to avoid that?

For `--thread_count`, I often have faster runs when setting it to a different value than the number of cores I have. For example, on a machine with 32 vCPUs (16 cores, 2 threads per core), 8-16 looked like the sweet spot for the 0.25 and 1.0 arcmin² images, and setting it to 128 was faster for the 4.0 arcmin² ones. Do you remember (even roughly) the number of sources in the images you used for your plot?
The plot was done on single-epoch fitting of about 50k sources or so.
I don't think `--tile-memory-limit` can be too high.

But the unit of `--tile-size` is [pix]. The image chunks that are stored in RAM are `tile-size x tile-size` large. A value of 131072 does not make sense; for your 10k images I would use 1k or 2k.
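A back-of-the-envelope calculation shows why 131072 is far too large a tile size. Assuming 4-byte (float32) pixels, which is a common case but an assumption here:

```python
# Approximate memory footprint of one tile, assuming 4 bytes per pixel.
def tile_mb(tile_size, bytes_per_pixel=4):
    return tile_size * tile_size * bytes_per_pixel / 2**20  # bytes → MB

tile_mb(256)     # default tile: 0.25 MB
tile_mb(1024)    # 4.0 MB — many such tiles fit in a reasonable --tile-memory-limit
tile_mb(131072)  # 65536.0 MB, i.e. 64 GB for a single tile
```

A 131072-pixel tile would also be far larger than the 10k-pixel images themselves, so the tile manager could never keep even one tile in RAM.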
Another thing we found out just recently: the processing is faster if the objects that are fed to the fitting are ordered.
If you do the model fitting with detection, you get the ordering for free, since the detection works in sequence over the image.
If you do model fitting without detection, I would order the objects according to ra or dec or x or y. This reduces the I/O, since consecutive objects sent to the fitting are usually close together and frequently on the same tile, which is already in RAM, avoiding frequent data loading.
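The ordering step above could be sketched as follows. This is a minimal illustration, not SE++ code: the function name and the tile-row grouping are assumptions, and in practice you would sort your ASSOC input catalog before writing it out:

```python
# Hypothetical sketch: order sources so that consecutive ones tend to land
# on the same tile. Grouping by tile row first, then by x, walks the image
# tile by tile instead of jumping randomly across it.
def sort_for_tile_locality(sources, tile_size=1024):
    return sorted(sources, key=lambda s: (s["y"] // tile_size, s["x"]))

catalog = [
    {"id": 1, "x": 9000, "y": 50},
    {"id": 2, "x": 10,   "y": 20},
    {"id": 3, "x": 500,  "y": 3000},
]
ordered = sort_for_tile_locality(catalog)
# ids 1 and 2 (same tile row) are now adjacent; id 3 comes last.
```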
I've figured out why 16 bands was so much faster than 4 or 8... Out of the 16 bands, 2 were completely blank (only 0s in the images and weight maps) for the cutouts I considered. And it would seem that SE++ doesn't perform model fitting when one of the measurement images is completely blank. All of the sources ended with `fmf_stop_reason = 7`, and the parameters of my model (Sersic profile) were all set to their initial values in the output catalog.
Is this normal or known? Or is it something wrong with my Python configuration file?
First of all, do your CPUs go to 100% now?
Thanks @mkuemmel for this insight! I do manage to get a more optimal allocation of the resources by implementing this advice!

Also, on my side I am using the ASSOC mode without detection, and the ordering makes so much sense. I do see a slight speed-up by doing the ordering, so thanks for that!
> First of all, do your CPUs go to 100% now?
Yes, by removing these blank images, it seems to work now! And your insights on the parameters were very helpful, thank you.

However, the CPU usage dies off for the very last percent of sources (last 5-10%). This is the CPU usage for a full SE++ run (from 13:20 to 14:00, the max is 100%; 90% of measured sources was reached at around 13:40):
You have to keep in mind that each CPU runs one source. And if there is a large source or a large source group at the end of the processing that takes much longer than a typical source, those ones stay, and there is nothing to keep the other cores busy. So this flaring out is quite typical. If you want to confirm this, you can write out an unsorted catalog with a small chunk size, and the last entries should come from large objects that take very long to finish. We have been discussing changing the processing order and doing the tough ones first, which is possible for the no-detection processing @mShuntov is doing. The flip side would likely be heavy RAM usage at the beginning, when tackling the large and complex sources, which has disadvantages for the Euclid case.
> I've figured out why 16 bands was so much faster than 4 or 8... Out of the 16 bands, 2 were completely blank (only 0s on the images and weight maps) for the cutouts I considered. And it would seem that SE++ doesn't perform model fitting when one of the measurement images is completely blank. All of the sources ended with `fmf_stop_reason = 7` and the values for the parameters of my model (Sersic profile) were all set at their initial value in the output catalog. Is this normal or known? Or is it something wrong with my Python configuration file?
Which pixels are discarded or not depends a lot on your settings for the weight threshold. So it could well be that your zero pixels are in the fit and the minimization just does not work. You can check the stop reasons in the levmar documentation.
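One way to catch this situation early is to scan the measurement images for fully blank frames before launching a run. A minimal pure-Python sketch (the function name is illustrative; in practice you would load the FITS data, e.g. with astropy, and check the pixel arrays the same way):

```python
# Hypothetical pre-flight check: flag measurement frames that are entirely
# zero, since those produced fmf_stop_reason = 7 in the runs above.
def blank_frames(frames):
    """Return indices of 2-D pixel arrays that contain only zeros."""
    return [i for i, frame in enumerate(frames)
            if all(v == 0 for row in frame for v in row)]

frames = [
    [[0, 0], [0, 0]],   # blank frame (the weight map looked the same)
    [[0, 3], [1, 0]],   # frame with signal
]
blank_frames(frames)  # → [0]
```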
Hi,
I'm working with @mShuntov on running SE++ on big (more than 10,000 x 10,000 px) images. To make the computations faster, we're using Amazon Web Services EC2, which offers scalable cloud computing[^1]. I've worked on benchmarking SE++ on different image sizes and different EC2 machines to find the optimal machine to make SE++ run the fastest. However, I see that the CPU usage rarely reaches 100%, even on big images that take many hours to run with SE++. I tried to modify the `thread_count` parameter, but beyond a certain point it didn't seem to help.

Here are my (empirical) conclusions:
- If `thread_count` is set too low, SE++ can't use the full power of the machine and ends up being slowed down.
- If `thread_count` is set too high, the threads pile up and SE++ ends up being slowed down.
- Hyper-threading mostly raises CPU usage (you can raise `thread_count` accordingly, but this doesn't make SE++ faster).

Here are some plots summarizing my benchmark, and a more detailed analysis:
- For the smaller images, the runtime plateaus with `thread_count`, making no improvement on the runtime after some point. Another surprise is to see how 16 bands is much faster than 2, 4 or 8 bands. I don't understand where this big gap comes from!
- For the biggest task, the runtime keeps improving with `thread_count`. Here the c6a.4xlarge machine is the bottleneck for such a big task, and we can see that 8 bands is roughly twice as fast as 16 bands because the CPU is running at 100%. We also see the effect of disabling hyper-threading, lowering the CPU usage without sacrificing run time.

Could we be enlightened on the `thread_count` parameter and why SE++ doesn't seem to always use the CPU at its full potential?

[^1]: I've written a tutorial with different bash scripts to make the use of AWS EC2 more friendly with VS Code and Jupyter notebooks: https://github.com/AstroAure/VSJupytEC2