Closed ayllon closed 2 years ago
The good news, I can reproduce. In other good news, the number of objects in flight is limited like the throttle branch is supposed to do (at around 6.5k sources / 5.5k groups).
This may be something else.
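The throttling idea mentioned above (capping how many sources/groups are in flight at once) can be sketched as a simple admission counter. This is a generic illustration, not the actual code of the throttle branch; the `Throttle` type, function names, and capacity are all hypothetical:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Caps how many objects may be "in flight" at once.
 * Generic sketch; not the actual SourceXtractor++ throttle code. */
typedef struct {
    atomic_int available;   /* free slots remaining */
} Throttle;

static void throttle_init(Throttle *t, int capacity) {
    atomic_init(&t->available, capacity);
}

/* Try to admit one more object; returns false once the cap is reached. */
static bool throttle_try_admit(Throttle *t) {
    int prev = atomic_fetch_sub(&t->available, 1);
    if (prev <= 0) {                       /* over capacity: undo and refuse */
        atomic_fetch_add(&t->available, 1);
        return false;
    }
    return true;
}

/* Release a slot once the object has been measured and freed. */
static void throttle_release(Throttle *t) {
    atomic_fetch_add(&t->available, 1);
}
```

A worker would call `throttle_try_admit` before taking on a new group and `throttle_release` after the group is fully measured, so memory stays bounded no matter how fast the detection stage runs ahead.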
That run was for #382 which was defined in mid-July. Then it did run through, now it doesn't in develop. So the changes to develop since then must be responsible.
Argh, my mistake! I think I did Ctrl+X instead of Ctrl+C 🤦🏼
Yet another reason not to copy-paste code 😅
At least it is an easy fix. Let me verify and I will submit a patch.
I'd say that's fixed.
I don't know why the tools I am using didn't catch that leak 😞
Looks like this is the breakthrough: The red curve corresponds to the current status of develop and consumes almost a factor of 2 less RAM!! The run has not finished yet (gnawing on the last object, see #382). We are down from ~10GB/thread to something above 5GB/thread. That's a different ballpark now! Also the lazy_stamps #390 promises some more relief.
Looks like the best news since this ticket was created (3 months ago)!
Nice 😄
Surprisingly, though, for #382 I get this
It doesn't seem to suffer that much with the last source 🤔
Your run is a bit quicker as well. I assume you ran it on a different machine. How many threads did you use?
Also I have different object numbers detected/deblended/measured = 147819/52038/52038 at the end.
o2g
TBH I was hitting a problem with the Moffat fitting and I wondered how you managed to get it running. Maybe I do not have exactly the same images? Although I got them from irod.
Hmmmm, I am afraid I was a bit too optimistic on Friday. 0.15 is not really a good reference; there was already improvement in July. Comparing the "classical" Disk+Bulge fit to the dataset (2 July <--> now):
The same is true for the Single Sersic fit: https://github.com/astrorama/SourceXtractorPlusPlus/issues/361#issuecomment-921588312. So the throttle improves things, but not really a lot. Shoot.
Here the comparison July vs. now for the Sersic fit:
The last plot is for what use case? #382?
Yes.
Following today's discussion I re-made the plot using the RSS. Also I shifted the blue curve (time*0.96-2.0) to have a better comparison:
The improvement is between 10GB at the beginning and 4GB towards the end (my estimate...).
Which is consistent with my own estimation. Good 😄 That's a ~10% saving, so I'd say the queue limitation is worth it (remember it is 20% for the challenge data!)
Here the diagram for disk+bulge fitting to the 'usual' dataset:
The throttle code uses ~5-6 GB less memory, rather constant across the run.
@ayllon does it make sense to do a local merge of the lazy stamps into develop and do one of the runs using that? Will there be a problem doing that merge?
Here the Sersic fitting with the lazy stamps on the big dataset: the memory improvement does not seem to be big, but it is quicker!
Interesting!
Here a comparison of before (see above) and after #422: There are 11 sources skipped due to memory issues in levmar.
Here a RAM comparison for the data of #384: The blue curve had crashed at its end, and the projected runtime is similar to the red one. The red curve includes meta-iterations and converged for all sources.
I guess, @mkuemmel ?
Sure, no problem.
This is to follow up the findings:
I can reproduce easily on my laptop. Indeed, memory blows way over the limit set by the tile size. However, I have found a neat way of intercepting `malloc` calls and logging anything above 100 MiB. Since the memory sanitizer & co do not seem to spot leaks, it may be due to big allocations, and not so much about leaks. So the trick is this one:
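Since the original snippet did not survive in this thread, here is a minimal sketch of such a `malloc` hook, assuming the usual `dlsym(RTLD_NEXT, ...)` / `LD_PRELOAD` interposition technique. Only the 100 MiB threshold comes from the comment above; everything else is my reconstruction:

```c
#define _GNU_SOURCE
#include <dlfcn.h>      /* dlsym, RTLD_NEXT */
#include <stdio.h>      /* snprintf */
#include <stdlib.h>
#include <unistd.h>     /* write, STDERR_FILENO */

#define THRESHOLD (100ul * 1024 * 1024)   /* log anything >= 100 MiB */

static void *(*real_malloc)(size_t) = NULL;

/* Our malloc shadows libc's; it forwards to the real one after logging. */
void *malloc(size_t size) {
    if (!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    if (size >= THRESHOLD) {
        /* Stack buffer + write() so logging itself never calls malloc */
        char buf[64];
        int n = snprintf(buf, sizeof buf, "big malloc: %zu MiB\n",
                         size / (1024 * 1024));
        if (n > 0)
            write(STDERR_FILENO, buf, (size_t)n);
    }
    return real_malloc(size);
}
```

Built as a shared object (link with `-ldl` on older glibc) and injected with something like `LD_PRELOAD=./mallochook.so sourcextractor++ ...` (file name hypothetical), this prints one line per oversized allocation without touching the program itself.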
I can get useful output as
Some findings:

1. `Lutz` parses in chunks of 150 MiB, since we use `width * chunk_height`.
2. I think up to 300 MiB may be used for a moment, since `Thresholded` allocates, and `Variance` and/or `Image` may have allocated as well. Since this is opaque, it is tricky to follow up.
3. Mostly due to the PSF. Oversampling is 6x, so you get 36x the number of pixels. What is worse, you have the original raster plus the padded raster required for the convolution. That's a giga for a single source per frame, which is just insane. Multiply by the number of threads and kaboom.
I can see some room for improvement, but ultimately, for 3, there is little to do, IMHO. We can halve it, maybe divide by 3, but that will still be up to 100 GiB. The tile manager can't help it.
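To make the 36x factor concrete, a back-of-envelope helper. The stamp side, the 2x padding factor, and the 8-byte pixel size are illustrative assumptions; only the 6x-per-axis oversampling comes from the discussion above:

```c
/* Approximate working-set size, in bytes, of one oversampled PSF raster
 * plus its padded copy for the convolution. All parameters illustrative. */
long psf_raster_bytes(long stamp_side, long oversample, long bytes_per_px) {
    long side   = stamp_side * oversample;   /* 6x per axis -> 36x pixels  */
    long pixels = side * side;               /* oversampled raster         */
    long padded = (2 * side) * (2 * side);   /* assume 2x side for padding */
    return (pixels + padded) * bytes_per_px;
}
```

Even with a modest hypothetical 512-px stamp this gives `psf_raster_bytes(512, 6, 8)` ≈ 360 MiB for a single source in a single frame; scale that by the number of frames and threads and the gigabytes quoted above are easy to reach.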
@mkuemmel, @marcschefer