ayllon commented 3 years ago

This is to follow up the findings:

I can reproduce easily on my laptop. Indeed, memory blows way over the limit set by the tile size. However, I have found a neat way of intercepting malloc calls, and logging anything above 100 MiB. Since memory sanitizer & co do not seem to spot leaks, it may be due to big allocations, and not so much about leaks.

So the trick is this one:

#include "ElementsKernel/System.h"
#include <dlfcn.h>

void* malloc(size_t size) {
  static void* (*libc_malloc)(size_t size) = (void* (*)(size_t))(dlsym(RTLD_DEFAULT, "__libc_malloc"));
  if (size > 100 * 1024 * 1024) {
    auto back = Elements::System::backTrace(8, 0);
    logger.warn() << "Allocation of " << size / (1024 * 1024) << " MiB!";
    for (auto& b : back) {
      logger.warn() << "From " << b;
    }
  }
  return libc_malloc(size);
}

I can get useful output as

2021-06-10T15:55:34CEST SourceXtractor  WARN : Allocation of 144 MiB!
2021-06-10T15:55:34CEST SourceXtractor  WARN : From #1  0x56201b07cd34 malloc  [sourcextractor++]
2021-06-10T15:55:34CEST SourceXtractor  WARN : From #2  0x7f9c2d7473d0 dlevmar_dif  [/lib64/liblevmar.so.2.6]
2021-06-10T15:55:34CEST SourceXtractor  WARN : From #3  0x7f9c2f1559f4 ModelFitting::LevmarEngine::solveProblem(ModelFitting::EngineParameterManager&, ModelFitting::ResidualEstimator&)  [/home/aalvarez/Work/Projects/SourceXtractorPlusPlus/SourceXtractorPlusPlus/build.x86_64-fc33-gcc103-dbg/lib/libModelFitting.so]
2021-06-10T15:55:34CEST SourceXtractor  WARN : From #4  0x7f9c2fb8e9d0 SourceXtractor::MoffatModelFittingTask::computeProperties(SourceXtractor::SourceInterface&) const  [/home/aalvarez/Work/Projects/SourceXtractorPlusPlus/SourceXtractorPlusPlus/build.x86_64-fc33-gcc103-dbg/lib/libSEImplementation.so]
2021-06-10T15:55:34CEST SourceXtractor  WARN : From #5  0x7f9c2f7771b0 SourceXtractor::SourceWithOnDemandProperties::getProperty(SourceXtractor::PropertyId const&) const  [/home/aalvarez/Work/Projects/SourceXtractorPlusPlus/SourceXtractorPlusPlus/build.x86_64-fc33-gcc103-dbg/lib/libSEFramework.so]
2021-06-10T15:55:34CEST SourceXtractor  WARN : From #6  0x7f9c2fb93b6c SourceXtractor::MoffatModelEvaluatorTask::computeProperties(SourceXtractor::SourceInterface&) const  [/home/aalvarez/Work/Projects/SourceXtractorPlusPlus/SourceXtractorPlusPlus/build.x86_64-fc33-gcc103-dbg/lib/libSEImplementation.so]
2021-06-10T15:55:34CEST SourceXtractor  WARN : From #7  0x7f9c2f7771b0 SourceXtractor::SourceWithOnDemandProperties::getProperty(SourceXtractor::PropertyId const&) const  [/home/aalvarez/Work/Projects/SourceXtractorPlusPlus/SourceXtractorPlusPlus/build.x86_64-fc33-gcc103-dbg/lib/libSEFramework.so]
2021-06-10T15:55:34CEST SourceXtractor  WARN : From #8  0 local  [/home/aalvarez/Work/Projects/SourceXtractorPlusPlus/SourceXtractorPlusPlus/build.x86_64-fc33-gcc103-dbg/lib/libSEImplementation.so]

Some findings:

Lutz uses parses in chunks of 150 MiB, since we use width * chunk_height. I think up to 300 MiB may be used for a moment, since Thresholded allocates, and Variance and/or Image may have allocated. Since this is opaque, it is tricky to follow up.
During cleaning, levmar can allocate chunks as big as 844 MiB for the Moffat fitting. There is no grouping, so I am surprised. Something to check in more detail.
During model fitting itself, I have seen allocations of up to half a giga just for the raster.

3 is mostly due to the PSF. Oversampling is 6x, so you get 36x number of pixels. What is worse, you have the original raster plus the padded raster required for the convolution. That's a giga for a single source per frame, which is just insane. Multiply per number of threads and kaboom.

I can see some room for improvement, but ultimately, for 3, little to do, IMHO. We can half, maybe divide by 3, but that will still be up to 100 GiB. The tile manager can't help it.

@mkuemmel, @marcschefer

ayllon commented 3 years ago

The good news, I can reproduce. In other good news, the number of objects in flight is limited like the throttle branch is supposed to do (at around 6.5k sources / 5.5k groups).

This may be something else.

mkuemmel commented 3 years ago

That run was for #382 which was defined in mid-July. Then it did run through, now it doesn't in develop. So the changes to develop since then must be responsible.

ayllon commented 3 years ago

Argh, my mistake! I think I did Ctrl+X instead of Ctrl+C 🤦🏼

Yet another reason not to copy-paste code 😅

At least it is an easy fix. Let me verify and I will submit a patch.

ayllon commented 3 years ago

I'd say that's fixed.

throttle_wcs_leak

I don't know why the tools I am using didn't catch that leak 😞

mkuemmel commented 3 years ago

Looks like this is the breakthrough: VSZ_throttle The red curve corresponds to the current status of develop and consumes almost a factor of 2 less RAM!! The runs has not finished yet (gnawing on the last object, see #382). We are down from ~10GB/thread to something above 5GB/thread. That's a different ballpark now! Also the lazy_stamps #390 promise to give some more relieve.

Looks like the best news since this ticket was created (3 months ago)!

ayllon commented 3 years ago

Nice 😄

Surprisingly, though, for #382 I get this

throttle382

It doesn't seem to suffer that much with the last source 🤔

mkuemmel commented 3 years ago

Your run is a bit quicker as well. I assume you ran it on a different machine. How many threads did you use?

Also I have different object numbers detected/deblended/measured = 147819/52038/52038 at the end.

382 might strongly depend on the circumstances and can be acceptable. I'll try to identify the "last" object in my case (not sure whether I had done that already...).

ayllon commented 3 years ago

How many threads did you use?

Compiled with o2g.

TBH I was hitting a problem with the moffat fitting and I wondered how you managed to get it running. Maybe I do not have exactly the same images? Although I got them from irod.

mkuemmel commented 3 years ago

the images are the same since the end of May;
I did not compile with o2g (would not know how to;
the # of threads are the same;
I did not use moffat fitting, maybe a difference in the configs? Anyway, looking at how smooth the last objects run through (for me) I am really wondering why the last one should take that much more time. I think there is still something, but let's be happy that the memory footprint is that much smaller and continue this at #382.

mkuemmel commented 3 years ago

Hmmmm, I am afraid I was a bit too optimistic on Friday. 0.15 is not really a good reference, already in July there was already improvement. Comparing the "classical" Disk+Bulge fit to the dataset (2.July <--> now) it is: VSZ_comparison

The same is true for the Single Sersic fit https://github.com/astrorama/SourceXtractorPlusPlus/issues/361#issuecomment-921588312 . So the throttle improves, but not really a lot. Shoot.

mkuemmel commented 3 years ago

Hier the comparsion July vs. now for the Sersic fit: VSZ_SER

ayllon commented 3 years ago

The last plot is for what use case? #382?

mkuemmel commented 3 years ago

The last plot is for what use case? #382?

Yes.

mkuemmel commented 3 years ago

Following today's discussion I re-made the plot using the RSS. Also I shifted the blue curve (time*0.96-2.0) to have a better comparison: RSS_SE++

The improvement is between 10GB at the beginning and 4GB towards the end (my estimate...).

ayllon commented 3 years ago

Which is consistent with my own estimation. Good 😄 That's a ~10% saving, so I'd say the queue limitation is worth it (remember it is 20% for the challenge data!)

mkuemmel commented 3 years ago

Here the diagram for disk+bulge fitting to the 'usual' dataset: DB_FULL_RSS

The throttle code uses ~5-6 GB less memory, rather constant across the run.

mkuemmel commented 3 years ago

@ayllon does it make sense to do a local merge of the lazy stamps into develop and do one of the runs using that? Will there be a problem doing that merge?

mkuemmel commented 3 years ago

Here the Sersic fitting with the lazy stamps on the big dataset: Ser_lazy_stamps The improvement on the memory seems not to be big, but it is quicker!

Interesting!

mkuemmel commented 2 years ago

Here a comparison of before (see above) and after #422: Ser_RSS_comparison There are 11 sources skipped due to memory issues in levmar.

mkuemmel commented 2 years ago

Here a RAM comparison for the data of #384: rss_comparison The blue curve had crashed at its end, and the projected runtime is similar to the red one. The red curve includes meta-iterations and converged for all sources.

marcschefer commented 2 years ago

422 Can we consider this closed?

ayllon commented 2 years ago

I guess, @mkuemmel ?

mkuemmel commented 2 years ago

Sure, no problem.

astrorama / SourceXtractorPlusPlus

Extremely high memory consumption #361

382 might strongly depend on the circumstances and can be acceptable. I'll try to identify the "last" object in my case (not sure whether I had done that already...).

422 Can we consider this closed?