Looking in more detail:
213 MiB for the raster + 225 MiB for the buffer used to convolve + 450 MiB for the intermediate buffer (of complex numbers) + 450 MiB for the kernel transformation (also, complex numbers) + 225 MiB for the re-centered kernel + 225 MiB for the padded kernel.
That's 1.7 GiB peak 😞 Some of it is transient, but I think 1.3 GiB is steady during the fit. Again, per frame.
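For bookkeeping, a tiny sketch of how these numbers add up, assuming 4-byte floats for the real buffers and 8-byte single-precision complex values for the two frequency-domain ones (the MiB figures are just the ones quoted above):

```cpp
// Back-of-the-envelope accounting for the per-frame peak quoted above.
#include <cstdio>

int main() {
  const double raster      = 213;  // MiB, raster of the source
  const double conv_buffer = 225;  // MiB, buffer used to convolve
  const double interm_cplx = 450;  // MiB, intermediate complex buffer (2x a float buffer)
  const double kernel_cplx = 450;  // MiB, transformed kernel, also complex
  const double recentered  = 225;  // MiB, re-centered kernel
  const double padded      = 225;  // MiB, padded kernel

  const double peak = raster + conv_buffer + interm_cplx + kernel_cplx + recentered + padded;
  std::printf("peak per frame: %.0f MiB (~%.1f GiB)\n", peak, peak / 1024.0);
  return 0;
}
```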
Some possibilities:
Any other ideas? I don't think with the current approach we can go below 0.7 GiB, assuming we can get rid of the redundant working areas.
First of all, it's great that there is now a tool to track the RAM consumption.
What I don't understand is the following:
I think that the 190k image has a few more big objects.
It would also depend on the order in which they arrive, but it seems the 190k image could keep all threads busy with big sources, while the 4k image may have only one of the cores doing that.
Might also be a red herring.
Just an idea: could the possible explanation for the delayed flushes to the output (see https://github.com/astrorama/SourceXtractorPlusPlus/issues/362#issuecomment-859655502) also have an impact here? If memory can only be released when the results are flushed, and the flush does not happen because of a massive object earlier in the sequence, then the RAM could accumulate to high numbers.
The test run on the full 19x19k image with Sersic fitting needed two days for all objects except one. The last object took another two days, with ~20k results not flushed and still occupying 140 GB of RAM.
After finishing Sersic I tried running Disk+Bulge on the entire 19x19k image. It dumped after a couple of hours; 350 GB of RAM is just not enough :-)
I have made some tests, and it seems each object consumes roughly 11 KiB while not flushed. I am probably missing some bytes, though. Being conservative, 20k sources should "only" take 400 MiB of memory.
On one hand, the fact that Sersic-only consumed less than half makes me think that the problem lies in the model fitting. On the other hand, if only one source is being fitted, the memory should not stay at 140 GiB!
Either we leak a lot of small objects, or a handful of very big ones, or there is heap fragmentation (but that bad?).
Have we confirmed this happens regardless of whether we use GSL or levmar?
I have not run with GSL while monitoring the RAM consumption, especially not for a large image / many objects, where this comes into effect.
I'll do that tomorrow at the latest. It's a bit tricky: when running on the 19x19k image and the RAM is used up, you get warning emails, and some other users are concerned that this might tear down other services (which it is not doing).
I checked and did a Disk+Bulge fitting on an 8x8k subimage with both levmar and GSL minimization. I monitored the RAM consumption and it was about the same, reaching up to 55 GB (there are ~9k objects).
Thanks, I think this is useful information, as we can rule out levmar being the cause of the problem!
I do not think it is a leak either. I have compiled with tcmalloc and run with their leak checker.
I mean, there are some leaks, but as far as I can tell, they relate to the Python initialization, and maybe one or two minor ones in the wcs part. But that would leak a handful of bytes per image at most, and would not pile up during the runtime.
Of course, this only means there are no dangling pointers, or frees we forgot to do (that's what smart pointers are there to do, after all!). This does not mean that image chunks or whatever exist for longer than they should.
(Although I have my doubts too, I added a couple of asserts making sure that a released tile is not kept alive by someone else, and it does not seem to be the case)
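For reference, the kind of assert I mean is just a uniqueness check on the shared_ptr at the moment the cache drops a tile; roughly like this (class and member names are made up, not the actual tile manager code):

```cpp
// Illustrative only: assert that nobody outside the cache still holds a
// reference to a tile when the tile manager evicts it.
#include <cassert>
#include <list>
#include <memory>

struct Tile { /* pixel buffer, coordinates, ... */ };

class ToyTileManager {
public:
  void evictOldest() {
    if (m_cache.empty()) return;
    std::shared_ptr<Tile> tile = std::move(m_cache.front());
    m_cache.pop_front();
    // If this fires, someone else keeps the tile alive and the memory is not
    // actually returned when we think it is.
    assert(tile.use_count() == 1);
  }

private:
  std::list<std::shared_ptr<Tile>> m_cache;
};
```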
Most/all of the memory is managed by smart pointers. Assuming that their associated memory is indeed freed when no reference exists anymore, a logical reason for the RAM consumption is that references exist longer than we think and longer than necessary. I made a run yesterday monitoring:
Plotting the # of objects in memory (measured objects - objects in output table) vs. RAM gives this:
While the correlation is less strong than I hoped, there is a correlation. This means that making sure objects are flushed when intended (see #362) should at least help.
Concerning the correlation: a strong correlation is expected only if the RAM consumption per object is about the same. But this is clearly not the case; in fact, the large and difficult-to-analyze objects are the reason for x-axis values >300. Even when the results are flushed to the output, some values remain in order to construct the model- and residual-checkimages (which were not constructed in this run). Maybe these classes still 'hide' some pointers to large chunks of RAM?
Nice plot, thanks! Another explanation for the correlation is that fitting big cutouts takes a lot of memory and time, blocking the sources that came before them.
In any case, this is a good motivation to allow immediate flushing; at least it should confirm and/or rule out hypotheses. I will add it, it should be easy to do.
By the way, I tried with the pyston branch and I have the same issue, so I think this confirms the leak on the Python side comes from the initialization.
So I ran the dataset yesterday with the new unsorted option, and here is the result: as @ayllon had suspected, this is not really a breakthrough. The RAM consumption is lower, but a large portion of it is still not really accounted for. The similarity of the pattern when plotted against the # of measured objects is quite striking. Some other tidbits: the source ID of the flushed objects was usually ~11-13 objects ahead. This means that most of the 16 threads work on objects that need a long time to finish. This is already the case at < 2000 objects, so there is not really an argument for why more than ~20 GB are being used. Also, the cutout was selected against large objects, which are stars in this case.
It seems to be indeed the convolution. Almost 12 GiB out of 17.5 GiB come from there. I attach a graph with the (live) allocations after 2 days running with the tcmalloc profiler on (it is super slow compiled like that! Only 3.5% of the measurements done).
I am checking on the in-place transform, that should reduce some memory (sketch below). Other things to look at:
1. the convolution context
2. `dlevmar_dif`
Pretty sure 2 is very much sub-optimal even if the memory does not leak.
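For illustration, this is roughly what the in-place transform buys: one padded buffer serves as both the real input and the complex output, instead of keeping a separate complex scratch array. A minimal sketch assuming single-precision FFTW (the actual code in the branch may well differ):

```cpp
#include <cstddef>
#include <fftw3.h>
#include <vector>

int main() {
  const int nx = 512, ny = 512;            // padded convolution size (example values)
  const int ny_padded = 2 * (ny / 2 + 1);  // in-place r2c needs padded rows

  // One buffer is both the real input and, reinterpreted, the complex output.
  std::vector<float> buf(static_cast<std::size_t>(nx) * ny_padded, 0.0f);
  auto* cplx = reinterpret_cast<fftwf_complex*>(buf.data());

  fftwf_plan fwd = fftwf_plan_dft_r2c_2d(nx, ny, buf.data(), cplx, FFTW_ESTIMATE);
  fftwf_plan bwd = fftwf_plan_dft_c2r_2d(nx, ny, cplx, buf.data(), FFTW_ESTIMATE);

  // ... fill buf with the padded image, execute fwd, multiply by the
  // transformed kernel in place, execute bwd, normalize by nx*ny ...
  fftwf_execute(fwd);
  fftwf_execute(bwd);

  fftwf_destroy_plan(fwd);
  fftwf_destroy_plan(bwd);
  return 0;
}
```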
I did a run on my test dataset using the branch feature/inplace_dft: https://github.com/astrorama/SourceXtractorPlusPlus/pull/367
A comparison of the RAM consumption: that branch indeed needs less RAM! The "unsorted_flush" option is not available in this branch; it would further decrease the RAM consumption, as the comparison of the blue and green curves shows. According to the figure, "inplace_dft" is also faster, but it might be that the machine was a bit busier when the blue and green curves were computed.
Nice! Unfortunately I do not have a machine where I can risk exhausting that much memory without the sysadmins knocking at my door, so thanks for verifying it 😀
That branch makes it possible to run the entire 19x19k image, which will need 5-6 times that RAM, on the SDC machine. Right now that does not make sense since the results on the subimage are not really good :-( , a discussion we are having in a different context.
Just to give some more input and information, I ran yesterday the dataset fitting Disk+Sersic models to the data. The RAM curve is:
The red line is new, the blue is there mostly for reference and corresponds to the red line above in https://github.com/astrorama/SourceXtractorPlusPlus/issues/361#issuecomment-866644910
Not sure what to interpret into that.
Disk+Bulge has two more free parameters per frame, doesn't it? It has a multiplicative effect on the memory used by the fitting.
On top of that, there is a convolution context per model, so if you have two models (i.e. disk + bulge) instead of one (just Sersic), the memory used is doubled. This is due to the design of the model fitting, where there is one PSF per model, not per object. There should be some room for reduction, sharing the "work area" per source, or even per frame.
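To make the "shared work area" idea concrete, a hypothetical sketch (all names are made up; it is not how the model fitting is structured today):

```cpp
// Instead of each convolved model owning its own convolution buffers, models
// of the same source (or frame) borrow them from a shared context, so N
// models cost one set of buffers instead of N.
#include <memory>
#include <vector>

struct ConvolutionWorkArea {
  std::vector<float> padded_image;
  std::vector<float> padded_kernel;
  std::vector<float> scratch;  // complex scratch, stored as interleaved floats
};

class ConvolvedModel {
public:
  explicit ConvolvedModel(std::shared_ptr<ConvolutionWorkArea> work_area)
      : m_work_area(std::move(work_area)) {}

  void raster() {
    // ... would run the convolution using m_work_area buffers instead of
    // allocating its own ...
  }

private:
  std::shared_ptr<ConvolutionWorkArea> m_work_area;
};
```

Sharing per source is safe as long as the models of one source are rastered sequentially; sharing per frame would additionally need one work area per worker thread.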
Disk+Sersic (red line above) has one parameter more than Disk+Bulge (blue line above).
In that sense the # of models is the same and the # of parameters is one more.
There is one further difference: in Disk+Bulge I asked for model- and residual-checkimages, which may increase RAM consumption and processing time.
Here the figure for the RAM consumption during a Disk+Sersic run on the entire 19x19k image:
Note that the run broke 50% through due to https://github.com/astrorama/SourceXtractorPlusPlus/issues/369 .
Here a comparison of the RAM consumption for the Disk+Sersic fitting and the Disk+Bulge fraction fitting as a function of time, and here as a function of the fraction of processed objects (which means that about the same objects are being processed). The diagrams end where the runs break.
To me there seems to be a difference, in the sense that with Disk+Bulge no significant amount of RAM gets released, while for Disk+Sersic the line oscillates around some kind of average.
The last thing I have tried is to get the malloc internal stats using `malloc_info`. By the time sourcextractor is processing the last, long-lasting, source, the RSS is around 25.4 GiB.
I am having trouble finding out what the output of `malloc_info` actually means but, as far as I can tell, almost 10 GiB is actually "free", but not returned to the system.
What seems to match the memory reported by `htop` is actually the sum of `mmap` (around 9 GiB) and `total` (around 17 GiB). Both add up to the total RSS.
So, in summary, RSS is 25.4 GiB, of which 10 are actually free, but not given back to the system. Of the ~15.4 GiB in use, we can expect 6 GiB to belong to the tile manager. That leaves us with around 9 GiB to be accounted for.
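For anyone who wants to poke at the same thing: the stats come from glibc's `malloc_info`, and `malloc_trim` can be used to check how much of that "free but not returned" memory the allocator is willing to hand back. A minimal, glibc-specific snippet:

```cpp
#include <malloc.h>  // malloc_info, malloc_trim (glibc)
#include <cstdio>

void dump_malloc_state(const char* tag) {
  std::printf("== malloc_info (%s) ==\n", tag);
  malloc_info(0, stdout);  // XML report: arenas, free blocks, mmapped bytes
}

int main() {
  dump_malloc_state("before trim");
  const int released = malloc_trim(0);  // returns 1 if memory was given back to the kernel
  std::printf("malloc_trim released memory: %s\n", released ? "yes" : "no");
  dump_malloc_state("after trim");
  return 0;
}
```

If `malloc_trim` gives a good chunk back, the 10 GiB would just be the allocator caching free (possibly fragmented) chunks rather than live data.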
The last source is, super-sampled, 8553*8553 pixels for the raster. For the convolution, with padding and all that, 290 MiB, times three (kernel, image and buffer): 870 MiB. Plus 185 MiB for the levmar workspace. 1 GiB, give or take.
I don't know where the last 8 GiB have gone.
After several days running with tcmalloc and dumping plots of the leaked data, it seems we leak just a tiny bit, but it comes from `dlevmar_dif -> dlevmar_trans_mat_mat_mult -> dgemm_ -> mkl_blas_dgemm`.
The leak is currently less than 1 MiB, though, and that is at 2% of objects measured, so I do not think it can pile up much.
Isn't it possible even a small leak could contribute to memory fragmentation problems?
I guess it could contribute to the 10 GiB "free" but not returned to the system, true.
After 24 more hours, the leak report hasn't changed, so it may be an initialization issue or something like that. RSS hasn't moved since yesterday. If anything, it may have shrunk by a few MiB.
RSS is 18.1G at 2.67% measured. I am going to cancel this run because it takes forever with the heap profiler. It has been running for 5 days already.
The Disk+Bulge run has now spent over 24h on the last object alone. I'll let it run, but I don't expect anything good from it. We have had cases where it just broke after working for 2 days on the last object.
Here RAM consumption vs. time: since it is still running (but the log is not updated), the blue line would continue for >24h at 84 GB, which is about the last y-value. It would be nice to know whether the blue curve comes gently down (as the red one does) when finishing; let's see.
Here RAM consumption vs. the fraction of processed sources: The blue line does not go to 1.0 due to the one missing source.
Here the fraction of processed sources vs. time: Disk+Bulge is consistently slower than Disk+Sersic.
We can discuss more on that in the next telecon tomorrow.
Here the data for the plots above, so everyone can make plots: db_full_run.txt ds_full_run.txt
The plan is also to make a Sersic fitting on the entire field. Since 24% of the available RAM is currently bound to the still running Disk+Bulge fitting, I'll wait another day and maybe decide to stop it if it does not yet finish.
Here a comparison of the # of iterations in Disk+Bulge (x-axis) and Disk+Sersic (y-axis) after cross-matching the two catalogs:
I just checked: if I take the 10% brightest sources, the plot does not change much. The point cloud is less dense, but the correlation is not higher.
Here now the RAM plots including the Sersic fitting (green). Note that:
- also here I had to interrupt, since the last source was running for six hours, so the fraction on the x-axis does not go all the way to 1.0 for blue and green;
- the galaxy photometry is not really good.
The Sersic fitting is quicker and uses less RAM, which is to be expected. In all runs 10GB are bound to the tile-memory.
Please let me know your thoughts!
Here the data for the Sersic fitting: ser_full_run.txt
The shape of the three curves, when fraction of sources are on x, are suspiciously similar, aren't they?
The curves definitely have some similarities. But isn't that expected? In the detection everything is identical, which means that the sequence of objects sent to measurement is the same. Assuming the RAM consumption is triggered by some big objects, they count against the RAM at about the same time.
These are my results using 4 threads and tcmalloc.
Curiously, it does finish; I do not have to kill it.
I just started with 16, let's see how the memory usage changes.
I think you do not have the full dataset but only the 4x4k or 8x8k and the corresponding single images. Maybe we should try to find a way to transfer everything...
While I get them, here is tcmalloc, 16 threads
Not so surprisingly, the memory consumption is higher. I find it interesting, though, that the RSS curve is smoother than in @mkuemmel's runs. What is bad is that it stays high while processing the last source. Same as with glibc's malloc.
I have to admit that what is labelled "RAM" in my figures is the value VSZ from 'pidstat'. I googled at some point and could not figure out what's more appropriate.
Also I found that VSZ and RSS correlate rather well:
I can change to RSS in the future...
Now that I have the full dataset, I can see the memory consumption rapidly growing to 49 GiB. More than 15 GiB are used in 247 memory-mapped regions of 65028 KB each. Not sure if that K is 1000 or 1024 bytes, but that is only enough for a square float image of about 4080x4080 pixels. Suspiciously close to the size of a tile, but not quite (4096*4096*4, right?). That's weird.
In summary, there are almost 37 GiB in mapped regions smaller than the tile size, and "only" around 9.3 GiB in regions big enough to hold a tile (or several) :confused:
32.8 GiB are spent on regions that are almost, but not quite, as big as a tile.
I guess I will have to run with another tile size and see if the correlation stays the same. Also, catching mmaps of precisely 65028 KB may help find out what's going on.
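Spelling out the arithmetic behind "suspiciously close, but not quite" (assuming 4 bytes per pixel and that the 65028 above is in KiB):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  const long tile_bytes   = 4096L * 4096L * 4L;  // = 65536 KiB per tile
  const long region_bytes = 65028L * 1024L;      // size of the mapped regions
  std::printf("tile:   %ld KiB\n", tile_bytes / 1024);
  std::printf("region: %ld KiB, enough for a ~%.0f x %.0f float image\n",
              region_bytes / 1024,
              std::sqrt(region_bytes / 4.0), std::sqrt(region_bytes / 4.0));
  return 0;
}
```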
chap is an amazing tool! I am using it to inspect a core dump, and so far I can see:
- `Pattern %ContainerPythonObject has 2853321 instances taking 0x8090d508(2,156,975,368) bytes`: 2 GiB seem to be in use by Python containers (??)
- `Pattern %VectorBody has 2285781 instances taking 0xb3ab22040(48,229,392,448) bytes`: 45 GiB (!!) are in use by vectors. To be expected, really, but good to confirm.
These are not classic leaks. Someone is pointing at those.
Since it is Friday 16:00 and I am about to leave for vacation, let me summarize some findings:
The cumulative memory consumption vs the container size looks like this:
There is a jump at precisely the tile size, very roughly matching the tile memory limit. That's good and to be expected.
To the right we have a few allocations consuming a lot of memory. To the left we have many, many small allocations that also consume a lot of memory.
I could somehow understand some of the big allocations (model fitting rastering, residual, levmar working area, etc.), but the myriad of relatively small ones accounting for almost half of the total consumption is weird.
I am trying to understand where they come from, but I have trouble understanding the output. I am betting on `PixelCoordinates` for the smallest population (makes sense: every detected object holds the coordinates for every detected pixel). They are not problematic in terms of total size (ballpark < 100 MiB).
Anyhow, one interesting thing reported by chap is the corruption of several malloc fast bins. That can't be good by any measure.
I have re-run enabling the malloc self-test (`-lmcheck`) and I get a kaboom 💥 pretty soon, pretty much after the background has been computed:
memory clobbered before allocated block
[snip]
#4 0x00007f443e0bb2fe in mabort (status=<optimized out>) at mcheck.c:360
#5 0x00007f443e0bb3cb in checkhdr (hdr=hdr@entry=0x55eee02af2f0) at mcheck.c:111
#6 0x00007f443e0bb7cd in checkhdr (hdr=0x55eee02af2f0) at mcheck.c:184
#7 freehook (ptr=0x55eee02af320,
caller=0x7f443e4e7e80 <std::vector<std::type_index, std::allocator<std::type_index> >::_M_realloc_insert<std::type_index&>(__gnu_cxx::__normal_iterator<std::type_index*, std::vector<std::type_index, std::allocator<std::type_index> > >, std::type_index&)+350>) at mcheck.c:184
#8 0x00007f443e4e7e80 in __gnu_cxx::new_allocator<std::type_index>::deallocate (__p=0x55eee02af320, this=0x7ffca0e631a0)
at /home/astro/aalvarez/miniconda3/envs/build/x86_64-conda-linux-gnu/include/c++/9.3.0/ext/new_allocator.h:119
#9 std::allocator_traits<std::allocator<std::type_index> >::deallocate (__n=<optimized out>, __p=0x55eee02af320, __a=...)
at /home/astro/aalvarez/miniconda3/envs/build/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/alloc_traits.h:470
#10 std::_Vector_base<std::type_index, std::allocator<std::type_index> >::_M_deallocate (__n=<optimized out>, __p=0x55eee02af320, this=0x7ffca0e631a0)
at /home/astro/aalvarez/miniconda3/envs/build/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/stl_vector.h:351
#11 std::vector<std::type_index, std::allocator<std::type_index> >::_M_realloc_insert<std::type_index&> (this=this@entry=0x7ffca0e631a0, __position=...)
at /home/astro/aalvarez/miniconda3/envs/build/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/vector.tcc:500
#12 0x00007f443e4e7ee7 in std::vector<std::type_index, std::allocator<std::type_index> >::emplace_back<std::type_index&> (this=this@entry=0x7ffca0e631a0)
at /home/astro/aalvarez/miniconda3/envs/build/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/stl_iterator.h:806
#13 0x00007f443e4eb0c1 in SourceXtractor::OutputRegistry::getSourceToRowConverter (this=0x55eee012d1b0, enabled_properties=...)
at /home/astro/aalvarez/Work/Projects/SourceXtractorPlusPlus/SourceXtractorPlusPlus/SEFramework/src/lib/Output/OutputRegistry.cpp:45
#14 0x00007f443e86abf2 in SourceXtractor::MeasurementFactory::getMeasurement (this=0x55eee0133920)
at /home/astro/aalvarez/miniconda3/envs/build/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:1309
#15 0x000055eededaad04 in SEMain::mainMethod (this=0x55eee0133720, args=...)
That's where the memory corruption has been detected, not necessarily where it happened. At least it is early, so maybe compiling and running with AddressSanitizer will help identify the source.
Please note that there is some support in Elements for building with the -fsanitize options of the gcc compiler (it just came back to my mind right now): https://github.com/astrorama/Elements/blob/35ee9d3a91dda3454a9b6dd0167a7449f56eddb1/cmake/ElementsBuildFlags.cmake#L274.
SANITIZE_OPTIONS activates it and SANITIZE_STYLE allows passing the needed parameter to the gcc -fsanitize option.
Good and not so good news.
The good news is that I have identified a property that consumes a lot of memory (stamps). A tentative fix, making the property lazy (so the cutout is generated when needed, not kept), confirms this:
The disadvantage is that it is a tad slower (which isn't that surprising!), but it is saving around 15 GiB.
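Conceptually the change is just turning the stored cutout into something re-extracted on demand; a rough sketch with made-up names (not the actual property classes):

```cpp
#include <memory>
#include <vector>

struct PixelRect { int x, y, width, height; };

class ImageInterface {  // stand-in for the tile-backed image access layer
public:
  virtual ~ImageInterface() = default;
  virtual std::vector<float> cutout(const PixelRect& rect) const = 0;
};

class LazyStamp {
public:
  LazyStamp(std::shared_ptr<const ImageInterface> image, PixelRect rect)
      : m_image(std::move(image)), m_rect(rect) {}

  // Pixels are re-extracted from the (tile-cached) image on demand, so a
  // source waiting to be flushed does not pin a full cutout in memory.
  std::vector<float> pixels() const { return m_image->cutout(m_rect); }

private:
  std::shared_ptr<const ImageInterface> m_image;
  PixelRect m_rect;
};
```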
I do not expect it to have an impact when the image fits in memory, though. My impression is that the segmentation moves the cache upwards, but sources keep piling up for the cleaning and measurements, so they will need to access tiles that were already evicted. This causes contention.
We can discuss other potential fixes, but at least I think this confirms where we have to look to save that much memory.
P.S. IMHO 64 MiB tiles are probably a bit too big. Reducing their size would likely help with the contention. I am also running on NFS, so I will pay dearly for any random-ish I/O.
I have good and bad news. When compiling for release, lazy stamps is as performant as develop, and consumes half the memory.
The bad news is that regardless of the branch, I get an unhelpful `WCS exception: One or more of the world coordinates were invalid` exactly at the same point (4884 / 501499 measured). But that's unrelated.
By the way, if the issue is the piling up of sources, I've got an idea, which is to add priorities to the thread pool. As long as there are sources to measure, they will take priority. That should allow getting rid of sources faster than queuing them.
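Something like a two-level queue would already express that priority; a bare-bones sketch (not the actual thread pool):

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

// Workers always drain the high-priority (measurement) queue first, so
// finished sources leave the pipeline faster than new ones are queued.
class TwoLevelQueue {
public:
  void push(std::function<void()> task, bool high_priority) {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      (high_priority ? m_high : m_low).push_back(std::move(task));
    }
    m_cv.notify_one();
  }

  std::function<void()> pop() {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cv.wait(lock, [&] { return !m_high.empty() || !m_low.empty(); });
    auto& queue = m_high.empty() ? m_low : m_high;
    auto task = std::move(queue.front());
    queue.pop_front();
    return task;
  }

private:
  std::mutex m_mutex;
  std::condition_variable m_cv;
  std::deque<std::function<void()>> m_high, m_low;
};
```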
I am running the mmap patch compiled for release in the meanwhile. Let's see how that one does.
mmap saves peanuts at the regime we are in now.
> By the way, if the issue is the piling up of sources, I've got an idea, which is to add priorities to the thread pool. As long as there are sources to measure, they will take priority. That should allow getting rid of sources faster than queuing them.
Looking good running over the challenge data.
Saves 20% of memory (peak to peak), way more than #390 for this dataset.
Comparing for problem #380 is trickier, since #391 makes the program abort. I can not compare halfway, I need to run the whole process.
The last step would be to combine both changesets and see what happens and decide which is better.
Ah! It seems I have finally found the bugger. The branch `fix/throttle` keeps the memory footprint hovering at the value you would, more or less, expect given the limit to the tile memory.
So the issue arises when some parts of the pipeline are faster than others. For instance, if the segmentation moves too fast, sources will pile up on the queue of the deblending. Each source has associated data (i.e., list of pixels) that can add up to a lot when there are many sources.
The priorities seemed to fix that in some cases because they put a cap on that part, but for #380 it wouldn't matter. It turns out the deblending output queue is the one piling up sources, which is made worse because cutouts and the like are added as properties. The memory blows up. And more cores make things worse, as sources just pile up faster on that queue.
The new branch uses a semaphore (see astrorama/Alexandria#69) to avoid accumulation. The disadvantage is that it becomes somewhat slower, of course (queueing and synchronization costs).
However, I disabled `FlexibleModelFitting`, so the measurement threads don't have a lot to do, and the overhead is more noticeable.
Interestingly, it is faster for the challenge.
Note that this benchmark keeps sources in memory until they are done, following detection order.
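To make the throttling idea concrete: it is conceptually just a counting semaphore bounding how many sources may be alive between two pipeline stages. A rough C++20 equivalent (the actual implementation lives in astrorama/Alexandria#69 and differs in the details):

```cpp
#include <cstddef>
#include <semaphore>

// A producer stage has to acquire a slot before emitting a source; the slot
// is released only when the consumer stage is done with it. This bounds the
// number of sources (and their attached pixel data) in flight.
class SourceThrottle {
public:
  explicit SourceThrottle(std::ptrdiff_t max_in_flight) : m_slots(max_in_flight) {}

  void beforeEmit()   { m_slots.acquire(); }  // e.g. called by the deblending stage
  void afterConsume() { m_slots.release(); }  // e.g. called when the source is flushed

private:
  std::counting_semaphore<> m_slots;
};
```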
Looking forward to checking this out.
Is it possible that the current branch "develop" is actually worse in its RAM consumption than 0.15 (last release)?
I ran one of the big Euclid datasets and it crashed after using all the memory:
I am running it now on the throttle branch. Let's see.
This is with the throttle branch:
It crashed, too.
This is to follow up on the findings:
I can reproduce this easily on my laptop. Indeed, memory blows way over the limit set by the tile size. However, I have found a neat way of intercepting `malloc` calls and logging anything above 100 MiB. Since the memory sanitizers & co. do not seem to spot leaks, it may be due to big allocations, and not so much about leaks. So the trick is this one:
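(A minimal sketch along those lines, assuming glibc and LD_PRELOAD; take it as an illustration rather than the exact code that was used.)

```cpp
// toy_malloc_spy.cpp: log every malloc above 100 MiB.
//   g++ -shared -fPIC -O2 toy_malloc_spy.cpp -o toy_malloc_spy.so
//   LD_PRELOAD=./toy_malloc_spy.so sourcextractor++ ...
#include <cstddef>
#include <cstdio>
#include <unistd.h>

extern "C" void* __libc_malloc(std::size_t size);  // glibc's real malloc

namespace {
constexpr std::size_t kThreshold = 100ul * 1024ul * 1024ul;  // 100 MiB
}

extern "C" void* malloc(std::size_t size) {
  void* ptr = __libc_malloc(size);
  if (size >= kThreshold) {
    // Keep the logging allocation-free: format into a stack buffer, then write().
    char buf[128];
    const int n = std::snprintf(buf, sizeof(buf), "[spy] malloc(%zu MiB) -> %p\n",
                                size >> 20, ptr);
    if (n > 0) {
      ::write(STDERR_FILENO, buf, static_cast<std::size_t>(n));
    }
  }
  return ptr;
}
```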
I can get useful output such as:
Some findings:
`Lutz` parses in chunks of 150 MiB, since we use `width * chunk_height`. I think up to 300 MiB may be used for a moment, since `Thresholded` allocates, and `Variance` and/or `Image` may have allocated too. Since this is opaque, it is tricky to follow up.
3 is mostly due to the PSF. Oversampling is 6x, so you get 36x the number of pixels. What is worse, you have the original raster plus the padded raster required for the convolution. That's a giga for a single source per frame, which is just insane. Multiply by the number of threads and kaboom.
I can see some room for improvement but, ultimately, for 3 there is little to do, IMHO. We can halve it, maybe divide it by 3, but that will still be up to 100 GiB. The tile manager can't help it.
@mkuemmel, @marcschefer