Looking in more detail:
213 MiB for the raster + 225 MiB for the buffer used to convolve + 450 MiB for the intermediate buffer (of complex numbers) + 450 MiB for the kernel transformation (also, complex numbers) + 225 MiB for the re-centered kernel + 225 MiB for the padded kernel.
That's 1.7 GiB peak 😞 Some of it is transient, but I think 1.3 GiB is steady during the fit. Again, per frame.
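For bookkeeping, a tiny sketch of how these numbers add up, assuming 4-byte floats for the real buffers and 8-byte single-precision complex values for the two frequency-domain ones (the MiB figures are just the ones quoted above):

```cpp
// Back-of-the-envelope accounting for the per-frame peak quoted above.
#include <cstdio>

int main() {
  const double raster      = 213;  // MiB, raster of the source
  const double conv_buffer = 225;  // MiB, buffer used to convolve
  const double interm_cplx = 450;  // MiB, intermediate complex buffer (2x a float buffer)
  const double kernel_cplx = 450;  // MiB, transformed kernel, also complex
  const double recentered  = 225;  // MiB, re-centered kernel
  const double padded      = 225;  // MiB, padded kernel

  const double peak = raster + conv_buffer + interm_cplx + kernel_cplx + recentered + padded;
  std::printf("peak per frame: %.0f MiB (~%.1f GiB)\n", peak, peak / 1024.0);
  return 0;
}
```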
Some possibilities:
Any other ideas? I don't think with the current approach we can go below 0.7 GiB, assuming we can get rid of the redundant working areas.
First of all, it's great that there is now a tool to track the RAM consumption.
What I don't understand is the following:
I think that the 190k image has a few more big objects.
It would also depend on the order in which they arrive, but it seems the 190k image could keep all threads busy with big sources, while the 4k image may have only one of the cores doing that.
Might also be a red herring.
Just an idea: could the possible explanation for the delayed flushes to the output (see https://github.com/astrorama/SourceXtractorPlusPlus/issues/362#issuecomment-859655502) also have an impact here? If memory can only be released when the results are flushed, and the flush does not happen because of a massive object earlier in the sequence, then the RAM could accumulate to high numbers.
The test run on the full 19x19k image with Sersic fitting needed two days for all objects except one. The last object took another two days, with ~20k results not flushed and still occupying 140 GB of RAM.
After finishing Sersic I tried running Disk+Bulge on the entire 19x19k image. It dumped after a couple of hours; 350 GB of RAM is just not enough :-)
I have made some tests, and it seems each object consumes roughly 11 KiB while not flushed. I am probably missing some bytes, though. Being conservative, 20k sources should "only" take 400 MiB of memory.
On one hand, the fact that Sersic-only consumed less than half makes me think that the problem lies in the model fitting. On the other hand, if only one source is being fitted, the memory should not stay at 140 GiB!
Either we leak a lot of small objects, or a handful of very big ones, or there is heap fragmentation (but that bad?).
Have we confirmed this happens regardless of whether we use GSL or levmar?
I have not run with GSL while monitoring the RAM consumption, especially not for a large image / many objects, where this comes into effect.
I'll do that tomorrow at the latest. It's a bit tricky: when running on the 19x19k image and the RAM is used up, you get warning emails, and some other users are concerned that this might tear down other services (which it is not doing).
I checked and did a Disk+Bulge fitting on an 8x8k subimage with both levmar and GSL minimization. I monitored the RAM consumption and it was about the same, reaching up to 55 GB (there are ~9k objects).
Thanks, I think this is useful information, as we can rule out levmar being the cause of the problem!
I do not think it is a leak either. I have compiled with tcmalloc and run with their leak checker.
I mean, there are some leaks, but as far as I can tell, they relate to the Python initialization, and maybe one or two minor ones in the wcs part. But that would leak a handful of bytes per image at most, and would not pile up during the runtime.
Of course, this only means there are no dangling pointers, or frees we forgot to do (that's what smart pointers are there to do, after all!). This does not mean that image chunks or whatever exist for longer than they should.
(Although I have my doubts too, I added a couple of asserts making sure that a released tile is not kept alive by someone else, and it does not seem to be the case)
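For reference, the kind of assert I mean is just a uniqueness check on the shared_ptr at the moment the cache drops a tile; roughly like this (class and member names are made up, not the actual tile manager code):

```cpp
// Illustrative only: assert that nobody outside the cache still holds a
// reference to a tile when the tile manager evicts it.
#include <cassert>
#include <list>
#include <memory>

struct Tile { /* pixel buffer, coordinates, ... */ };

class ToyTileManager {
public:
  void evictOldest() {
    if (m_cache.empty()) return;
    std::shared_ptr<Tile> tile = std::move(m_cache.front());
    m_cache.pop_front();
    // If this fires, someone else keeps the tile alive and the memory is not
    // actually returned when we think it is.
    assert(tile.use_count() == 1);
  }

private:
  std::list<std::shared_ptr<Tile>> m_cache;
};
```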
Most/all of the memory is managed by smart pointers. Assuming that their associated memory is indeed freed when no reference exists anymore, a logical reason for the RAM consumption is that references exist longer than we think and longer than necessary. I made a run yesterday monitoring:
Plotting the # of objects in memory (measured objects - objects in output table) vs. RAM gives this:
While the correlation is less strong than I hoped, there is a correlation. This means that making sure objects are flushed when intended (see #362) should at least help.
Concerning the correlation: a strong correlation is expected only if the RAM consumption per object is about the same. But this is clearly not the case; in fact, the large and difficult-to-analyze objects are the reason for x-axis values >300. Even when the results are flushed to the output, some values remain in order to construct the model- and residual-checkimages (which were not constructed in this run). Maybe these classes still 'hide' some pointers to large chunks of RAM?
Nice plot, thanks! Another explanation for the correlation is that fitting big cutouts takes a lot of memory and time, blocking the sources that came before them.
In any case, this is a good motivation to allow immediate flushing; at least it should confirm and/or rule out hypotheses. I will add it, it should be easy to do.
By the way, I tried with the pyston branch and I have the same issue, so I think this confirms the leak on the Python side comes from the initialization.
So I ran the dataset yesterday with the new unsorted option, and here is the result: as @ayllon had suspected, this is not really a breakthrough. The RAM consumption is lower, but a large portion of it is still not really accounted for. The similarity of the pattern when plotted against the # of measured objects is quite striking. Some other tidbits: the source ID of the flushed objects was usually ~11-13 objects ahead. This means that most of the 16 threads work on objects that need a long time to finish. This is already the case at < 2000 objects, so there is not really an argument for why more than ~20 GB are being used. Also, the cutout was selected against large objects, which are stars in this case.
It seems to be indeed the convolution. Almost 12 GiB out of 17.5 GiB come from there. I attach a graph with the (live) allocations after 2 days running with the tcmalloc profiler on (it is super slow compiled like that! Only 3.5% of the measurements done).
I am checking on the in-place transform, that should reduce some memory (sketch below). Other things to look at:
1. the convolution context
2. `dlevmar_dif`
Pretty sure 2 is very much sub-optimal even if the memory does not leak.
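For illustration, this is roughly what the in-place transform buys: one padded buffer serves as both the real input and the complex output, instead of keeping a separate complex scratch array. A minimal sketch assuming single-precision FFTW (the actual code in the branch may well differ):

```cpp
#include <cstddef>
#include <fftw3.h>
#include <vector>

int main() {
  const int nx = 512, ny = 512;            // padded convolution size (example values)
  const int ny_padded = 2 * (ny / 2 + 1);  // in-place r2c needs padded rows

  // One buffer is both the real input and, reinterpreted, the complex output.
  std::vector<float> buf(static_cast<std::size_t>(nx) * ny_padded, 0.0f);
  auto* cplx = reinterpret_cast<fftwf_complex*>(buf.data());

  fftwf_plan fwd = fftwf_plan_dft_r2c_2d(nx, ny, buf.data(), cplx, FFTW_ESTIMATE);
  fftwf_plan bwd = fftwf_plan_dft_c2r_2d(nx, ny, cplx, buf.data(), FFTW_ESTIMATE);

  // ... fill buf with the padded image, execute fwd, multiply by the
  // transformed kernel in place, execute bwd, normalize by nx*ny ...
  fftwf_execute(fwd);
  fftwf_execute(bwd);

  fftwf_destroy_plan(fwd);
  fftwf_destroy_plan(bwd);
  return 0;
}
```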
I did a run on my test dataset using the branch feature/inplace_dft: https://github.com/astrorama/SourceXtractorPlusPlus/pull/367
A comparison of the RAM consumption: that branch indeed needs less RAM! The "unsorted_flush" option is not available in this branch; it would further decrease the RAM consumption, as the comparison of the blue and green curves shows. According to the figure, "inplace_dft" is also faster, but it might be that the machine was a bit busier when the blue and green curves were computed.
Nice! Unfortunately I do not have a machine where I can risk exhausting that much memory without the sysadmins knocking at my door, so thanks for verifying it 😀
That branch makes it possible to run the entire 19x19k image, which will need 5-6 times that RAM, on the SDC machine. Right now that does not make sense since the results on the subimage are not really good :-( , a discussion we are having in a different context.
Just to give some more input and information, I ran yesterday the dataset fitting Disk+Sersic models to the data. The RAM curve is:
The red line is new, the blue is there mostly for reference and corresponds to the red line above in https://github.com/astrorama/SourceXtractorPlusPlus/issues/361#issuecomment-866644910
Not sure what to interpret into that.
Disk+Bulge has two more free parameters per frame, doesn't it? It has a multiplicative effect on the memory used by the fitting.
On top of that, there is a convolution context per model, so if you have two models (i.e. disk + bulge) instead of one (just Sersic), the memory used is doubled. This is due to the design of the model fitting, where there is one PSF per model, not per object. There should be some room for reduction, sharing the "work area" per source, or even per frame.
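To make the "shared work area" idea concrete, a hypothetical sketch (all names are made up; it is not how the model fitting is structured today):

```cpp
// Instead of each convolved model owning its own convolution buffers, models
// of the same source (or frame) borrow them from a shared context, so N
// models cost one set of buffers instead of N.
#include <memory>
#include <vector>

struct ConvolutionWorkArea {
  std::vector<float> padded_image;
  std::vector<float> padded_kernel;
  std::vector<float> scratch;  // complex scratch, stored as interleaved floats
};

class ConvolvedModel {
public:
  explicit ConvolvedModel(std::shared_ptr<ConvolutionWorkArea> work_area)
      : m_work_area(std::move(work_area)) {}

  void raster() {
    // ... would run the convolution using m_work_area buffers instead of
    // allocating its own ...
  }

private:
  std::shared_ptr<ConvolutionWorkArea> m_work_area;
};
```

Sharing per source is safe as long as the models of one source are rastered sequentially; sharing per frame would additionally need one work area per worker thread.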
Disk+Sersic (red line above) has one parameter more than Disk+Bulge (blue line above).
In that sense the # of models is the same and the # of parameters is one more.
There is one further difference: in Disk+Bulge I asked for model- and residual-checkimages, which may increase RAM consumption and processing time.
Here the figure for the RAM consumption during a Disk+Sersic run on the entire 19x19k image:
Note that the run broke 50% through due to https://github.com/astrorama/SourceXtractorPlusPlus/issues/369 .
Here a comparison of the RAM consumption for the Disk+Sersic fitting and the Disk+Bulge fraction fitting as a function of time, and here as a function of the fraction of processed objects (which means that about the same objects are being processed). The diagrams end where the runs break.
To me there seems to be a difference, in the sense that with Disk+Bulge no significant amount of RAM gets released, while for Disk+Sersic the line oscillates around some kind of average.
The last thing I have tried is to get the malloc internal stats using `malloc_info`. By the time sourcextractor is processing the last, long-lasting, source, the RSS is around 25.4 GiB.
I am having trouble finding out what the output of `malloc_info` actually means but, as far as I can tell, almost 10 GiB is actually "free", but not returned to the system.
What seems to match the memory reported by `htop` is actually the sum of `mmap` (around 9 GiB) and `total` (around 17 GiB). Both add up to the total RSS.
So, in summary, RSS is 25.4 GiB, of which 10 are actually free, but not given back to the system. Of the ~15.4 GiB in use, we can expect 6 GiB to belong to the tile manager. That leaves us with around 9 GiB to be accounted for.
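For anyone who wants to poke at the same thing: the stats come from glibc's `malloc_info`, and `malloc_trim` can be used to check how much of that "free but not returned" memory the allocator is willing to hand back. A minimal, glibc-specific snippet:

```cpp
#include <malloc.h>  // malloc_info, malloc_trim (glibc)
#include <cstdio>

void dump_malloc_state(const char* tag) {
  std::printf("== malloc_info (%s) ==\n", tag);
  malloc_info(0, stdout);  // XML report: arenas, free blocks, mmapped bytes
}

int main() {
  dump_malloc_state("before trim");
  const int released = malloc_trim(0);  // returns 1 if memory was given back to the kernel
  std::printf("malloc_trim released memory: %s\n", released ? "yes" : "no");
  dump_malloc_state("after trim");
  return 0;
}
```

If `malloc_trim` gives a good chunk back, the 10 GiB would just be the allocator caching free (possibly fragmented) chunks rather than live data.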
The last source is, super-sampled, 8553*8553 pixels for the raster. For the convolution, with padding and all that, 290 MiB, times three (kernel, image and buffer): 870 MiB. Plus 185 MiB for the levmar workspace. 1 GiB, give or take.
I don't know where the last 8 GiB have gone.
After several days running with tcmalloc and dumping plots of the leaked data, it seems we leak just a tiny bit, but it comes from `dlevmar_dif -> dlevmar_trans_mat_mat_mult -> dgemm_ -> mkl_blas_dgemm`.
The leak is currently less than 1 MiB, though, and that is at 2% of objects measured, so I do not think it can pile up much.
Isn't it possible even a small leak could contribute to memory fragmentation problems?
I guess it could contribute to the 10 GiB "free" but not returned to the system, true.
After 24 more hours, the leak report hasn't changed, so it may be an initialization issue or something like that. RSS hasn't moved since yesterday. If anything, it may have shrunk by a few MiB.
RSS is 18.1G at 2.67% measured. I am going to cancel this run because it takes forever with the heap profiler. It has been running for 5 days already.
The Disk+Bulge run has now spent over 24h on the last object alone. I'll let it run, but I don't expect anything good from it. We have had cases where it just broke after working for 2 days on the last object.
Here RAM consumption vs. time: since it is still running (but the log is not updated), the blue line would continue for >24h at 84 GB, which is about the last y-value. It would be nice to know whether the blue curve comes gently down (as the red one does) when finishing; let's see.
Here RAM consumption vs. the fraction of processed sources: The blue line does not go to 1.0 due to the one missing source.
Here the fraction of processed sources vs. time: Disk+Bulge is consistently slower than Disk+Sersic.
We can discuss more on that in the next telecon tomorrow.
Here the data for the plots above, so everyone can make plots: db_full_run.txt ds_full_run.txt
The plan is also to make a Sersic fitting on the entire field. Since 24% of the available RAM is currently bound to the still running Disk+Bulge fitting, I'll wait another day and maybe decide to stop it if it does not yet finish.
Here a comparison of the # of iterations in Disk+Bulge (x-axis) and Disk+Sersic (y-axis) after cross-matching the two catalogs:
I just checked: if I take the 10% brightest sources, the plot does not change much. The point cloud is less dense, but the correlation is not higher.
Here now the RAM plots including the Sersic fitting (green). Note that:
- also here I had to interrupt, since the last source was running for six hours, so the fraction on the x-axis does not go all the way to 1.0 for blue and green;
- the galaxy photometry is not really good.
The Sersic fitting is quicker and uses less RAM, which is to be expected. In all runs 10GB are bound to the tile-memory.
Please let me know your thoughts!
Here the data for the Sersic fitting: ser_full_run.txt
The shape of the three curves, when fraction of sources are on x, are suspiciously similar, aren't they?
The curves definitely have some similarities. But isn't that expected? In the detection everything is identical, which means that the sequence of objects sent to measurement is the same. Assuming the RAM consumption is triggered by some big objects, they count against the RAM at about the same time.
These are my results using 4 threads and tcmalloc.
Curiously, it does finish; I do not have to kill it.
I just started with 16, let's see how the memory usage changes.
I think you do not have the full dataset but only the 4x4k or 8x8k and the corresponding single images. Maybe we should try to find a way to transfer everything...
While I get them, here is tcmalloc, 16 threads
Not so surprisingly, the memory consumption is higher. I find it interesting, though, that the RSS curve is smoother than in @mkuemmel's runs. What is bad is that it stays high while processing the last source. Same as with glibc's malloc.
I have to admit that what is labelled "RAM" in my figures is the value VSZ from 'pidstat'. I googled at some point and could not figure out what's more appropriate.
Also I found that VSZ and RSS correlate rather well:
I can change to RSS in the future...
Now that I have the full dataset, I can see the memory consumption rapidly growing to 49 GiB. More than 15 GiB are used in 247 memory-mapped regions of 65028 KB each. Not sure if that K is 1000 or 1024 bytes, but that is only enough for a square float image of about 4080x4080 pixels. Suspiciously close to the size of a tile, but not quite (4096*4096*4, right?). That's weird.
In summary, there are almost 37 GiB in mapped regions smaller than the tile size, and "only" around 9.3 GiB in regions big enough to hold a tile (or several) :confused:
32.8 GiB are spent on regions that are almost, but not quite, as big as a tile.
I guess I will have to run with another tile size and see if the correlation stays the same. Also, catching mmaps of precisely 65028 KB may help find out what's going on.
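Spelling out the arithmetic behind "suspiciously close, but not quite" (assuming 4 bytes per pixel and that the 65028 above is in KiB):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  const long tile_bytes   = 4096L * 4096L * 4L;  // = 65536 KiB per tile
  const long region_bytes = 65028L * 1024L;      // size of the mapped regions
  std::printf("tile:   %ld KiB\n", tile_bytes / 1024);
  std::printf("region: %ld KiB, enough for a ~%.0f x %.0f float image\n",
              region_bytes / 1024,
              std::sqrt(region_bytes / 4.0), std::sqrt(region_bytes / 4.0));
  return 0;
}
```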
chap is an amazing tool! I am using it to inspect a core dump, and so far I can see:
- `Pattern %ContainerPythonObject has 2853321 instances taking 0x8090d508(2,156,975,368) bytes`: 2 GiB seem to be in use by Python containers (??)
- `Pattern %VectorBody has 2285781 instances taking 0xb3ab22040(48,229,392,448) bytes`: 45 GiB (!!) are in use by vectors. To be expected, really, but good to confirm.
These are not classic leaks. Someone is pointing at those.
Since it is Friday 16:00 and I am about to leave for vacation, let me summarize some findings:
The cumulative memory consumption vs the container size looks like this:
There is a jump at precisely the tile size, very roughly matching the tile memory limit. That's good and to be expected.
To the right we have a few allocations consuming a lot of memory. To the left we have many, many small allocations that also consume a lot of memory.
I could somehow understand some of the big allocations (model fitting rastering, residual, levmar working area, etc.), but the myriad of relatively small ones accounting for almost half of the total consumption is weird.
I am trying to understand where they come from, but I have trouble understanding the output. I am betting on `PixelCoordinates` for the smallest population (makes sense: every detected object holds the coordinates for every detected pixel). They are not problematic in terms of total size (ballpark < 100 MiB).
Anyhow, one interesting thing reported by chap is the corruption of several malloc fast bins. That can't be good by any measure.
I have re-run enabling the malloc self-test (`-lmcheck`) and I get a kaboom 💥 pretty soon, pretty much after the background has been computed:
memory clobbered before allocated block
[snip]
#4 0x00007f443e0bb2fe in mabort (status=<optimized out>) at mcheck.c:360
#5 0x00007f443e0bb3cb in checkhdr (hdr=hdr@entry=0x55eee02af2f0) at mcheck.c:111
#6 0x00007f443e0bb7cd in checkhdr (hdr=0x55eee02af2f0) at mcheck.c:184
#7 freehook (ptr=0x55eee02af320,
caller=0x7f443e4e7e80 <std::vector<std::type_index, std::allocator<std::type_index> >::_M_realloc_insert<std::type_index&>(__gnu_cxx::__normal_iterator<std::type_index*, std::vector<std::type_index, std::allocator<std::type_index> > >, std::type_index&)+350>) at mcheck.c:184
#8 0x00007f443e4e7e80 in __gnu_cxx::new_allocator<std::type_index>::deallocate (__p=0x55eee02af320, this=0x7ffca0e631a0)
at /home/astro/aalvarez/miniconda3/envs/build/x86_64-conda-linux-gnu/include/c++/9.3.0/ext/new_allocator.h:119
#9 std::allocator_traits<std::allocator<std::type_index> >::deallocate (__n=<optimized out>, __p=0x55eee02af320, __a=...)
at /home/astro/aalvarez/miniconda3/envs/build/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/alloc_traits.h:470
#10 std::_Vector_base<std::type_index, std::allocator<std::type_index> >::_M_deallocate (__n=<optimized out>, __p=0x55eee02af320, this=0x7ffca0e631a0)
at /home/astro/aalvarez/miniconda3/envs/build/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/stl_vector.h:351
#11 std::vector<std::type_index, std::allocator<std::type_index> >::_M_realloc_insert<std::type_index&> (this=this@entry=0x7ffca0e631a0, __position=...)
at /home/astro/aalvarez/miniconda3/envs/build/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/vector.tcc:500
#12 0x00007f443e4e7ee7 in std::vector<std::type_index, std::allocator<std::type_index> >::emplace_back<std::type_index&> (this=this@entry=0x7ffca0e631a0)
at /home/astro/aalvarez/miniconda3/envs/build/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/stl_iterator.h:806
#13 0x00007f443e4eb0c1 in SourceXtractor::OutputRegistry::getSourceToRowConverter (this=0x55eee012d1b0, enabled_properties=...)
at /home/astro/aalvarez/Work/Projects/SourceXtractorPlusPlus/SourceXtractorPlusPlus/SEFramework/src/lib/Output/OutputRegistry.cpp:45
#14 0x00007f443e86abf2 in SourceXtractor::MeasurementFactory::getMeasurement (this=0x55eee0133920)
at /home/astro/aalvarez/miniconda3/envs/build/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:1309
#15 0x000055eededaad04 in SEMain::mainMethod (this=0x55eee0133720, args=...)
That's where the memory corruption has been detected, not necessarily where it happened. At least it is early, so maybe compiling and running with AddressSanitizer will help identify the source.
Please note that there is some support in Elements for building with the -fsanitize options of the gcc compiler (it just came back to my mind right now): https://github.com/astrorama/Elements/blob/35ee9d3a91dda3454a9b6dd0167a7449f56eddb1/cmake/ElementsBuildFlags.cmake#L274.
SANITIZE_OPTIONS activates it and SANITIZE_STYLE allows passing the needed parameter to the gcc -fsanitize option.
Good and not so good news.
The good news is that I have identified a property that consumes a lot of memory (stamps). A tentative fix, making the property lazy (so the cutout is generated when needed, not kept), confirms this:
The disadvantage is that it is a tad slower (which isn't that surprising!), but it is saving around 15 GiB.
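Conceptually the change is just turning the stored cutout into something re-extracted on demand; a rough sketch with made-up names (not the actual property classes):

```cpp
#include <memory>
#include <vector>

struct PixelRect { int x, y, width, height; };

class ImageInterface {  // stand-in for the tile-backed image access layer
public:
  virtual ~ImageInterface() = default;
  virtual std::vector<float> cutout(const PixelRect& rect) const = 0;
};

class LazyStamp {
public:
  LazyStamp(std::shared_ptr<const ImageInterface> image, PixelRect rect)
      : m_image(std::move(image)), m_rect(rect) {}

  // Pixels are re-extracted from the (tile-cached) image on demand, so a
  // source waiting to be flushed does not pin a full cutout in memory.
  std::vector<float> pixels() const { return m_image->cutout(m_rect); }

private:
  std::shared_ptr<const ImageInterface> m_image;
  PixelRect m_rect;
};
```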
I do not expect it to have an impact when the image fits in memory, though. My impression is that the segmentation moves the cache upwards, but sources keep piling up for the cleaning and measurements, so they will need to access tiles that were already evicted. This causes contention.
We can discuss other potential fixes, but at least I think this confirms where we have to look to save that much memory.
P.S. IMHO 64 MiB tiles are probably a bit too big. Reducing their size would likely help with the contention. I am also running on NFS, so I will pay dearly for any random-ish I/O.
I have good and bad news. When compiling for release, lazy stamps is as performant as develop, and consumes half the memory.
The bad news is that regardless of the branch, I get an unhelpful `WCS exception: One or more of the world coordinates were invalid` exactly at the same point (4884 / 501499 measured). But that's unrelated.
By the way, if the issue is the piling up of sources, I've got an idea, which is to add priorities to the thread pool. As long as there are sources to measure, they will take priority. That should allow getting rid of sources faster than queuing them.
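Something like a two-level queue would already express that priority; a bare-bones sketch (not the actual thread pool):

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

// Workers always drain the high-priority (measurement) queue first, so
// finished sources leave the pipeline faster than new ones are queued.
class TwoLevelQueue {
public:
  void push(std::function<void()> task, bool high_priority) {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      (high_priority ? m_high : m_low).push_back(std::move(task));
    }
    m_cv.notify_one();
  }

  std::function<void()> pop() {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cv.wait(lock, [&] { return !m_high.empty() || !m_low.empty(); });
    auto& queue = m_high.empty() ? m_low : m_high;
    auto task = std::move(queue.front());
    queue.pop_front();
    return task;
  }

private:
  std::mutex m_mutex;
  std::condition_variable m_cv;
  std::deque<std::function<void()>> m_high, m_low;
};
```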
I am running the mmap patch compiled for release in the meanwhile. Let's see how that one does.
mmap saves peanuts at the regime we are in now.
> By the way, if the issue is the piling up of sources, I've got an idea, which is to add priorities to the thread pool. As long as there are sources to measure, they will take priority. That should allow getting rid of sources faster than queuing them.
Looking good running over the challenge data.
Saves 20% of memory (peak to peak), way more than #390 for this dataset.
Comparing for problem #380 is trickier, since #391 makes the program abort. I can not compare halfway, I need to run the whole process.
The last step would be to combine both changesets and see what happens and decide which is better.
Ah! It seems I have finally found the bugger. The branch `fix/throttle` keeps the memory footprint hovering at the value you would, more or less, expect given the limit to the tile memory.
So the issue arises when some parts of the pipeline are faster than others. For instance, if the segmentation moves too fast, sources will pile up on the queue of the deblending. Each source has associated data (i.e., list of pixels) that can add up to a lot when there are many sources.
The priorities seemed to fix that in some cases because they put a cap on that part, but for #380 it wouldn't matter. It turns out the deblending output queue is the one piling up sources, which is made worse because cutouts and the like are added as properties. The memory blows up. And more cores make things worse, as sources just pile up faster on that queue.
The new branch uses a semaphore (see astrorama/Alexandria#69) to avoid accumulation. The disadvantage is that it becomes somewhat slower, of course (queueing and synchronization costs).
However, I disabled `FlexibleModelFitting`, so the measurement threads don't have a lot to do, and the overhead is more noticeable.
Interestingly, it is faster for the challenge.
Note that this benchmark keeps sources in memory until they are done, following detection order.
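To make the throttling idea concrete: it is conceptually just a counting semaphore bounding how many sources may be alive between two pipeline stages. A rough C++20 equivalent (the actual implementation lives in astrorama/Alexandria#69 and differs in the details):

```cpp
#include <cstddef>
#include <semaphore>

// A producer stage has to acquire a slot before emitting a source; the slot
// is released only when the consumer stage is done with it. This bounds the
// number of sources (and their attached pixel data) in flight.
class SourceThrottle {
public:
  explicit SourceThrottle(std::ptrdiff_t max_in_flight) : m_slots(max_in_flight) {}

  void beforeEmit()   { m_slots.acquire(); }  // e.g. called by the deblending stage
  void afterConsume() { m_slots.release(); }  // e.g. called when the source is flushed

private:
  std::counting_semaphore<> m_slots;
};
```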
Looking forward to checking this out.
Is it possible that the current branch "develop" is actually worse in its RAM consumption than 0.15 (last release)?
I ran one of the big Euclid datasets and it crashed after using all the memory:
I am running it now on the throttle branch. Let's see.
This is with the throttle branch:
It crashed, too.
This is to follow up on the findings:
I can reproduce this easily on my laptop. Indeed, memory blows way over the limit set by the tile size. However, I have found a neat way of intercepting `malloc` calls and logging anything above 100 MiB. Since the memory sanitizers & co. do not seem to spot leaks, it may be due to big allocations, and not so much about leaks. So the trick is this one:
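(A minimal sketch along those lines, assuming glibc and LD_PRELOAD; take it as an illustration rather than the exact code that was used.)

```cpp
// toy_malloc_spy.cpp: log every malloc above 100 MiB.
//   g++ -shared -fPIC -O2 toy_malloc_spy.cpp -o toy_malloc_spy.so
//   LD_PRELOAD=./toy_malloc_spy.so sourcextractor++ ...
#include <cstddef>
#include <cstdio>
#include <unistd.h>

extern "C" void* __libc_malloc(std::size_t size);  // glibc's real malloc

namespace {
constexpr std::size_t kThreshold = 100ul * 1024ul * 1024ul;  // 100 MiB
}

extern "C" void* malloc(std::size_t size) {
  void* ptr = __libc_malloc(size);
  if (size >= kThreshold) {
    // Keep the logging allocation-free: format into a stack buffer, then write().
    char buf[128];
    const int n = std::snprintf(buf, sizeof(buf), "[spy] malloc(%zu MiB) -> %p\n",
                                size >> 20, ptr);
    if (n > 0) {
      ::write(STDERR_FILENO, buf, static_cast<std::size_t>(n));
    }
  }
  return ptr;
}
```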
I can get useful output such as:
Some findings:
`Lutz` parses in chunks of 150 MiB, since we use `width * chunk_height`. I think up to 300 MiB may be used for a moment, since `Thresholded` allocates, and `Variance` and/or `Image` may have allocated too. Since this is opaque, it is tricky to follow up.
3 is mostly due to the PSF. Oversampling is 6x, so you get 36x the number of pixels. What is worse, you have the original raster plus the padded raster required for the convolution. That's a giga for a single source per frame, which is just insane. Multiply by the number of threads and kaboom.
I can see some room for improvement but, ultimately, for 3 there is little to do, IMHO. We can halve it, maybe divide it by 3, but that will still be up to 100 GiB. The tile manager can't help it.
@mkuemmel, @marcschefer