fermi-lat / Likelihood


Memory use and delete_local_fixed #90

Open jballet opened 4 years ago

jballet commented 4 years ago

The likelihood code has always used a lot of memory and this tends to get worse as we increase the data volume and number of sources. For a rather small RoI in 4FGL-DR2, the 15 components of the model maps add up to about 7 MB. With a model containing 266 point sources and 5 diffuse sources, the memory use (measured by the maximum resident set size of /usr/bin/time, or by top) is RSS = 1.58 GB. For larger RoIs it goes up to 4 or 5 GB. In order to reduce memory use, Eric has added an option delete_local_fixed=True in BinnedAnalysis. That option is supposed to release the large memory chunk used by the source models of individual fixed point sources (which do not vary during the run) after they have been summed into a "fixed sources" block.

The problem is that this option does absolutely nothing (the behavior is exactly the same when it is omitted). By comparison, the same call keeping only the 9 free sources + 1 fixed source (to mimic what I expect the "fixed sources" block should do) has RSS = 330 MB. We could provide the test harness, but it is not very convenient because it takes about 25 minutes to build the source models.

dagorym commented 4 years ago

Jean, go ahead and provide the test harness. Even if it takes a long time, it will be good to use as reference occasionally as I work through the problem.

jballet commented 4 years ago

I have worked some more to provide a faster test harness. It now runs in 3 minutes, by duplicating the same component 20 times (of course the result is meaningless, but it uses the same memory as 20 independent components) and precomputing the models for all point sources. It actually uses more memory (3 GB) than the real dataset described above. Get the test harness (100 MB) there (it will stay up for one week). The livetime cube is available at SLAC (/nfs/farm/g/glast/g/catalog/P8_P305/ltcube_10years_zmax105.fits). test_memory.sh calls test_memory.py twice, without then with the delete_local_fixed option.

jballet commented 4 years ago

Somehow the link did not work. The test harness is at ftp://ftp.cea.fr/incoming/y2k01/jbbglast/test_memory.tgz

dagorym commented 4 years ago

Sorry, I didn't see this until just now so I missed the 1 week window. Can you repost the test harness so I can download it? Thanks.

jballet commented 4 years ago

Done

dagorym commented 4 years ago

Jean, due to other issues and then family vacation, I missed downloading the test harness once again. Could you please put it up once more. This is my main focus now so hopefully third time's the charm and I won't miss it this time. Sorry about that.

jballet commented 4 years ago

Done. I have checked that it still uses as much memory in FT 1.3.5. One gets the IRF warning "Newer IRF version available. Updating IRF selection to: P8R3_SOURCE_V3", but this does not prevent the test harness from running and does not affect memory use.

dagorym commented 4 years ago

Okay, I've been poking at this for a while and here's what I've found so far. First, when I run the test_memory.py script, it uses about 2.8 GB of RAM and, as Jean pointed out, this doesn't change whether the delete_local_fixed flag is set or not.

When running the program, there are basically two parts: building the model and doing the fit. After looping through and adding the 20 components, the program has only used 575 MB. The rest of the total memory usage (~2.3 GB) comes from the .fit() function call. Those 575 MB break down as follows:

Some of that could probably be used more efficiently and cleaned up, but it's not too bad. Let's look at the remaining 2.3 GB of memory usage.

The issue comes when the .fit() function is called. In the Python part of the code, this calls the ._errors() method of the SummedLikelihood class, which in turn calls the minimize() function of the selected optimizer, Minuit in this case. (Note: I tried switching to NewMinuit and saw exactly the same behavior described below.)

Down under all the C++ code in the chosen optimizer is a call to SummedLikelihood::value(). This iterates over all the components (20 BinnedLikelihood objects in this case) and calls their value() method. The BinnedLikelihood::value() method calls computeModelMap_internal(), which in turn calls the fixedModelUpdated() method. That method performs a number of checks (which I don't completely understand yet) that could cause it to return early, but it makes it past these and gets to a point where it iterates over each source in the component (in the supplied sample data, there are 84 sources in each model) and, if the spectrum is fixed (which it is for most of them), gets a pointer to the SourceMap object from the source map cache by calling SourceMapCache::getSourceMap(), passing in the Source object whose map we want (by reference).

In getSourceMap(), it checks whether the map is already loaded. Sometimes it isn't, and the map gets created; that accounts for 166.6 MB of the 2.3 GB, leaving us with 2.1 GB to figure out. If the map does exist, the SourceMap object's setSource() method is called, passing in the Source object we are looking at (again by reference).

In setSource(), it first checks whether the address of the passed-in Source object matches that of the stored Source object. It does. It then checks whether the m_model and m_sparseModel vectors both have a size of zero, and apparently they do. It then checks whether there is a filename stored and, since there is, calls the readModel() method with that filename. readModel() clears out a bunch of data vectors (including m_model) and then calls the readImage() method with the filename.

readImage() calls FileUtils::read_fits_image_to_float_vector(), passing in several parameters (including the m_model vector to be filled, which is passed by reference). This method uses tip to create an image, in this case a FitsTypedImage object, and then calls its get() method, passing in the m_model vector to be filled with the data. The get() method creates a temporary vector big enough to hold all the data in the map image, loads the data, copies the data into the m_model vector, deletes the temporary vector, and returns.

That is the chain that is using up the 2.1 GB of memory in the program. It seems to be the creation of these source models for every source in every component. In this case, we have 20 components and 84 sources, for 1680 maps. 2.1 GB / 20 components / 84 sources is only about 1.34 MB per map. These are floats, so each data point takes 4 bytes. That means there are ~335.5 k pixels, which corresponds to a square image ~580 pixels across. That seems reasonable if it's a 60 degree diameter ROI with 0.1 degree pixels (I haven't looked at the data to see if that's the case).
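As a sanity check, here is the same arithmetic written out as a tiny standalone program (the numbers are the ones quoted above; this is just a back-of-the-envelope calculation, not anything from the Likelihood code):

    #include <cmath>
    #include <cstdio>

    // Back-of-the-envelope check of the per-map numbers quoted above.
    int main() {
      const double totalBytes  = 2.1 * 1024 * 1024 * 1024;              // ~2.1 GB attributed to the maps
      const int    nComponents = 20;
      const int    nSources    = 84;
      const double bytesPerMap = totalBytes / (nComponents * nSources); // ~1.34 MB per map
      const double nPixels     = bytesPerMap / sizeof(float);           // 4-byte floats -> ~335.5 k pixels
      const double side        = std::sqrt(nPixels);                    // ~580 pixels on a side if square
      std::printf("%.2f MB per map, %.0f pixels, ~%.0f pixels across\n",
                  bytesPerMap / 1e6, nPixels, side);
      return 0;
    }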

So the memory usage seems reasonable given what is happening. The question is whether something different should be happening. If so, it seems like it might take a major code rework. As far as I can tell (so far, I'm still looking), the delete_local_fixed parameter doesn't seem to come into play here. Anyway, I'm going to keep looking into this but thought an update was in order.

jballet commented 4 years ago

Thank you very much, Tom, for this very detailed investigation. Do I understand correctly from what you wrote that I/O operations (in the readModel call) occur repeatedly during the fit? This is not at all what I expected, and it is very worrying. I thought every I/O operation should occur only once. The models for the point sources are not read from a file; they are computed on the fly (but again, they should be computed only once). The models for the diffuse sources are read from files (the source maps). Note that when I say "only once" it may well be inside the fit, but only at the first call. Indeed the model maps should be a big part of the data structures, but they do not appear in what you identify as the initialization, which is dominated by the livetime cubes (this is normal).

dagorym commented 4 years ago

So I've dug into this a bit more; using a different profiling tool that looks at individual function calls across the entire run turns up some more information. It looks like that readModel() call is happening twice for each component/source combination. There are 20x84 = 1680 such combinations and that method is being called 3240 times (so not quite 2x, but I believe the code only calls it for the fixed sources the second time). Anyway, looking at the call numbers for the various functions in the description above, I'm seeing the following:

So it seems that we are getting two sets of these source maps created: a first set for all the sources, and a second set for only the fixed sources. I'd be tempted to attribute this to the delete_local_fixed parameter (and it still might be somehow related to it), but I've been running these with that parameter set to false, i.e. not enabled, at least not explicitly. Unfortunately, these tools only give aggregate call counts, not ordered by execution time, so I still need to dig around to see exactly where these calls are happening, but based on the memory graph (below), it seems to happen right as the fit starts.

[Image: test_memory-time (memory usage vs. time)]

This image shows the memory usage as a function of time. The slow ramp-up is the creation of the 20 BinnedLikelihood components. The big blue section at the end is memory consumed by the source maps, and the slope change is right at the start of the fit() call. So it reads everything in and then spends some time doing the fit (the flat portion at the end) before cleaning everything up.

The real question seems to be why it is loading things up twice. It has apparently already loaded the fixed sources, so why is it loading them again? And where are these objects being maintained? If we can figure that out, it should cut the memory use nearly in half, at least.

Anyway, I'll keep poking into this.

dagorym commented 4 years ago

I should add that a basic memory leak check shows ~114 MB of possibly leaked memory (~160 kB definitely lost, ~18.5 MB indirectly lost, and ~96 MB possibly lost). That should get looked into, but it is only second order compared to the big memory use above. The point being that the big memory use above is not due to a memory leak, as it cleans itself up at the end (hard to see on the graph, but the final memory snapshot holds only 158 MB, down from the 2.8 GB peak).

jballet commented 4 years ago

Those double calls should indeed be investigated, but this was not my original concern. The principle of the delete_local_fixed option (as I understood it) is that each fixed source should be summed into the "fixed sources" block and immediately forgotten (the associated memory released right away), so that the model cubes of all the individual sources are never all in memory at once. I think it is important that you try delete_local_fixed = True to see what changes.

dagorym commented 4 years ago

Actually, I have run it a few times with the delete_local_fixed option set to true. There is no difference in the execution. The code that removes the models is in the SourceMap::clear_model() method, which is invoked (via a conditional) in the BinnedLikelihood::addFixedSource() method. While the addFixedSource() method is invoked 1560 times, clear_model() is never called, regardless of the setting of delete_local_fixed.

The conditional in question is:

    if ( !srcMap->save_model() ) {
      srcMap->clear_model( m_config.delete_local_fixed() );
    }

So apparently save_model() is returning true for some reason and the deletion code is not being called. I'll have to run that down next week.

dagorym commented 4 years ago

So it turns out the function was being called after all (as determined by adding a print statement). But since the clear_model() method is declared completely in the SourceMap header, it was being inlined by the compiler, which is why I wasn't seeing a function call. With that out of the way, here's what I've found.

1) Setting the delete_local_fixed parameter doesn't have any effect because the clear_model() method is called regardless of the parameter's value. The default value of save_model() (in the conditional above) is false, so the method gets called no matter what, and the other parameters checked inside clear_model() also have defaults that trigger the clean-up code. So really, as currently written, this parameter doesn't do anything: the default behavior is to clear the maps out. To get the source maps to be saved, you'd have to set delete_local_fixed to false AND save_all_srcmaps to true; otherwise it will delete the maps (see the sketch just after this list).

2) That said, the clear_model() code is working. If I completely disable the call to this method, you get a memory graph that looks like this: [Image: test_memory-no_clear]

As you can see, the big blue area now grows and grows from the very beginning as the models are added, instead of remaining small until the fit starts. So clear_model() is conserving memory in the source construction part of the code as intended, at least there.

3) The total memory usage is still the same with the clear_model() call removed. So despite the fixed models being removed during source construction, the fitting process seems to want them all to exist, so it goes and reloads them all back up. This actually makes the run longer, as it has to redo the map creation. Which means that if we really want to reduce the memory allocation, we need to figure out why this is happening, whether it actually needs to happen, and if not, how to check properly so these extra maps are not recreated.
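To make point 1 concrete, here is a condensed sketch of the behavior as observed, in terms of the two configuration flags named above; it is an illustration of the defaults, not the actual Likelihood code:

    // Illustration of the defaults described in point 1 (not the actual code).
    // Returns true if the local source map of a fixed source ends up being deleted.
    bool modelGetsCleared(bool save_all_srcmaps, bool delete_local_fixed) {
      // The map survives only with save_all_srcmaps == true AND delete_local_fixed == false;
      // with the default settings (both false), the clean-up code is always triggered,
      // so toggling delete_local_fixed on its own changes nothing.
      return !(save_all_srcmaps && !delete_local_fixed);
    }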

jballet commented 4 years ago

Excellent diagnosis! Looks like you have identified where the problem lies. It may be that the check at the fit level does not realize that everything is ready once the "fixed sources" block has been prepared.

dagorym commented 4 years ago

I think I've found the culprit and a possible solution. The problem seems to be down in the SourceMap::setSource() method. This method first checks whether the Source object passed in is the one already stored; if not (and it isn't), it then checks whether the m_model and m_sparseModel data structures are empty. If they are, it then either loads the data from file or creates the model. This is where the memory is being created/consumed.

However, these are the exact data structures we just emptied in the clear_model() method, so we know we don't need them. The fix is to key the conditional not just off the size of those data structures, but also off the m_save_model flag. This is the flag we checked (along with delete_local_fixed) to do the deletion in the first place, so if m_save_model is false we shouldn't need to reload the model data. If I simply add a check on that flag to the if() statement, I get the following memory profile: [Image: test_memory-reduced] Now the peak memory usage is only 740.5 MB, only ~1/4 of the 2.8 GB from before. The orange section right above the red section is the memory that was the blue section in the other posts (I can't control the colors) and is only 166 MB instead of 2.1 GB. Comparing the output of the two different runs gives exactly the same results.
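A minimal sketch of what that change amounts to, using the member names mentioned above (plus a hypothetical m_srcMapFile for the stored filename); it mirrors the idea rather than reproducing the actual setSource() code:

    // Sketch of the modified reload check in SourceMap::setSource() (illustrative only).
    void SourceMap::setSource(const Source & src) {
      // ... existing checks on the stored Source pointer ...
      const bool modelMissing = m_model.empty() && m_sparseModel.empty();
      // Before the fix: reload whenever both model vectors are empty.
      // After the fix:  also require m_save_model, so maps deliberately dropped by
      //                 clear_model() (m_save_model == false) are not read back in.
      if (modelMissing && m_save_model) {
        if (!m_srcMapFile.empty()) {
          readModel(m_srcMapFile);   // re-read the map from the source-map file
        } else {
          // otherwise the map would be recomputed here
        }
      }
    }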

I need to check this against some of the other tests to make sure those results don't change either, but I think this solves the major memory issue. I'll also take a look at the memory leaks, but that will be a different issue (that I'll open). Once I've checked the other tests that I have, I'll build a test release (probably 2.1.0) for people to try out.

jballet commented 4 years ago

Congratulations! This is certainly the way it should be from the memory use point of view. If the result is the same and it doesn't fail anywhere, looks good!

eacharles commented 4 years ago

Excellent work Tom!

dagorym commented 4 years ago

Thanks. I've checked it with a few of the other test scripts Jean has provided for some of the other issues and those all pass as well. I'm declaring this one done. Just verifying that it builds in the pipeline and I'll push out a test version if people would like that.

dagorym commented 4 years ago

Pull request #100 merges in the fix.

dagorym commented 4 years ago

Okay, there are now test builds with this fix included for people to try out. They are in the dev channel (you need to include -c fermi/label/dev in your conda command) and are version 2.0.2 for Python 3 and 1.4.2 for Python 2.7.

dagorym commented 3 years ago

Turns out the fix isn't that simple. Using the patched version results in segmentation faults in other analyses. Eric posted the attached script, which dies with the new version and works with the original. I have confirmed that it is the change I made that causes the segfault, as backing out the change removes the error. More investigation will be needed. It looks like more detailed state checking will be required, and maybe some code restructuring as well, in order to pass in the additional parameters needed to fully evaluate what has to be done.

[Attachment: tt.py.txt]

dagorym commented 3 years ago

Jean, the 1.4.2 build of the tools is Python 2.7 and has the memory fix I implemented, which, unfortunately, doesn't cover all cases. However, if you're not seeing failures like the ones Eric reported with that script, you can definitely use it. You might try running several ROIs with it to see if there are any problems in your use case.

I think the current conflict is coming from the use of SummedLikelihood rather than a plain BinnedLikelihood analysis. In the former case the fix seems to work fine, but in the latter it never generates the maps to begin with. The problem is that the function should really be doing different things for 3-4 different parameter settings, but the parameter information isn't available inside the function to make the distinction.

As Eric has mentioned, the memory management is a bit of a pain in the likelihood code. I've been looking at it more, but I had to work on other bugs that actually crash the software, so I haven't had time to figure out how to straighten all this out. It will probably take a rewrite of some of the code to pass the configuration parameters through several layers of classes to get the needed information to the SourceMap class. It's definitely not going to be a simple solution; or at least, the 3-4 simple solutions I've tried haven't worked.

jballet commented 3 years ago

I have tried this today on a moderately large sample and simple cases (a few parameter freezes but no model change). Among 129 calls, 6 died without any error message (they just went away). This never happened before (the jobs sometimes died on an out-of-memory error, but at least there was an error message), so it appears to confirm Eric's assessment that the fix does not work. It may be interesting to note that none of them died on the first fit. Because the log is buffered I do not know exactly what they were doing when they died, but it is very possible that they were refitting after freezing one parameter. I had a look at memory use (via top while a few jobs were running). At first it looked like they were using less memory than before (< 1 GB), but by the end they had risen to > 4 GB, more or less as before. I am wondering whether the individual Npred calls after initialization could be the reason. Apart from this, there is no call to individual fixed sources in our script. Npred is a single constant scalar and could easily be stored so that it is still available after initialization.
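A minimal sketch of that idea (the cache class, its name, and its methods are hypothetical, not existing Likelihood API): keep the scalar Npred of each fixed source once it has been computed, so it can be returned later without the source map.

    #include <map>
    #include <stdexcept>
    #include <string>

    // Hypothetical per-component cache of the Npred of each fixed source, filled once
    // at initialization so the value survives after the source map has been released.
    class FixedNpredCache {
    public:
      void store(const std::string & srcName, double npred) {
        m_npreds[srcName] = npred;          // one number per fixed source
      }
      double get(const std::string & srcName) const {
        auto it = m_npreds.find(srcName);
        if (it == m_npreds.end())
          throw std::runtime_error("No cached Npred for " + srcName);
        return it->second;                  // no source map needed at this point
      }
    private:
      std::map<std::string, double> m_npreds;
    };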

jballet commented 3 years ago

Any news about this? I have started the DR3 run, and I am limited by memory again (some jobs go up to 10 GB!). This will only get worse as we accumulate more sources. It would be so much easier if binned likelihood did not pile up all that useless memory... I also note that the memory failure often happens at the very end, when the fit is finished. I see that we are extracting flux values for all sources at this point (not just for the free sources). Could that be the reason? Of course the flux does not depend on the source map, but who knows. We don't really need the flux of fixed sources; we could just skip that. On the other hand, we also call Npred in order to write it in the output file. This is very useful. Npred is just one number per component, so I thought it was stored even for the fixed sources. It would be good if it were, but if it is not, we could probably store it ourselves. This is of course not a general solution (other users will not naturally store it).

eacharles commented 3 years ago

I’m still trying to catch up on a lot of other things. I’ll try and look at this in the next few days.

-e


dagorym commented 3 years ago

I've been looking into this off and on over the last few weeks as I have dug into the code, and thought I should write up what I've found.

First, for testing, I implemented a variation on my original fix that, instead of just keying off the m_save_model flag, added a new variable that is only set to true if the code has previously called the clear_model() method to throw away the unneeded model. The hope was that if the code had explicitly called clear_model(), it knew it didn't need to reload the model and would work properly, while if we were somehow getting to the crash point before clear_model() had ever been called, it would still load the model up properly. Unfortunately this did not solve the problem.

So let's look at a few cases. First, the original sample script Jean gave me to test this. If I run it with the original version of the code, it works fine but uses ~3 GB of memory. If I run it with the fix in place, it still works just fine but only uses 930 MB. This script loads up the same model 20 times to push up the memory usage; with just a single instance of everything, the memory usage is 245 MB and 134 MB respectively. So the fix is doing its job in this case.

In the test script Eric supplied (linked in an earlier comment), it works fine with the original version of the code but crashes with the fix, using about 116 MB of memory in both cases.

The next thing I did was just run the basic gtlike command on both data sets, again with and without the fix. With Jean's data set, gtlike crashes (in both cases) with an error that says: "Caught N10optimizers11OutOfBoundsE at the top level: Attempt to set the value outside of existing bounds.", although again I see the memory savings, with the original version using 234 MB and the version with the fix using only 124 MB.

If I run gtlike on Eric's data set, it works both with and without the fix each using about 1 GB of memory.

So the crash that Eric is reporting only seems to happen with the particular call chain in his script, so I've been digging through the code specifically looking at that. The code has seven steps:

1) create a keyword dictionary and then create a BinnedObs object with that dictionary as the parameters
2) create a BinnedConfig object
3) create a BinnedLikelihood object (with the objects from above as input)
4) initialize the output streams
5) read the source model XML file
6) call the BinnedLikelihood countsSpectrum() method
7) call the BinnedLikelihood buildFixedModelWts() method

The failure is happening in step 7, but step 5 is really where it begins. When the XML file is read, for the fixed sources only, the code loads up the model maps, copies them into the internal fixed-source model, and then, since the defaults are to not save the individual model maps, calls clear_model() and deletes the data for those sources, freeing up the extra memory. With the fix in place, this sets the flag indicating that the model has been cleared and shouldn't be needed any more.

The problem comes in the call to buildFixedModelWts(). When this is called, it goes through and gets the maps for all the sources, both fixed and free. With the fix in place, the maps for the fixed sources are empty (as instructed, to save memory) and the maps for the free sources are loaded the first time each map is accessed (getSourceMap() is called many times during a fit). The buildFixedModelWts() method, as part of its processing, calls the addFixedSource() method again for all the fixed sources (which is what was already done while reading the XML in step 5). But this time, because the source maps are empty vectors, the program crashes as it tries to read beyond the end of the (empty) vector.

If I simply comment out the line in buildFixedModelWts() that calls the addFixedSource() method, Eric's script no longer crashes. Of course, I don't know whether it did the right thing, since there is no output in that script, but that may be a solution. In the end, though, it seems that the problem is that buildFixedModelWts() is trying to reload the data instead of using the already computed fixed model map.
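One possible direction along those lines, purely illustrative (the guard and its name are hypothetical, and a real fix would have to verify that the fixed-source sum built in step 5 really is reused):

    // Hypothetical guard in buildFixedModelWts(): skip re-adding fixed sources whose
    // contribution is already in the accumulated fixed model (illustration, not a patch).
    void BinnedLikelihood::buildFixedModelWts() {
      for (const std::string & srcName : m_fixedSources) {
        if (fixedSourceAlreadyAccumulated(srcName)) {
          // The source was folded into the fixed-source sum when the XML was read and
          // its local map was then cleared, so don't rebuild or re-read the map here.
          continue;
        }
        addFixedSource(srcName);
      }
      // ... rest of the fixed-model weight computation ...
    }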

That said, Jean, you mentioned in a previous comment that some of your jobs were still crashing even with the fix. Could you send me a test set that is still crashing so I can see how those crashes are happening and how they relate?

eacharles commented 3 years ago

Hi Jean, Tom.

I just wanted to check where we were with this. Apologies that it has taken me so long to get to it, but I can chip in now if that is useful.

-e


jballet commented 3 years ago

On my side, it may take me a while to find a suitable test case that crashes with FT 1.4.2. As I noted above, this happens only in a small fraction of my jobs, and each job is complex, so simplifying one into something usable by you while still reproducing the crash is not easy. For now my priority is the production of 4FGL-DR3.

dagorym commented 3 years ago

If you have time to look into it Eric, that would be great. I've posted pretty much everything I've found looking at it so far. And since I'm only part time on Fermi now (I started a full-time, tenure-track position in the CS department at BYU this week), I may not have much time to work on it going forward. Which means that, to Jean's comment, there is no real rush on the sample test case. For my part, get to it when you can.

jballet commented 3 years ago

I have now moved to FermiTools 2.0.18, which uses just as much memory. I could understand that you did not want to invest too much time into fixing the old version, but I think someone should look into that for the new one. This is a nuisance to everyone using a relatively complex source model.

jballet commented 2 years ago

Any news on this memory issue, which is the limiting factor on my PC farm? At some point Tom requested an example that failed with the fix he introduced. If this is still deemed useful, could you include the fix in a beta version based on Python 3 (Fermitools 2.x)?

dagorym commented 2 years ago

Jean, an example case would be helpful. I'm in the process of compiling a set of tests that currently work (albeit using a lot of memory) and that exercise the likelihood code in a number of different ways, which I can use to verify that all the paths through the code are tested when we make changes. If you have a scenario that you know works with the high-memory-use version but fails with the fix, that would be helpful as well.

The only existing version with the fix in place for Python 3 is fermitools-2.0.2 in the dev channel (add -c fermi/label/dev to the conda command to download and install). If you need one newer than 2.0.18, that will take some time, as I'm currently swamped getting the five courses I'm teaching this semester going, but I should have some time in the coming weeks as things start to settle down.

jballet commented 2 years ago

I have launched a set of jobs using fermitools-2.0.2, but I don't see any reduction in memory (nor any failure). It looks like that version does not contain your fix, unless you updated it under the same name after we installed it (19 March 2021).

dagorym commented 2 years ago

Okay. It's entirely possible I did revert and rebuild over the fix. I'll build a new version.

jballet commented 2 years ago

Hi Tom, Is a new beta version with your fix available for testing?

dagorym commented 2 years ago

Not yet. We're having issues with the build system. As soon as that is resolved, I'll get a version created.


dagorym commented 2 years ago

Now that the build pipeline issues are resolved (and I've learned the new build system that Alex implemented), and I'm not swamped with teaching five classes, I've built a version of the Fermitools that has the memory fix. It is version 2.1.29 on the dev channel.

This is simply reinstating what I did nearly two years ago, so I know there is at least one use case where it fails (although I might have a fix for that as well). Jean, test it out with your use cases, and if you get failures, let me know and send me a failing test case (or two) so I can try to see what is happening. It doesn't matter if they take a while to run or use a lot of memory; I'll typically be running them overnight and have a machine with 32 GB of RAM that I'm testing on.

jballet commented 2 years ago

Thanks, Tom. On my side I am quite busy these days. I fear it will have to wait until the 2nd half of June. But I do have ideas for test cases. Cheers, Jean


dagorym commented 2 years ago

Jean, use version 2.1.39. I just checked in a fix that solves the issue with the test script that Eric provided and also passes several other tests I have. There might still be some edge cases but it should be more robust. Let me know what you find.

jballet commented 2 years ago

Thanks, Tom.

I had no time to work on the previous version but David Landriu had run the basic test harness that we use to check that developments of the Catalog pipeline do not break other things, and it broke.

We will see whether 2.1.39 gets further.

If it does, I will try it on full-scale Catalog data.

Cheers,

Jean



jballet commented 2 years ago

Hi Tom, David Landriu managed to break version 2.1.39. It core dumps when we simply freeze one parameter. I put a test harness on ftp://ftp.cea.fr/incoming/y2k01/jbbglast/Memory_tHb1.tgz The script is pyLhAnalysis_4.py. It core dumps after writing "Freeze isotropic and refit" and the initializing lines of Minuit. If the delete_local_fixed option is unset, then it passes.

dagorym commented 2 years ago

I found some other things that crash it as well. I'll download this and start looking at it. Thanks.


dagorym commented 2 years ago

Jean, is that file still there? I tried to download it but it keeps failing. I also discovered that neither Firefox, Chrome, nor Edge will open an ftp link, so I had to use a command-line client. :-(

jballet commented 2 years ago

Yes, it is still there (for one week at least) but it is true that it is becoming more and more difficult to retrieve a file from an ordinary FTP site. I put it at another, supposedly more modern, place https://filesender.renater.fr/?s=download&token=521ace6c-4f02-481d-bace-ba1b582f436f

dagorym commented 2 years ago

Okay. I have a version of the tools that works almost perfectly. It passes every test Jean has given me (including the last one), passes Eric's test, and passes the basic likelihood tests. It also uses only 739 MB instead of 2.8 GB of memory in the big memory test that kicked this all off. The only thing it doesn't pass is the fermipy test test_gtanalysis_lightcurve, which I'm tracking down now. If you want to play with it and try to find other cases that fail for me to fix, it is currently fermitools version 2.2.2 in the dev channel. As a bonus, it also seems to run about 20% faster on the big memory test.

jballet commented 2 years ago

Hi Tom, are you sure that this new version went into 2.2.2? David saw no difference and I confirmed that the version he installed still core dumps on the test harness I sent you. He installed it with

    mamba create --name fermi-2.2.2 -c conda-forge -c fermi -c fermi/label/dev python=3.9 fermitools=2.2.2 --override-channels
    mamba install -c conda-forge -c fermi healpy gammapy sympy
    pip install fermipy

dagorym commented 2 years ago

I wasn't testing the tools with fermipy, I was just running the script you sent me directly and it works just fine.

Trying the build in fermipy (using pytest), I see that it does segfault. Looking at the stack trace in the debugger, for some reason it seems to be skipping over some of the function calls, including the ones that have the code that fixes the problem; i.e., it should call BinnedLikelihood::addFixedSource() -> BinnedLikelihood::addSourceCounts() -> FitUtils::addsourceCounts(), but the stack trace only shows BinnedLikelihood::addFixedSource() -> FitUtils::addsourceCounts(). And the BinnedLikelihood::addSourceCounts() method has some of the corrective code in it.

The crash I'm seeing is definitely indicative of the models not being loaded, which is exactly the situation the code I added is supposed to handle. I'm not sure if this is a compiler optimization issue in the way the tools are built, something inherent to fermipy (which I can't do anything about) and the way it invokes the libraries, or something completely different. I believe it is a combination of things, but it has to be somehow related to the way fermipy is using the libraries, since it works just fine when calling the Python scripts directly. I can run my locally built version of the tools in fermipy without the segfault, but for one of the 81 fermipy tests I get the wrong answer. The conda-built version segfaults, as you have seen.

I'm trying to track this down but it may be a while as I'm in the process of temporarily relocating to Tucson, AZ until the end of the year with my wife and our youngest two kids. She is going on sabbatical and we head out on the 22nd of July.

dagorym commented 2 years ago

Turns out it was missing from 2.2.2. I was missing a step in our new build system and so wasn't updating the top level package to pick up the changes I had made. That has been fixed. Version 2.2.7 should work and have the fix incorporated.

This version still fails that single fermipy test, and I need to track that down to figure out what is happening, but it should have the memory usage solved for most cases.

dagorym commented 2 years ago

I think I've finally solved this. The failure in the fermipy tests had two parts. One was that I was not properly initializing some of the maps when they were reloaded. Upon fixing that, I started getting segmentation faults in a different test.

It turns out that was due to a memory corruption issue. The code was designed to store references to a single BinnedLikeConfig object, probably to save a small amount of memory and copying overhead. However, the Python interface through fermipy was deleting the objects that the C++ code held references to, instead of keeping them around, and the C++ code was then trying to access memory that was no longer valid. It often worked, but sometimes the memory would get corrupted. The code has been changed to make a copy of the config object instead of relying on the calling code to keep it valid.
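A minimal sketch of that change, assuming the class and member names as described (simplified, not the exact constructor):

    // Own a copy of the configuration instead of holding a reference, so a Python-side
    // deletion of the original object can no longer leave the C++ side dangling.
    class BinnedLikelihood {
    public:
      explicit BinnedLikelihood(const BinnedLikeConfig & config)
        : m_config(config) {}          // copy on construction

    private:
      // Before: const BinnedLikeConfig & m_config;   // lifetime controlled by the caller
      BinnedLikeConfig m_config;       // after: always valid for the object's lifetime
    };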

Everything is working and the code is properly using less memory (~950 MB vs 2.8 GB in the big memory test). This fixed version of the code is in version 2.2.12 on the dev channel if anyone wants to test it out. If I don't hear of any issues, I'll submit a pull request to merge it into the main branch in a week or two.