fermi-lat / Fermitools-conda

Conda recipe files for the Fermi Sciencetools software analysis package: Fermitools
https://fermi.gsfc.nasa.gov/ssc/data/analysis/
BSD 3-Clause "New" or "Revised" License
34 stars 17 forks source link

gtsrcmap fails with floating point exception #59

Closed me-manu closed 3 years ago

me-manu commented 4 years ago

I'm trying to generate a weekly light curve for the Crab Nebula, so I'm running a standard analysis for each week separately. For 4 time bins I get a strange error when the srcmap is computed. Within fermipy I get (I already included a handling of the exception when the setup fails and I check for the number of events in the ft1 file. It's small but should still be ok):

2020-02-14 00:49:12 INFOGTBinnedAnalysis.setup(): Finished setup for component 00
2020-02-14 00:49:12 INFOGTBinnedAnalysis._create_binned_analysis(): Creating BinnedAnalysis for component 00.
lc_binned.py: 247 --- ERROR: Caught Runtime/Index Error while initializing analysis object
lc_binned.py: 248 --- ERROR: Printing error:
lc_binned.py: 249 --- ERROR: Cannot read keyword "NDSKEYS" in file "/u/gl/mmeyer/projects/VarFSRQ/tmp/mmeyer.hjw0oV/srcmap_00.fits" (CFITSIO ERROR 202: keyword not found in header)
/u/gl/mmeyer/projects/python/fermiAnalysis/fermiAnalysis/scripts/lc_binned.py:250: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
  if e.message.find("File not found") >= 0 and e.message.find('srcmap') >= 0:
lc_binned.py: 253 --- INFO: Checking if there are events in ft1 file
lc_binned.py: 263 --- INFO: The ft1 file contains 61 event(s)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/u/gl/mmeyer/projects/python/fermiAnalysis/fermiAnalysis/scripts/lc_binned.py in <module>()
    426 
    427 if __name__ == '__main__':
--> 428     gta = main()

/u/gl/mmeyer/projects/python/fermiAnalysis/fermiAnalysis/scripts/lc_binned.py in main()
    243     logging.info('Running fermipy setup')
    244     try:
--> 245         gta.setup()
    246     except (RuntimeError,IndexError) as e:
    247         logging.error('Caught Runtime/Index Error while initializing analysis object')

/nfs/farm/g/glast/u/mmeyer/fermipy/fermipy/gtanalysis.pyc in setup(self, init_sources, overwrite, **kwargs)
   1061 
   1062         # Create likelihood
-> 1063         self._create_likelihood()
   1064 
   1065         # Determine tmin, tmax

/nfs/farm/g/glast/u/mmeyer/fermipy/fermipy/gtanalysis.pyc in _create_likelihood(self, srcmdl)
   1086         self._like = SummedLikelihood()
   1087         for c in self.components:
-> 1088             c._create_binned_analysis(srcmdl)
   1089             self._like.addComponent(c.like)
   1090 

/nfs/farm/g/glast/u/mmeyer/fermipy/fermipy/gtanalysis.pyc in _create_binned_analysis(self, xmlfile, **kwargs)
   5327         self.logger.debug(kw)
   5328 
-> 5329         self._obs = ba.BinnedObs(**utils.unicode_to_str(kw))
   5330 
   5331 

/nfs/farm/g/glast/u/mmeyer/miniconda-4.7/miniconda2/envs/fermi-dev-122/lib/python2.7/site-packages/fermitools/BinnedAnalysis.pyc in __init__(self, srcMaps, expCube, binnedExpMap, irfs, phased_expmap)
     69         # EAC switch to using AppHelpers...
     70         self.countsMap = pyLike.AppHelpers_readCountsMap(srcMaps)
---> 71         self._createObservation(srcMaps, expCube, self.irfs)
     72     def _createObservation(self, srcMaps, expCube, irfs):
     73         self._respFuncs = pyLike.ResponseFunctions()

/nfs/farm/g/glast/u/mmeyer/miniconda-4.7/miniconda2/envs/fermi-dev-122/lib/python2.7/site-packages/fermitools/BinnedAnalysis.pyc in _createObservation(self, srcMaps, expCube, irfs)
     72     def _createObservation(self, srcMaps, expCube, irfs):
     73         self._respFuncs = pyLike.ResponseFunctions()
---> 74         self._respFuncs.load_with_event_types(irfs, "", self.srcMaps, "")
     75         self._expMap = pyLike.ExposureMap()
     76         self._scData = pyLike.ScData()

/nfs/farm/g/glast/u/mmeyer/miniconda-4.7/miniconda2/envs/fermi-dev-122/lib/python2.7/site-packages/fermitools/pyLikelihood.pyc in load_with_event_types(self, respFuncs, respBase, filename, extname)
   8369     def load_with_event_types(self, respFuncs, respBase, filename, extname):
   8370         """load_with_event_types(ResponseFunctions self, std::string const & respFuncs, std::string const & respBase, std::string const & filename, std::string const & extname)"""
-> 8371         return _pyLikelihood.ResponseFunctions_load_with_event_types(self, respFuncs, respBase, filename, extname)
   8372 
   8373 ResponseFunctions_swigregister = _pyLikelihood.ResponseFunctions_swigregister

RuntimeError: Cannot read keyword "NDSKEYS" in file "/u/gl/mmeyer/projects/VarFSRQ/tmp/mmeyer.hjw0oV/srcmap_00.fits" (CFITSIO ERROR 202: keyword not found in header)

So then, I tried to run gtsrcmaps from the command line. I get the following:

(fermi-dev-122) mmeyer@ki-ls10:~/projects/VarFSRQ/tmp/mmeyer.hjw0oV$ gtsrcmaps  

WARNING: version mismatch between CFITSIO header (v3.43) and linked library (v3.41).

Exposure hypercube file[/u/gl/mmeyer/projects/VarFSRQ/tmp/mmeyer.hjw0oV/ltcube_00.fits] 
Counts map file[/u/gl/mmeyer/projects/VarFSRQ/tmp/mmeyer.hjw0oV/ccube_00.fits] 
Source model file[/u/gl/mmeyer/projects/VarFSRQ/tmp/mmeyer.hjw0oV/srcmdl_00.xml] 
Binned exposure map[/u/gl/mmeyer/projects/VarFSRQ/tmp/mmeyer.hjw0oV/bexpmap_00.fits] 
Source maps output file[/u/gl/mmeyer/projects/VarFSRQ/tmp/mmeyer.hjw0oV/srcmap_00.fits] 
Response functions[P8R3_SOURCE_V2] 
Generating SourceMap for 4FGL J0449.6+3027 9....................!
Floating point exception

I'm working on the SLAC batch farm with the developer's version of the fermitools 1.2.2.

I also checked the mission time line, https://fermi.gsfc.nasa.gov/ssc/observations/timeline/posting/, but did not find anything weird in mission week 520 which my time bin corresponds to.

I'm also attaching my par files for the analysis so that you can try to recreate the error. The necessary ft1 and xml files are all at SLAC, I can point you to the directories if you like. par_files.zip

donhorner commented 4 years ago

The first thing you can try is to update to the latest version of the tools we released this week, 1.2.23. Unfortunately, I don't think any of the changes in that release would have fixed this type of issue, but it's worth a shot.

We have another open issue with a core dump by gtrsrcmaps that we never resolved. I wonder if they have the same cause. Please point us to the location of the files, so we can test with builds that have extra debug information turned on.

me-manu commented 4 years ago

Thanks for the answer. I just tried with 1.2.23 without success. The input ft1 file is located here: /nfs/slac/g/ki/ki20/cta/mmeyer/projects/VarFSRQs/Output/Crab/Crab-Sync-All/100MeV-1GeV/weekly/t001/ft1_00.fits The spacecraft file is here: /nfs/slac/g/ki/ki20/cta/mmeyer/projects/VarFSRQs/Output/Crab/ft2/30s.fits And the xml file I'm using resides here: /nfs/slac/g/ki/ki20/cta/mmeyer/projects/VarFSRQs/Output/Crab/Crab-Sync-All/100MeV-1GeV/weekly/t001/avgspec_00.xml Everything else you need should be in the par files I posted earlier. If not please let me know!

donhorner commented 4 years ago

I grabbed your files and ran gtsrcmaps on the command line like you did above. It ran to completion on my mac and a local RHEL6 system I tried. Which OS are you using on the batch farm, CentOS7?

me-manu commented 4 years ago

Sorry for my slow response. I log into the ki-ls machines, which are RHEL6, I think. I submit my batch jobs to RHEL6 machines only. Did you use the same time window as in my par file? For most light curve bins the analysis works and only for some it fails.

sarabuson commented 3 years ago

Hi, I have run into this issue many times as well. When computing e.g. a few hundreds lightcurves, 1% of the time-bins crash and most problems are related to this srcmap issue. I made several tests. I run with ki-ls machines, both command line and submitting jobs to other hosts. I tried also e.g. rhel6-64f command line and still get "Floating point exception".

I checked carefully one such time bins. The ft1 file looks good, I computed an exposure map (gtexpmap) and it looks good. I tried to run the analysis with the unbinned method and it worked successfully, completing the likelihood fit. So there is valuable scientific information that is missed. This is really problematic.

dagorym commented 3 years ago

@sarabuson Do you have a set of test files I could use to recreated this issue?

This may be the same problem as this issue: https://github.com/fermi-lat/Likelihood/issues/83 but would like to verify. The files referenced by @me-manu are not longer present on the SLAC computers for me to grab.

But given the error me-manu posted from fermipy, it may be that the source map file is not being created properly as that fits error is very specific.

dagorym commented 3 years ago

I'm going to mark this one as resolved unless it comes up again with the new version of gtsrcmaps and someone can provide sample files that reproduce the issue.

sarabuson commented 3 years ago

Hi Tom, this is not solved. It comes up any time I run light curves.. I have been very busy with other work but will provide you with the files and more info shortly (I have many of them..).

dagorym commented 3 years ago

Okay, once you send me some files, I'll track this one down. Thanks.

sarabuson commented 3 years ago

I've provided Tom with a (not-)working example and post here few additional useful info just for future reference.

What I found after five full days of investigation:

  1. If I analyze this same dataset with unbinned likelihoood, it all is fine. So the issue is with binned likelihood, and with the gtsrcmap tool itself. (now on referring to binned likelihood:)
  2. the problem are a couple of sources included in the xml that have no exposure, when the gtsrcmap gets to one of them (e.g. 4FGL J0503.5-1116) it crashes with ” Floating point exception ”
  3. If I remove these sources from the srcmdl_00.xml , the srcmap computation is fine.
  4. Among many many identical tries, one computation of the srcmap went through without crashing. I put in the folder this output file for reference, it is “successfull_srcmap_00.fits”. The output for the problematic source 4FGL J0503.5-1116 is a set of zeros (but double check).
  5. I tried many ways, but I found none to overcome or bypass this issue with fermipy. My conclusion is that the gtsrcmap code has to be able to handle it. Indeed, that would also be the best desirable fix.
dagorym commented 3 years ago

Okay, the issue was a divide by zero (as expected) in a part of the code that was generating a scaling factor by dividing by the integral over the PSF. For sources with no exposure, this PSF is zero, hence the error. I added a guard that checked for the zero valued integral and just set the scaling factor to zero in that case since zero time zero will still be zero, this doesn't affect the analysis but prevents the crash.

Testing between the current version and the fixed version so that they produce the identical source maps on "good" data with no zero exposure sources but the fixed version handles the zero exposure objects gracefully simply providing zero valued source maps.

Built and tagged as version 2.0.5. You can install with the following command for testing: conda create -n fermitest -c conda-forge -c fermi -c fermi/label/dev fermitools=2.0.5

dagorym commented 3 years ago

These newest versions seem to have resolved the issue and Sara has reported that it is working now.