desihub / specter

A toolkit for simulating multi-object spectrographs

legval_numba cache collision? #65

Closed sbailey closed 6 years ago

sbailey commented 6 years ago

When running pixsim_nights_mpi from desisim, I'm getting a traceback that ends with

FileNotFoundError: [Errno 2] No such file or directory: '/global/homes/s/sjbailey/.cache/numba/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/specter/0.8.6/lib/python3.6/site-packages/specter-0.8.6-py3.6.egg/specter/util/util.legval_numba-231.py36m.1.nbc.tmp.43958' -> '/global/homes/s/sjbailey/.cache/numba/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/specter/0.8.6/lib/python3.6/site-packages/specter-0.8.6-py3.6.egg/specter/util/util.legval_numba-231.py36m.1.nbc'

I'm running

srun -N 30 -n 960 -c 2 -C haswell -t 3:00:00 --qos interactive \
    pixsim_nights_mpi --nights 20200515 --cosmics --nodes_per_exp 10 --nexp 39

The MPI communicator gets split into 3 communicators of 10 nodes each, and each of those communicators processes 1 exposure at a time. Those exposure communicators are further split into 10 frame communicators (1 per node) to process one frame at a time.
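For readers unfamiliar with this layout, the split described above can be sketched as plain arithmetic on the global rank. This is only an illustration, not the desisim code: the function name `split_colors` is hypothetical, 32 ranks per node follows from `-N 30 -n 960`, and the sketch assumes ranks are placed on nodes in contiguous blocks.

```python
def split_colors(rank, ranks_per_node=32, nodes_per_exp=10):
    """Return (exposure_color, frame_color) for a global MPI rank.

    Assumes block placement: ranks 0-31 on node 0, 32-63 on node 1, etc.
    """
    node = rank // ranks_per_node          # which of the 30 nodes
    exp_color = node // nodes_per_exp      # which of the 3 exposure groups
    frame_color = node % nodes_per_exp     # which node within that group
    return exp_color, frame_color


# With mpi4py the colors would be passed to Comm.Split, roughly:
#   comm_exp = comm.Split(color=exp_color, key=comm.rank)
#   comm_frame = comm_exp.Split(color=frame_color, key=comm_exp.rank)
for rank in (0, 31, 32, 319, 320, 959):
    print(rank, split_colors(rank))
```

Under these assumptions, every rank imports specter and calls legval_numba at roughly the same time, so up to 960 processes may try to populate the same cache directory at once.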

When I look in that /global/homes/s/sjbailey/.cache/numba/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/specter/0.8.6/lib/python3.6/site-packages/specter-0.8.6-py3.6.egg/specter/util/ directory, I see three legval_numba*.nbc files written within a minute of each other, perhaps one per exposure communicator. It appears that there may be some race condition with creating the .nbc files.
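One way such a race could produce exactly this FileNotFoundError is sketched below. It is a guess at the mechanism, not a reading of the numba source. It assumes the cache is written as "write a temp file, then rename it over the final name", and that two processes end up with the same temp name, for example if the `.tmp.43958` suffix is a PID and two ranks on different nodes share that PID while writing to the same home directory. The helper names are hypothetical.

```python
import os
import tempfile


def write_temp(filepath, data, tmp_suffix):
    """Write data to '<filepath>.tmp.<suffix>' and return the temp name."""
    tmpname = "%s.tmp.%s" % (filepath, tmp_suffix)
    with open(tmpname, "wb") as f:
        f.write(data)
    return tmpname


def demo_collision():
    """Simulate two writers that share one temp name; return the outcome."""
    with tempfile.TemporaryDirectory() as cachedir:
        target = os.path.join(cachedir, "example.nbc")
        # Both writers use the same suffix, so they write the same temp file.
        tmp_a = write_temp(target, b"from writer A", 43958)
        tmp_b = write_temp(target, b"from writer B", 43958)
        os.replace(tmp_a, target)      # writer A renames first and succeeds
        try:
            os.replace(tmp_b, target)  # writer B's temp file is already gone
        except FileNotFoundError as err:
            return type(err).__name__
    return "no error"


print(demo_collision())
```

The second rename fails because the first one already moved the shared temp file, which would match the "tmp -> nbc" form of the error message above.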

@lastephey or @rcthomas, have you seen an MPI+numba caching problem like this before? I see that we use:

@numba.jit(nopython=True,cache=True)
def legval_numba(x, c):
    ...

I'm wondering if cache=True is problematic with MPI.

Full traceback:

Traceback (most recent call last):
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/desisim/0.28.0/bin/pixsim_nights_mpi", line 4, in <module>
    __import__('pkg_resources').run_script('desisim==0.28.0', 'pixsim_nights_mpi')
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/conda/lib/python3.6/site-packages/pkg_resources/__init__.py", line 654, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/conda/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1434, in run_script
    exec(code, namespace, namespace)
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/desisim/0.28.0/lib/python3.6/site-packages/desisim-0.28.0-py3.6.egg/EGG-INFO/scripts/pixsim_nights_mpi", line 20, in <module>
    sys.exit(pixsim_nights.main(args, comm=comm))
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/desisim/0.28.0/lib/python3.6/site-packages/desisim-0.28.0-py3.6.egg/desisim/scripts/pixsim_nights.py", line 206, in main
    ccdshape=None, simpixfile=None, addcosmics=addcosmics, comm=comm_exp)
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/desisim/0.28.0/lib/python3.6/site-packages/desisim-0.28.0-py3.6.egg/desisim/pixsim.py", line 122, in simulate_exposure
    psfs[channel] = desimodel.io.load_psf(channel)
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/desimodel/0.9.6/lib/python3.6/site-packages/desimodel-0.9.6-py3.6.egg/desimodel/io.py", line 51, in load_psf
    _psf[channel] = specter.psf.load_psf(psffile)
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/specter/0.8.6/lib/python3.6/site-packages/specter-0.8.6-py3.6.egg/specter/psf/__init__.py", line 29, in load_psf
    return SpotGridPSF(filename)
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/specter/0.8.6/lib/python3.6/site-packages/specter-0.8.6-py3.6.egg/specter/psf/spotgrid.py", line 31, in __init__
    PSF.__init__(self, filename)
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/specter/0.8.6/lib/python3.6/site-packages/specter-0.8.6-py3.6.egg/specter/psf/psf.py", line 74, in __init__
    self._w = self._y.invert()
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/specter/0.8.6/lib/python3.6/site-packages/specter-0.8.6-py3.6.egg/specter/util/traceset.py", line 69, in invert
    ytmp = self.eval(None, (self._xmin, self._xmax))
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/specter/0.8.6/lib/python3.6/site-packages/specter-0.8.6-py3.6.egg/specter/util/traceset.py", line 58, in eval
    y.append(legval_numba(xx, cc_i))
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/conda/lib/python3.6/site-packages/numba/dispatcher.py", line 360, in _compile_for_args
    raise e
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/conda/lib/python3.6/site-packages/numba/dispatcher.py", line 311, in _compile_for_args
    return self.compile(tuple(argtypes))
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/conda/lib/python3.6/site-packages/numba/dispatcher.py", line 620, in compile
    self._cache.save_overload(sig, cres)
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/conda/lib/python3.6/site-packages/numba/caching.py", line 665, in save_overload
    self._save_overload(sig, data)
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/conda/lib/python3.6/site-packages/numba/caching.py", line 675, in _save_overload
    self._cache_file.save(key, data)
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/conda/lib/python3.6/site-packages/numba/caching.py", line 492, in save
    self._save_data(data_name, data)
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/conda/lib/python3.6/site-packages/numba/caching.py", line 555, in _save_data
    f.write(data)
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/conda/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/conda/lib/python3.6/site-packages/numba/caching.py", line 577, in _open_for_write
    utils.file_replace(tmpname, filepath)
FileNotFoundError: [Errno 2] No such file or directory: '/global/homes/s/sjbailey/.cache/numba/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/specter/0.8.6/lib/python3.6/site-packages/specter-0.8.6-py3.6.egg/specter/util/util.legval_numba-231.py36m.1.nbc.tmp.43958' -> '/global/homes/s/sjbailey/.cache/numba/global/common/software/desi/cori/desiconda/20180709-1.2.6-spec/code/specter/0.8.6/lib/python3.6/site-packages/specter-0.8.6-py3.6.egg/specter/util/util.legval_numba-231.py36m.1.nbc'
rcthomas commented 6 years ago

Perhaps look here

https://github.com/numba/numba/blob/master/numba/caching.py#L114

There are a variety of schemes for caching. I don't see one that fits, but perhaps one could be contributed: a cache locator that lets you set the directory name based on MPI rank, where you'd just set a prefix like /tmp/numba-. Otherwise you have to manage the race condition, and the best answer for that is probably to compile ahead of time (AOT) anyway.
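A rough sketch of the per-rank directory idea, without writing a new cache locator: current Numba documentation lists a `NUMBA_CACHE_DIR` environment variable that overrides the cache location. Whether the Numba version in this thread honors it is an assumption. The helper name `set_rank_cache_dir` is hypothetical, and the rank is read from `SLURM_PROCID`, which srun sets for each task.

```python
import os


def set_rank_cache_dir(prefix="/tmp/numba-", env=os.environ):
    """Point NUMBA_CACHE_DIR at a per-rank directory and return its path.

    Assumes the job was launched with srun, so SLURM_PROCID holds the
    global task rank; falls back to rank 0 otherwise.
    """
    rank = env.get("SLURM_PROCID", "0")
    cache_dir = prefix + str(rank)
    env["NUMBA_CACHE_DIR"] = cache_dir
    return cache_dir


# This must run before numba (or anything that imports it, such as
# specter) is imported, on the assumption that numba reads the variable
# at import time.
print(set_rank_cache_dir(env={"SLURM_PROCID": "7"}))
```

Each rank then compiles and caches on its own, which gives up cache sharing between ranks in exchange for avoiding concurrent writes to one directory.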

lastephey commented 6 years ago

Yes, AOT is probably the way to go. I looked into it at some point and found it was about the same speed as the JIT version. I will work on this.
