desihub / desispec

DESI spectral pipeline

load_zcat directory problems #481

Closed · sbailey closed this issue 6 years ago

sbailey commented 6 years ago

load_zcat at the end of desispec.test.old_integration_test fails because it is trying to load zbest files from an incorrect path:

INFO:datachallenge.py:409:load_zcat: Using zbest file search path: /global/cscratch1/sd/sjbailey/desi/test/redux/dailytest/spectro/redux/dc17a2/spectra-64/*/*/zbest-64-*.fits.
ERROR:datachallenge.py:412:load_zcat: No zbest files found!

The above old_integration_test was run with

export DESI_SPECTRO_REDUX=/global/cscratch1/sd/sjbailey/desi/test/redux
export DESI_SPECTRO_SIM=/global/cscratch1/sd/sjbailey/desi/test/sim
export SPECPROD=dailytest
export PIXPROD=dailytest
python -m desispec.test.old_integration_test

Looking at the load_zcat code, the docstring says that datapath is "Full path to the directory containing zbest files" but then it actually appends quite a bit to that:

    zbestpath = join(datapath, 'spectro', 'redux', run1d, 'spectra-64',
                     '*', '*', 'zbest-64-*.fits')

Note that we do not require $DESI_SPECTRO_REDUX to end in spectro/redux, but this code assumes that. We also don't have any canonical environment variable or definition for what the directory two levels up from $DESI_SPECTRO_REDUX is supposed to be (but that is what datapath is supposed to be as currently written).

I suggest dropping the run1d input and that datapath should default to desispec.io.specprod_root(), i.e. $DESI_SPECTRO_REDUX/$SPECPROD, and zbestpath should be constructed as

    zbestpath = join(datapath, 'spectra-64', '*', '*', 'zbest-64-*.fits')

Bonus points for encoding the 64 as an optional nside input with a default of 64.
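
For concreteness, here is a minimal sketch of that suggestion (the helper name find_zbest_files is illustrative, not the existing load_zcat API):

    from glob import glob
    from os.path import join

    from desispec.io import specprod_root

    def find_zbest_files(datapath=None, nside=64):
        """Hypothetical helper: locate zbest files under a production directory.

        datapath defaults to specprod_root(), i.e.
        $DESI_SPECTRO_REDUX/$SPECPROD, rather than assuming that
        $DESI_SPECTRO_REDUX ends in spectro/redux.
        """
        if datapath is None:
            datapath = specprod_root()
        zbestpath = join(datapath, 'spectra-{0:d}'.format(nside),
                         '*', '*', 'zbest-{0:d}-*.fits'.format(nside))
        return sorted(glob(zbestpath))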

This change will require an update to the minitest notebook as well.

Related: standard pipeline processing (but not yet the integration test) will use desi_zcatalog to pre-concatenate the zbest files into a single $DESI_SPECTRO_REDUX/$SPECPROD/zcatalog-${SPECPROD}.fits file which could be used by load_zcat instead.
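
If load_zcat switches to the zcatalog file, the read itself could be as simple as this sketch (using astropy.table here is my assumption, not necessarily how load_zcat would do it):

    import os
    from os.path import join

    from astropy.table import Table

    def read_zcatalog(specprod_dir):
        """Hypothetical: read the pre-concatenated redshift catalog,
        assumed to live at
        $DESI_SPECTRO_REDUX/$SPECPROD/zcatalog-$SPECPROD.fits.
        """
        specprod = os.environ['SPECPROD']
        zcatfile = join(specprod_dir, 'zcatalog-{0}.fits'.format(specprod))
        return Table.read(zcatfile, format='fits')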

weaverba137 commented 6 years ago

I'm going to have to ask a couple of questions here:

  1. Is it even worth doing this? Why not just do the work and have the integration test load from the zcatalog file? That seems like a much better use of time.
  2. What is the equivalent of desispec.io.specprod_root() for things like the target, exposures, and fiberassign files? Because if I have to change the root directory for loading the redshifts, then I have to assume that I'll have to change the root for all types of files. Can we even assume that one directory tree corresponds to one database (so that I can pass the root path of the directory tree on the command-line), or do I have to specify several possible paths to load everything?
weaverba137 commented 6 years ago

And a third question: exactly how urgent is this? Considering that we are now postponing the proposed 17.12 install set to January, does anyone care if the old integration test doesn't load the redshift database (but otherwise works just fine, I assume) over the holidays? In fact, why not just comment out the part that loads the redshifts, and wait until we can load from the zcatalog file?

sbailey commented 6 years ago

> ... exactly how urgent is this? ...

It's not urgent. I was just filing a ticket for a bug I noticed while testing the 17.12 release, rather than delaying the release to fix it or otherwise working around it. It's harmless for now, but we should update the integration test so that this works as well, since this is what runs in the nightly cronjob tests.

> Is it even worth doing this? Why not just do the work and have the integration test load from the zcatalog file? That seems like a much better use of time.

Indeed, it would be fine to just load from the zcatalog file instead, but the change of what datapath means still applies. Which brings us to...

> What is the equivalent of desispec.io.specprod_root() for things like the target, exposures, and fiberassign files? Because if I have to change the root directory for loading the redshifts, then I have to assume that I'll have to change the root for all types of files. Can we even assume that one directory tree corresponds to one database (so that I can pass the root path of the directory tree on the command-line), or do I have to specify several possible paths to load everything?

Good point. There currently isn't an equivalent of desispec.io.specprod_root() for those; the data challenges group them all under a common basedir, whereas real data could have them more spread out. I think this means that load_zcat needs several input directories/files instead of a single datapath.
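
Something along these lines, where every keyword name is hypothetical:

    def load_zcat(zcatalog=None, targetpath=None, exposurespath=None,
                  fiberassignpath=None):
        """Hypothetical signature: one input per file type, since real
        data may not share a single common basedir the way the data
        challenges do.
        """
        pass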

All of the above assumes that desispec.database.datachallenge will evolve away from being specific to data challenges and will serve as the prototype code for loading the real spectro database from real DESI data.

weaverba137 commented 6 years ago

Revisiting this post-holiday:

I'm going to make the command decision that it is much more important and a better use of time to load directly from the zcatalog file. Are there existing Issues that describe the remaining problems, if any, in getting the needed data into the zcatalog file?

> Related: standard pipeline processing (but not yet the integration test) will use desi_zcatalog to pre-concatenate the zbest files into a single $DESI_SPECTRO_REDUX/$SPECPROD/zcatalog-${SPECPROD}.fits file which could be used by load_zcat instead.

Why can't the integration test do this? Why not make the integration test do what the pipeline does?

sbailey commented 6 years ago

The integration test could do this, please go ahead and update it to do so. That was a statement of the current state of affairs, not a statement about how it should be or must be. Loading from zcatalog is indeed better than rescanning all the zbest files. I'm not aware of any remaining issues that would prevent that.

weaverba137 commented 6 years ago

OK. And just to be sure, by "integration test" we still mean old_integration_test?

sbailey commented 6 years ago

Yes, this is about the old_integration_test. We haven't yet integrated the DB loading into the full pipeline scripts which are what is tested by (non-old) "integration_test". That can be future work.

weaverba137 commented 6 years ago

One more question: when you were testing this loading, which branch were you on?

sbailey commented 6 years ago

I believe it was desispec master just prior to the 0.17.1 tag. It still has the problem in current master, e.g. see the end of the log from the dailytest cronjob last night:

/global/cscratch1/sd/desi/dailytest/spectro/redux/dailytest/dailytest.log

...
INFO:datachallenge.py:617:setup_db: Begin creating tables.
INFO:datachallenge.py:621:setup_db: Finished creating tables.
INFO:datachallenge.py:409:load_zcat: Using zbest file search path: /global/cscratch1/sd/desi/dailytest/spectro/redux/dailytest/spectro/redux/dc17a2/spectra-64/*/*/zbest-64-*.fits.
ERROR:datachallenge.py:412:load_zcat: No zbest files found!
...
weaverba137 commented 6 years ago

OK, thanx.