Organize the directory and data structure on shrek.caltech.edu

awhoward commented 3 years ago

We need to organize the directory/data structure on shrek.caltech.edu for KPF data. Below are some initial thoughts, all of which are up for discussion. Let's think about what structure will facilitate ease of use of the data, interaction with KOA, and use of the data by the DRP. I suggest that folks respond to this thread and we might also discuss over Zoom. After we have some consensus, let's migrate this onto a Confluence page.

As a point of reference, for HIRES data we have essentially three directories for Level 0, 1, and 2 data (although we don't use that nomenclature): /data/mir3/raw/ - 2D echelle spectra /data/mir3/iodspec/ - 1D reduced spectra (and separate directories for the blue and red chips, plus dirs for deblazed, etc. spectra) /data/mir3/vel/ - time series RVs

The mir1/mir3/mir4 structure for the Hamilton/HIRES/Levy spectrometers is historical and I don't think we need to replicate it for KPF. I propose using /data/kpf/ as the root of KPF data.

Our primary copy of the KPF data should be on shrek.caltech.edu and it could be cloned elsewhere using rsync, etc. or an updated version of cps-utils. I think we should keep instruments from current instruments (HIRES, APF) on cadence.caltech.edu'. A gray area is where data from instruments like NEID and MAROON-X should go. On the one hand,cadence.caltech.edu` should continue to be our primary computing facility for CPS. On the other hand, we may want to use the KPF DRP to reduce NEID/MAROON-X data. (We don't have to settle this issue now.)

For KPF, I think we should continue to segregate data by type, especially keeping the raw data in a separate part of the directory structure. These files will be huge and we may have a different backup strategy for them (e.g. KOA might be our backup). A problem with the current flat /data/mir3/raw/ is that with ~10^5 raw files from HIRES, the OS has trouble doing simple operations in that directory (like ls). So perhaps we went to use data codes? (Using UT dates keeps all nighttime spectra in a single day.) In this scheme, raw files would be in: /data/kpf/raw/210113/ /data/kpf/raw/210114/ etc.

Is there a downside to this? I suppose one is that you have to know the date to find a file.

What will the Level 0 KPF raw files names be? For all previous instruments we used names built off of run numbers (runs are a set of consecutive nights when CPS operates HIRES) and frame numbers, e.g. j1235678.fits for run = j123 and frame number = 5678. For KPF, the concept of a run is going away because of the queue and automated operations. We could use a file name format that is instrument/data-type + date + frame number, e.g. kpf0_210113_5678.fits. 'kpf0' identifies the data type (KPF Level 0), '210113' gives the UT date (which is redundant with the directory name, but this seems helpful for orphaned files), and '5678' is the frame number (note that we can take a maximum of 8640 exposures in a day with a 10 second read time so four digits is sufficient). In this naming convention, file names are relatively long but clear.

@arpita308 -- could you remind us if the other raw files (H&K, exposure meter, guiding(?)) are stored as separate files or as extensions of a master raw file? I can't remember.

For Level 1 data product, they could be stored in /data/kpf/lev1/210113/ /data/kpf/lev1/210114/ etc.

The file names should describe the data type. Are we planning to have green and red reduced spectra in the same Level 1 file? If so, a possible file name could be kpf1_210113_5678.fits or else kpf1g_210113_5678.fitsand kpf1r_210113_5678.fits.

For all of our previous instruments, the Level 2 data products (the RVs) are different because they are time series. I think we want to continue that with KPF, but let's discuss. KPF will be more complicated, because Level 2 will include tabular data (time, RV, sigma_RV, activity indicators) and more complicated information like the CCFs. KPF data will also have more than one RV for each spectrum (e.g., RVs for each order/slice combination). So maybe we want a single Level 2 data product per spectrum and also a time series for each star? I'm not sure.

The above notes also assume that there is a single reduction of the data. In practice we'll want an adopted reduction made with a well-tested version of the DRP. But we should allow for other reductions, e.g. line-by-line, updated versions of the DRP, etc. Maybe these would go in something like: /data/kpf/lev1/linebyline/210113/ /data/kpf/lev1/linebyline/210114/ and /data/kpf/lev1/DRP_v1.2.3/210113/ /data/kpf/lev1/DRP_v1.2.3/210114/ A downside is that /data/kpf/lev1/210113/ and /data/kpf/lev1/DRP_v1.2.3/ would be at the same level of the directory tree. This seems inelegant.

What about log sheets? I suspect that we're still going to want them, even with automated operations. (APF is automated and log sheets are helpful.) What are the helpful columns that we might include in a KPF log sheet? One that is missing from our current scheme would indicate if data is good or bad (this column might be called 'quality'). We sort of do this now by naming the object 'junk' in the log sheet for bad data. But sometimes bad quality calibration data is just not included in the log sheet (e.g. the focus sequence on HIRES); with KPF this will undoubtedly cause confusion for KOA users; I think it's better to identify all data written to disk as good or bad.

Are there other major data types that we need to think about?

I'm not sure where the wavelength solutions will be stored or what the right format is. @arpita308? Perhaps that's part of the standard Level 1 data?

Tagging @ashbake, @rrubenza, @petigura, @howardisaacson as they may be interested.

I suggested in the KPF DRP meeting that we might discuss this on the Jump Github Issues, but I think the issue is more general than Jump so let's discuss it here.

petigura commented 3 years ago

Hi, @awhoward.

I'm hoping that with our team's increased reliance on relational databases, our unix-level access of raw files will diminish. That said, we should think carefully about a sensible layout. I agree that we should move away from the /mir3/raw format as it strains the OS.

One thing I've seen on various astronomical archives where files are hierarchically grouped based on their name. For example, on the exoplanet archive, the data products associated with a particular star are stored in a depth-3 hierarchy.

https://exoplanetarchive.ipac.caltech.edu/data/KeplerData/010/010797/010797460/dv/kplr010797460-001-20130826182306_dvs.pdf

There is a folder of all KIC stars with 010 as the first three numbers 010797 as the first six, and 010797460 which uniquely identifies the star. In contrast, the MAST uses a depth-2 hierarchy:

https://archive.stsci.edu/missions/kepler/dv_files/0098/009850387/kplr009850387-01-20160209194854_dvs.pdf

UT date seems like a logical scheme for keeping the number of exposures per folder to O(10^3.)

As far as different reductions go. I'd recommend tying different reductions to unique git tags, so they are are traceable to a unique git SHA. Should we put the DRP identifier before the level directory? .e.g

/data/kpf/v1.2.3/lev0/210113/
/data/kpf/v1.2.3/lev0/210114/
/data/kpf/v1.2.3/lev1/210113/
/data/kpf/v1.2.3/lev1/210114/

I understand the desire to not have many different level0 products floating around because these will use a lot of disk space, but we could be smart about using symlinks.

awhoward commented 3 years ago

Thanks, @petigura. That's a good suggestion about the three- or two-level hierarchy.

Using symlinks seems like a good approach. We could define a "current" reduction with a symlink to a particular standard reduction: /data/kpf/current/ -> /data/kpf/v1.2.3/

bjfultn commented 3 years ago

Here is the file structure that I propose. It is similar to Eric's above with a few modifications.

/data/kpf/L0/20211014/
/data/kpf/L1/v1.2.3/20211014/
/data/kpf/L1/v1.2.3/20211014/
/data/kpf/L1/v1.2.3/20211015/
/data/kpf/L2/v1.2.3/20211014/
/data/kpf/L2/v1.2.3/20211015/

I don't think we need to keep track of versions for the L0 files. There is no way to re-create an L0 file so there is always only one version.

This means that if we want data level to be at the same directory level then we need to put the "L1", "L2", etc. in the same directory as L0 and we must store the version info after that. This also allows for re-processing runs where only the L2 files are updated and not the L1 files.

I don't think that we need to use the git SHA for anything in the directory structure since we are following a rigid software development process and every update to the pipeline that produces data that goes into these directories will have a tagged version number.

bjfultn commented 3 years ago

For reference here is what we decided on for the NEID archive disks.

Level 0:
/neid/archive/raw/yyyy/mm/dd

Levels 1 and 2:
/neid/archive/sci/yyyy/mm/dd/l<#>/pv8
/neid/archive/cal/yyyy/mm/dd/<caltype>/cv13
where # = 1 or 2, and caltype = 'bias', 'flat', etc. (unique across levels).  "pv" stands for "processing version".  "cv" stands for "calibration version".

arpita308 commented 3 years ago

Good proposal @bjfultn — I'm on board. Very tidy directory structure.

awhoward commented 3 years ago

Thanks, @bjfultn. I on board with this.

As a minor amendment, let me suggest that we use symbolic links to track the current adopted version, e.g. /data/kpf/L1/current/ -> /data/kpf/L1/v1.2.3/ which would be updated to /data/kpf/L1/current/ -> /data/kpf/L1/v1.2.4/ when v1.2.4 becomes current.

This means that various scripts that just want the latest version of the data don't need to know the precise version number.

bjfultn commented 3 years ago

Indeed, I think this is easiest to manage by hand

awhoward commented 3 years ago

One question: are we ever going to partially process an L0 file, which would result in another L0 file and necessitate directories like /data/kpf/L0/v1.2.3/20211014/? Maybe that would be a special case that wouldn't be part of the official processing.

bjfultn commented 3 years ago

I think that if it is partially processed then it is no longer a L0 file. It may be something like an L0.5 and I don't think we should store these undefined intermediate data products in this directory structure. In my opinion, only ICD-compliant L0, L1, or L2, products should go here.

awhoward commented 3 years ago

To keep things clean, perhaps only the pipeline (and possibly a small set of trusted users) should have write permissions for the above directly. By analogy, /mir3 is a bit of a mess because different people have different ideas about what can appropriately go in what directories.

bjfultn commented 3 years ago

I set up the /data/kpf/L? directories so I think this is done. I'll wait to set up the version and date directories until we have some data to put in there.

Keck-DataReductionPipelines / KPF-Pipeline

Organize the directory and data structure on shrek.caltech.edu #148