Improve and Clarify Organization of Raw Run2.1 imSim files

villarrealas commented 5 years ago

Instance catalogs are currently generated and sorted into quarter year sets (based on obsID). When generating outputs for imSim, we preserve this structure, leading to output directories such as: 00445379to00497969/00478822/
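
For illustration, a minimal Python sketch of how such a path could be parsed and rebuilt. The function names are hypothetical, and treating each range as half-open (lower bound included, upper bound excluded) is an assumption; the thread does not say how visits on a boundary are assigned.

```python
import re
from pathlib import PurePosixPath

# Matches names such as "00445379to00497969" (two zero-padded 8-digit obsIDs).
RANGE_RE = re.compile(r"^(\d{8})to(\d{8})$")

def parse_range(dirname):
    """Return the (start, end) obsIDs encoded in a range directory name."""
    m = RANGE_RE.match(dirname)
    if m is None:
        raise ValueError(f"not a visit-range directory: {dirname!r}")
    return int(m.group(1)), int(m.group(2))

def visit_path(range_dir, obs_id):
    """Build '<range>/<visit>' after checking the visit lies in the range.

    Assumption: the range is half-open, start <= obs_id < end.
    """
    start, end = parse_range(range_dir)
    if not start <= obs_id < end:
        raise ValueError(f"visit {obs_id} is outside {range_dir}")
    return PurePosixPath(range_dir) / f"{obs_id:08d}"

print(visit_path("00445379to00497969", 478822))
```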

Currently this is wrapped up in two more layers of directory structure on NERSC: one referring to the fact that these are outputs and the other referring to the "batch" in some way. The relevant recent example is: Run2.1i-y2-wfd/outputs/...

This MIGHT not be the best final division for humans though, as evidenced by confusion about this batch, which consists of BOTH y1 and y2 data. My naive suggestion would be to either:

1) Join all final Run2.1i files as Run2.1i (with the exception of old PSF models). Keep the internal quarter year division by obsID.
2) Separate all Run2.1i files by year (again, with the exception of the old PSF models). Remove the internal quarter year division in favor of just bunching the whole year together.

The latter makes it slightly harder to backtrack to the instance catalogs in an automated fashion, but might be a little more human readable if one wants to focus on a specific year.
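
If the quarter directories were dropped, the backtracking could still be automated with a small lookup table. The sketch below is only an illustration: the boundary values are taken from the visit-range directory names that appear in this thread, `quarter_for` is a hypothetical helper, and the half-open convention for shared boundaries is an assumption.

```python
from bisect import bisect_right

# Quarter boundaries read off the visit-range directory names in this thread.
BOUNDS = [0, 71840, 133541, 201989, 262897, 327707, 385844, 445379, 497969]

def quarter_for(obs_id):
    """Return the range directory name that would hold obs_id.

    Assumption: ranges are half-open [start, end), because adjacent
    directory names share a boundary value.
    """
    if not BOUNDS[0] <= obs_id < BOUNDS[-1]:
        raise ValueError(f"obsID {obs_id} is outside the known ranges")
    i = bisect_right(BOUNDS, obs_id) - 1
    return f"{BOUNDS[i]:08d}to{BOUNDS[i + 1]:08d}"

print(quarter_for(478822))
```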

airnandez commented 5 years ago

I would like to comment on this, but first I should state that I personally have a strong preference for naming things so that the room for ambiguity or confusion is as small as possible.

With that in mind, I see two approaches.

1️⃣ If we want to make explicit (for humans) the year of the simulation campaign a given visit range belongs to, I would propose the following namespace organisation:

$ cd $TOPDIR
$ tree Run2.1i/
Run2.1i/
`-- sim
    |-- year1-wfd
    |   |-- 00000000to00071840
    |   |-- 00071840to00133541
    |   |-- 00133541to00201989
    |   `-- 00201989to00262897
    `-- year2-wfd
        |-- 00262897to00327707
        |-- 00327707to00385844
        |-- 00385844to00445379
        `-- 00445379to00497969

2️⃣ If making the year of the campaign explicit is not necessary in the dataset, we could just keep the visit ranges one level up:

$ cd $TOPDIR
$ tree Run2.1i/
Run2.1i/
`-- sim
    |-- 00000000to00071840
    |-- 00071840to00133541
    |-- 00133541to00201989
    |-- 00201989to00262897
    |-- 00262897to00327707
    |-- 00327707to00385844
    |-- 00385844to00445379
    `-- 00445379to00497969

Note that I removed the outputs directory, which I think could be confusing when navigating the data and which I don't think is useful at this level.

In addition, I suggest the location of the tracking database file be at the appropriate level, according to the files it tracks. I mean, if we are going to have one tracking database per simulated year, then I would locate the database file in directory .../Run2.1i/sim/year1-wfd. If the tracking database is intended to track the whole simulation campaign (i.e. 10 years), then I would put it in .../Run2.1i/sim.
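
The two layouts and the database placement described above can be sketched as follows. This is only an illustration under stated assumptions: `sim_dir` and `tracking_db_dir` are hypothetical helpers, the paths are relative to $TOPDIR, and the year assignment is copied from the first tree listing above.

```python
from pathlib import PurePosixPath

# Year assignment copied from the first tree listing (approach 1).
YEAR_OF_RANGE = {
    "00000000to00071840": "year1-wfd",
    "00071840to00133541": "year1-wfd",
    "00133541to00201989": "year1-wfd",
    "00201989to00262897": "year1-wfd",
    "00262897to00327707": "year2-wfd",
    "00327707to00385844": "year2-wfd",
    "00385844to00445379": "year2-wfd",
    "00445379to00497969": "year2-wfd",
}

BASE = PurePosixPath("Run2.1i") / "sim"

def sim_dir(range_dir, by_year=True):
    """Directory of a visit range: approach 1 (by_year) or approach 2."""
    if by_year:
        return BASE / YEAR_OF_RANGE[range_dir] / range_dir
    return BASE / range_dir

def tracking_db_dir(year=None):
    """Directory holding the tracking database.

    One database per simulated year lives next to that year's ranges;
    a campaign-wide database (year=None) lives directly under sim.
    """
    return BASE / year if year else BASE

print(sim_dir("00262897to00327707"))
print(sim_dir("00262897to00327707", by_year=False))
print(tracking_db_dir("year1-wfd"))
```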

Also, if it makes sense to separate data according to the field (e.g. WFD, UDF), I would suggest we store the visits under directories .../Run2.1i/sim/wfd and .../Run2.1i/sim/udf.

These suggestions don't take into consideration how difficult they could be for @villarrealas to implement in his production workflow, which should also be taken into account.

wmwv commented 5 years ago

Would it be correct to rename this Issue from "Raw imSim Storage Format" to "Improve and Clarify Organization of Raw Run2.1 imSim files"?

heather999 commented 5 years ago

We should finalize this discussion. I agree that the outputs subdirectory is not useful and should be omitted. At NERSC we will be storing all data under /global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.1i. Under that, we could create a sim directory; @jchiang87 has previously referred to this as raw_files. I personally do not care what we call it, but I want some agreement to be reached. sim might be clearer. We should stick with the naming convention of y1, y2, etc., as discussed some weeks ago, rather than year1, year2, etc.
We had agreed to refer to this 2nd batch as y2-wfd. We could consider referring to it as y1-y2-wfd if that provides more clarity and also allows us to maintain some ease in going back to the instance catalogs, rather than artificially separating the y1 and y2 data. Going forward, do we anticipate that the next batches of Run2.1i data will be processed in one year increments? Are we generating WFD and uDDF separately? Will it be easy to store them under separate areas? And is that what we want?

So, I might propose at NERSC:

  --sim
    |--   y1-y2-wfd
    |--   y3-wfd
    |--   y3-ddf
    |--   y4-wfd
    |--   y4-ddf

heather999 commented 5 years ago

@jchiang87 confirmed at the DM-DC2 mtg today that we can and should separate the data by WFD and DDF. I'll also note here that last week, @cwwalter mentioned it would be nice to at least have the object catalogs supplied by the data access team split by year. That doesn't directly mean we have to divide y1 and y2 sim data, but since @villarrealas suggested it is possible, perhaps we should go ahead and do that.

boutigny commented 5 years ago

@jchiang87 confirmed at the DM-DC2 mtg today that we can and should separate the data by WFD and DDF.

What do you mean exactly? Should they be split at the raw image level only, or at the catalog level too? Regarding the catalog, I thought (but I may be wrong) that we said that, as DDF is included in the WFD footprint, we should not distinguish between WFD and DDF in the workflow. Regarding y1, y2, ... I suppose that we should produce 1 catalog for y1, then 1 catalog for y1+y2, and so on, mimicking what LSST will do. Is that what you have in mind?

jchiang87 commented 5 years ago

As usual, we need some documentation to know exactly what the simulated image data will comprise, but I had understood that the DDF visits will essentially be a separate simulation with a distinct cadence and would only include sensor-visits that overlapped with the DDF field. The WFD fields that overlap the DDF region would use the default cadence in the DESC version of the minion 1016 opsim db file.

boutigny commented 5 years ago

Ok, but does it mean that we should do a separate processing for DDF and WFD resulting in 2 sets of catalogs? I thought that the cadence was only relevant for the difference imaging pipeline. Am I missing something?

jchiang87 commented 5 years ago

does it mean that we should do a separate processing for DDF and WFD resulting in 2 sets of catalogs?

That is my understanding.

I thought that the cadence was only relevant for the difference imaging pipeline.

I had a misunderstanding of what the DDF cadence and visits were meant to include, and I still don't have a definitive picture in my mind, but I don't think this is true, since the DDF visits need to have different dithering from the WFD versions in order for the DDF region to be fully covered by the DDF visits. Again, we need a more complete description of what is being planned for the DDF visits.

boutigny commented 5 years ago

I am moving this discussion on the reprocessing strategy to issue #93.

heather999 commented 5 years ago

I see no conclusion yet in #93, but we must get this batch2 data over to IN2P3. As of today, we are only dealing with WFD data and the first two years are stored together under one directory. What should we do to reach some conclusion - even if it is a temporary measure? My interest is just to get the transfer started. We can rename the Run2.1i-y2-wfd directory and place it under a sim subdirectory, like

sim
|--   y1-y2-wfd
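
A minimal Python sketch of that rename, run here only in a scratch directory. The helper names are hypothetical, and it assumes the source and destination sit on the same filesystem so that a plain rename is enough; it is not the production workflow.

```python
import tempfile
from pathlib import Path

def plan_moves(src_root, dst_root):
    """Pair each range directory under src_root with its new location."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    return [(d, dst_root / d.name)
            for d in sorted(src_root.iterdir()) if d.is_dir()]

def apply_moves(moves):
    """Carry out the planned moves, creating parent directories as needed."""
    for src, dst in moves:
        dst.parent.mkdir(parents=True, exist_ok=True)
        src.rename(dst)  # assumes same filesystem

# Demo in a temporary directory, so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "Run2.1i-y2-wfd" / "outputs"
    (src / "00445379to00497969" / "00478822").mkdir(parents=True)
    dst = Path(tmp) / "Run2.1i" / "sim" / "y1-y2-wfd"
    moves = plan_moves(src, dst)   # inspect this list for a dry run
    apply_moves(moves)
    moved = sorted(p.relative_to(tmp).as_posix() for p in dst.rglob("*"))

print(moved)
```

Printing the list returned by `plan_moves` before calling `apply_moves` gives a dry run of the reorganization.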

@Antonio, can you comment on how long it might take to separate year 1 and year 2 data into separate directories?

boutigny commented 5 years ago

@heather999 I agree with your proposal. We need to move on now.

johannct commented 5 years ago

Shouldn't we separate y1 and y2 if we want to proceed with 6 months and then 1 year of data?

katrinheitmann commented 5 years ago

@villarrealas I thought you had already separated things into subsets of years (quarterly or something)? Or was that just the instance catalogs?

villarrealas commented 5 years ago

They're currently subset quarterly. Let me restructure things by year (should be very quick) and we should be good to go.

Edit: Never mind. It has been requested to leave things as is, so I will hold off on this.

johannct commented 4 years ago

@villarrealas @airnandez Can we close this issue?

LSSTDESC / ImageProcessingPipelines #90