Decide on Run2.1 processing strategy

boutigny commented 5 years ago

I open this issue to discuss and decide on the Run2.1 data processing strategy. In a first step this discussion is restricted to the current workflow which is not including DIA. We have:

WFD and DDF fields and DDF is included inside WFD
Data for y1 and y2 are grouped all together but can be separated if necessary
One dataset per year and per field for years > 3

In the following, when I write "process", I mean something which is producing a well identified independent set of catalogs. So the questions are:

should we process WFD and DDF independently ?
should we process each year independently, possibly with y1 and y2 grouped together ?
should we mimic LSST DRP by running independent processing for y1, y1+y2, y1+y2+y3, ... ?

wmwv commented 5 years ago

I suggest that we should

Produce separate WFD-only and a DDF data product
- The DDF-only could/should include the WFD observations in that region.
Mimic LSST DRP
- At a minimum that's y1 and y1+y2+y3.
- We could consider further mimicking the plans by processing/releasing: DR1: First 6-months of data DR2: y1 DR3: y1+y2 DR4: y1+y2+y3

boutigny commented 5 years ago

I would like to understand the rationale to have 2 distinct catalogs for DDF and WFD. Selecting DDF in a global catalog is just a cut on a ra, dec bounding box ?

johannct commented 5 years ago

I think it has to do with the fact that the DDF is being tailored to SN needs for light curve investigations, and that the resulting simulation might be sufficently discrepant with WFD that we prefer to keep them separated.

johannct commented 5 years ago

I add that there will be a few hundred visits with unrealistically good PSF, and I think it may be educative to run calexp on them, but I am not sure we want to coadd them with the rest. So I wonder whether they should also be separate. Comments?

johannct commented 5 years ago

By the way we need other decision takers involved in all this.... @rmandelb @katrinheitmann ?

jchiang87 commented 5 years ago

We need to make a distinction between two different kinds of "DDF" visits:

1) There are ~26,000 visits that overlap with the DDF region in the DESC version of the minion 1016 opsim db. The instance catalogs for these visits are being generated and put in /global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.1i/instCat_ddf_minion at NERSC. These instance catalogs will have the observation times and dithering offsets that were derived by Humna et al..

2) In parallel, the transient science groups have come up with an alternative observing sequence for the visits that overlap with the DDF. These visits have modified observing times and restricted dithering offsets so that the DDF region is fully contained within the FOV for these visits. There are 26,378 visits in total, and the sensors to be simulated for those "shuffled" visits will only include those that intersect the DDF region. These instance catalogs have been produced in full and are in /global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.1i/instCat_ddf_shuffled at NERSC. To avoid confusion, these instance catalogs have visit numbers starting with 2500000; minion 1016 only goes up to 2400000 or so.

For the non-DIA processing we should include the instCat_ddf_minion visits with the WFD visits and process them all together as if they were one coherent dataset, subdividing into DR1, DR2, etc. as appropriate. For all practical purposes, we can simply regard all visits that use the DESC minion 1016 pointing info as "WFD visits".

For the DIA processing, my understanding is that only the instCat_ddf_shuffled visits would be used.

boutigny commented 5 years ago

We discussed this issue during today's DM-DC2 telecon. It seems to be a good plan to process the first 6 months (WFD) as soon as possible and to deliver a first set of catalogs to the AWGs. It will allow us to go through the whole workflow and to find and fix the remaining problems. But then, we still have to decide how to proceed for the next step. I would especially like to understand if there is a scientific interest in having a separate catalog for the 1st year only or if we can simply produce y1+y2 all together.

cwwalter commented 5 years ago

Hi All,

I'm in Japan now so wasn't at the meeting this morning. I think as long as you have a way to clearly ask for just the 1st year of visits in one uniform catalog that should be equivalent/enough. You can always produce reduced data sets from it if that is necessary. I think the only real requirement is that we should be able to study what 1 year of data looks like in a straight-forward way.

wmwv commented 5 years ago

@cwwalter Few people want the sets of individual visits. Most people want the static products from the coadds. Producing a catalog for 6-month, 1-year, and 2-year requires making coadds for each of those, running detection for each of these, and producing catalogs. That was @boutigny 's question.

cwwalter commented 5 years ago

Ah right... Forgot about the co-adds. Thanks. I think it probably is worth having yr1 co-adds given that we will probably want to do analysis on the yr1 data release.

wmwv commented 5 years ago

I agree with the plan to produce a 6-month product. I think the argument to produce a year-one product is a programatic one rather than scientific. We talk a lot about our year-one analyses as a concrete and first goal. But, to be honest, talking about "Year One" analyses intentionally elides the distinction between analyses based on the first year of data (which might be available 6-12 months after the first year) or analyses based on the data published in the first year (which will be based on the the first 6 months of observations).

rmandelb commented 5 years ago

The DESC SRD Y1 forecasts were quite specifically for @wmwv 's first option, but I agree it's not obvious whether our first science will in practice be based on 1-year vs. 6-month coadds. Moreover, regardless of which is used for our first cosmology analysis, it's not clear to me that the near-term DC2 science papers specifically require Y1 catalogs. For example, for WL/LSS/PZ/CL it's very likely useful to have a more shallow and eventually a deeper option, to explore analysis pipeline performance in different regimes where different systematic uncertainties dominate. Whether that shallower/deeper combination needs to be Y1 and then something deeper, or 6 months and then something deeper, is not clear to me (especially given there's a practical reason already to start with 6 months). Would it be useful for me or @reneehlozek to ping some of the static analysis teams with this query? Possibly the most useful framing is "If you will have object catalogs based on 6 months and 2 year coadds, is it also important for your DC2 science papers to have a 1-year coadd?"

Also it's valuable for us to remember that this is about the DC2-era science with these simulations. We may decide that an important activity during the DC3 era is reprocessing DC2 sims with an updated version of the DM stack in 1.5 years from now, and at that point we'd also have the freedom to choose different time periods for coadds.

wmwv commented 5 years ago

Infused with clarity by the wisdom of our Analysis Coordinator, I advocate for

6-month process (DR1 analog)
y1+y2 processing (DR3 analog)

boutigny commented 5 years ago

In any case this can be seen as a short/medium term strategy, we can always produce y1-only later if necessary.

wmwv commented 5 years ago

@johannct Can you help me check my understanding:

v1 of the first 6-month catalog was processed. These data have been transferred to NERSC and include single-frame, coadd, object catalogs, and metacal catalogs.
A new processing of a different 6-month set of data is now running. This has completed single-frame and is now going through coadd (multiBandDriver) processing.
- I understand this new processing is still being called "v1".
- I would like to kindly request that this be called "v2" to help distinguish from the previous processing where data products have been distributed.

Is my understanding correct? What will be necessary to ensure that the current 6-month processing is called "v2"?

johannct commented 5 years ago

@wmwv I initially only computed calexps for the first 6mths of y1, and then built the coadds. This is v1. I then completed calexps for all y1-y2-wfd into v1 (because I need to have them all in the same place to do coadds on the full 2 years as planned). So the new coadds for the new 6mths will be a v2, but not the calexps. And I plan to put y3 calexps in v1 as well. Dealing with several calexp dir seems hard to me, but I have not experimented. (even if I could do calexp-v1:calexp-v2:coadd-v2, I would need to build into the pipeline a non trivial rerun logic to get it right at all steps, it is not worth it as all this goes away with gen3)

johannct commented 5 years ago

(and I did not start the coadds for the new 6mths yet....)

wmwv commented 5 years ago

Thanks, @johannct . That totally makes sense to me.

wmwv commented 5 years ago

With thanks to @johannct and @yymao for helping me think more clearly about this, I propose:

We should use labels such as (dr1, dr2, dr3, y1, y2, ...) to refer to different choices of subsets of data to analyze.
We should use v1, v2, v3 to refer to different pipeline processing versions/configs.

DPDD Label	DM Rerun Label	Time Range	Description
dr1a_v1	coadd-dr1a-v1	First 6 months	The first released products from the Run 2.1i processing (currently on disk as `coadd-v1`).
dr1b_v1	coadd-dr1b-v1	A different 6 months	The new set of products that should be completed this week. This time range doesn't have quite a logical equivalent in the planned LSST processing, but in the sense of being a processing of 6 months of reasonable data it is a good analog for what we may see in 300 sq. deg of DR1.
dr2_v1	coadd-dr2-v1	First year of data
dr3_v1	coadd-dr3-v1	First two years of data
dr4_v1	coadd-dr4-v1	First three years of data	Suggested as the next processing to run.

...

At this point I believe we should plan to produce dr1a_v1 (done), dr1b_v1 (in process), and then skip to dr4_v1.

I.e., specifically, the imminent processing should not be coadd-v2 but rather coadd-dr1b-v1.

@boutigny @johannct @jchiang87 Does this sound reasonable to you?

johannct commented 5 years ago

ok for coadd-dr1b-v1 for the upcoming new 6 mths. Note that dr2-v1 shares the same issue as dr1a-v1 : it is a very depleted first year. I am fine with launching 3 years immediately afterwards, but shouldn't we make sure that this is ok with people investigating DIA?

wmwv commented 5 years ago

shouldn't we make sure that this is ok with people investigating DIA?

Do you mean for the purpose of generating template coadds for subtraction?

wmwv commented 5 years ago

Generating optimal (or even good) coadds for template subtraction remains a separate thing that we will have to do. For templates we are happy to take a significant loss of depth in order to have well-controlled and small PSFs.

johannct commented 5 years ago

I know, but they might want to resort to something intermediary...

JoanneBogart commented 5 years ago

Concerning @wmwv 's table above - the use of dr and v makes sense. I don't see anything corresponding to the 2 in Run2.1 (nor anything to indicate Imsim rather than Phosim, but if all future releases are Imsim I don't mind dispensing with that).

yymao commented 5 years ago

I thought those names are in addition to Run2.1i? For example, the object catalog in GCRCatalogs would be named as dc2_object_run2.1i_dr1a_v1?

wmwv commented 5 years ago

The full name of the catalog as provided in the GCR would be dc2_object_run2.1i_dr1a_v1

JoanneBogart commented 5 years ago

Ah, ok. I missed that somehow.

JoanneBogart commented 5 years ago

It doesn't quite carry over to Postgres schema naming, at least as I've done it so far. Objects and forced source for run1.2p_v4 are in the same schema.

yymao commented 5 years ago

For Postgres schema I think it only needs to start with run... (i.e. get rid of dc2_object_). But we should also make sure that the table name object, source etc are consistent with GCRCatalogs.

JoanneBogart commented 5 years ago

Ok. That can easily be changed for future releases, starting with upcoming v2. Also for schema name I might want to substitute something else for . to avoid having to quote in interactive use (not an issue from Python), but if that is too confusing for users (different conventions for GCR and Postgres) I can live with it as is.

johannct commented 4 years ago

archived

LSSTDESC / ImageProcessingPipelines

Decide on Run2.1 processing strategy #93