Step 4.0: Basic End-to-End

This is the first 'end-to-end' test.

Simulation Component: GSM, GLEAM, EoR (powerlaw), gains, xtalk, reflections, RFI
Simulators: hera_sim, RIMEz
Pipeline Components: redcal, smoothcal, abscal, xRFI,pspec`
Depends on: #41

Why this test is required

This is the most basic end-to-end test: using basic components for every part of the simulation, but enough to include all analysis steps.

Summary

A brief step-by-step description of the proposed test follows:

Obtain FG and EoR visibilities in the same way as #16.
Add gains in the same way as #16.
Add systematics in the same way as #22
Add RFI in the same way as #41
Run full makeflow pipeline on the simulated data

Simulation Details

Freq. range:
Channel width:
Baseline/antenna configuration:
Total integration time:
Number of realisations:

Criteria for Success

P(k) matches known input to within 1%

This expands on #41 by adding crosstalk and cable reflections, correct? If so, then, sometime down the road, we should come to a consensus on which crosstalk model we want to employ, as well as which parameters we want to use for the cable reflections.

Regarding the criteria for success, this seems a bit ambitious, and I'm also a little confused on what exactly it means. First bit of confusion: are we taking the "known input" to be the model power spectrum, or the power spectrum from the perfectly calibrated foreground + EoR visibilities? Second bit of confusion/remark on ambitiousness: matches to 1% doesn't really make sense in light of results from #32; to summarize, the agreement between different power spectrum estimations depends on spectral window, and not even the EoR-only power spectra match the analytic expectation to 1% in any spectral window--so it's a little hard to say what a good "measure of correctness" metric would be, and what a satisfactory value of such a metric would be.

Thanks @r-pascua. This is exactly the kind of discussion I was hoping to generate by creating these issues.

I think we should let the whole team weigh in on this. I think what we want to validate is that in the "window" the power spectrum matches to a certain tolerance. I also agree that 1% is arbitrary, and perhaps a little ambitious. Certainly willing to change the requirement if we can define a better one.

So, in light of a recent update to #32, 1% may or may not be ambitious, depending on which outputs we're trying to match to which expected values. If we're trying to match any calibrated output to an expected (analytic) power spectrum, then 1% is overly-ambitious. If we're happy with the calibrated results (the analysis pipeline end-products) matching the perfectly calibrated results, then 1% might be a realistic goal, at least for abscal.

I agree that the whole team should weigh in on this. Hopefully we learn some lessons from the validation tests that precede this that will help us determine what "success" looks like.

Flow of End-to-End:

Specs of simulation:

LST 4-10
All idealized baselines for HERA H1C.
Fagnoni beams (identical for all baselines).
All frequencies of HERA.
Noise from hera_sim and inflated to all baselines. @steven-murray and @r-pascua to handle this.
Fiducial gains based on #16 (2.1). Every antenna has fixed bandpass. Same every night, but drift over time.
Reference visibilities are actual visibilities. @jaguirre has more details but I didn't hear them due to bad internet speed. Not the same as #16 since we shouldn't include EoR (but we do have noise), and smoothed at some scale.
No actual RFI (or xRFI), just use flags (ala #21). @jsdillon can excise xrfi from pipeline and insert flags.
Use 10.7 second cadence to align with H1C in LST so there's no interpolation etc. for flags.
Systematics:
- Reflections: assume delays stay the same, but amplitudes change with LST. Should drift. @nkern can specify further. Should be the same night to night. Every vis gets random phase, but consistent night-to-night.
- Cross-coupling: random phase over nights, but same amplitude/delay
@jsdillon to run the LST binning
@jburba to do the in-painting and systematic subtraction

This post contains updates regarding preparation of data from discussion with @steven-murray and @jsdillon, as well as a suggestion for the analysis to be performed, in accordance with discussions with @nkern and the larger validation group. Some aspects of this post are rough ideas of what should be done, and discussion is greatly encouraged—we still need to nail down some of the data preparation and analysis parameters.

Data Preparation

We will choose 10 days from the H1C IDR2.2 data release. For each day, we will construct two base data sets as follows:

One set of data will only contain foregrounds (eGSM + GLEAM + brights); the other will contain foregrounds and the boosted EoR (this has a 1/k^2 dependence, but is stronger than the original EoR simulation).
Each data set will be slightly rephased to the LSTs in the IDR2.2 data files, and their time_array attributes will be modified to match the IDR2.2 data.
Only a subset of the simulated data will be used: roughly 6 hours for each night, and only baselines that are present in the IDR2.2 data. Extra work will likely be required to get antenna numbers to match.
After downselecting and rephasing the data, it will be inflated by redundancy.
The downselected, rephased, and fully inflated data will be chunked into files with the same number of integrations per file as the IDR2.2 data.

I propose using the following naming convention for the files: zen.<jd>.<jd>.<cmp>.rimez.uvh5, where <jd>.<jd> is the decimal representation of the Julian Date of the first integration in the file, out to five decimal places, and <cmp> is the component (e.g. foregrounds or true; more options possible if we decide to save more simulation components). I intend to make a directory for these to live in on lustre: /lustre/aoc/projects/hera/Validation/test-4.0.0/data. If there are any dissenting opinions on this convention, please make them known.

The data set containing only foregrounds will be used for extracting a true upper limit (that is, it's supposed to represent the case where EoR is hidden by the noise floor). The other data set will be used to see if we can detect EoR in a case where it should be above the noise floor for at least some delays in most spectral windows. @jaguirre should add clarification on this point if deemed necessary or if any of the information stated is incorrect.

For each of the above data sets, we will corrupt the data according to the following routine:

Thermal noise will be added per-baseline; @steven-murray will write the routine to do this.
Crosstalk will be added to each cross-correlation visibility for a range of delays and amplitudes, with randomized phases. The crosstalk visibilities may be saved to disk if desired, but someone should explicitly request that this be done by commenting on this thread (otherwise they will not be saved to disk). @r-pascua will write a routine for this.
Bandpass and reflection gains will be calculated per-antenna and applied to each visibility. The true gains will be saved to disk for reference. Unless otherwise instructed, the gains will have their phases randomized on a nightly basis, but be consistent throughout a night. @r-pascua will write a routine for this. @jaguirre suggested drift in gain amplitudes on a nightly basis—some clarification on this would be appreciated (e.g. do the gain amplitudes drift linearly? sinusoidally? randomly? is there drift over a single night, or just between nights?).

Ideally, the routines for each step will exist as functions in hera-validation, and, unless someone else volunteers, @r-pascua will write a routine that incorporates all of these functions into a single corrupt_simulated_data routine (actual name TBD). Before committing to this routine for the full 10 days' worth of data, we will run a test on a single day's worth (~6 hours). @r-pascua can make some plots visualizing data, and we can show the greater collaboration for approval regarding the level of realism of the simulation.

Data Processing

We should write a makeflow for calibrating the corrupted visibilities that is based on the IDR2.2 pipeline. @jsdillon should be the authority on this, but @r-pascua has experience writing/running an analysis makeflow. The do scripts for the makeflow should live ~somewhere in hera-validation~ in hera_opm.

For absolute calibration, we should use the GLEAM + brights foregrounds, with some level of noise, smoothed out to some maximum delay. @jaguirre and @nkern should confirm or refute this point, adding extra detail as necessary (e.g. up to what delay are we smoothing?).

Important note regarding analysis: we have agreed to not test xRFI, but to mock up the xRFI step. Since the simulated data will be perfectly aligned with the real data, we can just drop the flags from the real data into the simulated data. Based on the Flowchart of Doom, we should add the RFI flags to the calfits files that come out of post_redcal_abscal_run.py before running smooth_cal (@jsdillon should correct me if this is not the right way to do it).

@jsdillon to perform (or help with performing) the LST-binning step post-analysis.

We should write our own YAML files for use with the power spectrum pre-processing and the power spectrum pipeline scripts in https://github.com/HERA-Team/H1C_IDR2/tree/master/pipeline. @nkern should be the authority on this; @jburba has experience working with the pre-processing pipeline (and soon should have experience working with the power spectrum pipeline?). The configuration files used for this step should live somewhere in hera-validation.

Results and Presentation

This is a rather large project and cannot be run in a notebook, but we can still use a notebook for visualizing the data products at each stage of the test. This test will also constitute the meat of the validation paper, so we want to think very carefully about how we will be presenting our results. Below, I pose some questions that I think are important to answer—please add questions if you think of additional questions worth asking, and please offer answers for any questions you think you can answer, either completely or partially.

How do we want to present our work? For each simulated systematic, do we want to have comparison plots that show the accuracy of the best-fit solutions for those systematics, assuming the solutions are retrievable from every step? For per-antenna systematics, do we want to devise a way to visualize the accuracy of the solutions for the entire array simultaneously? What about per-baseline effects? What will be our criteria for success? What do we want the reader to take away from the paper, and how do we visualize those points?

Closing Remarks

Over the next few weeks, we should come to a consensus on who is responsible for each part of this test, and what set of parameters will be used for each step. My understanding is that @jaguirre and @steven-murray should manage task assignment, @jsdillon should be the go-to person for questions regarding the analysis pipeline and LST-binning, and @nkern should be the go-to person for running the power spectrum pre-processing and estimation pipelines (although @acliu and @saurabh-astro should also be able to assist). My current understanding of task assignment is as follows:

@steven-murray to develop tool for simulating noise
@r-pascua to prepare data for analysis (i.e. apply systematics, chunk files)
Someone to perform analysis, from redcal through smooth_cal. @r-pascua has done this before, but is juggling multiple projects.
@jsdillon to perform LST-binning
@jburba to perform post-analysis, pre-pspec processing (in-painting and systematics removal)
Someone to perform power spectrum estimation. @r-pascua and @jburba should both be familiar with how to do this.
Someone to put together notebook, complete with descriptions of each step, parameters used for each step, and visualization of results. Ideally, the notebook will help facilitate writing the paper.

Some comments:

I think we don't have to inflate the original files. We can inflate when we add noise and systematics, right?
Should crosstalk be applied before/after gains? I was thinking after, but @nkern should know best.

Everything else to me makes sense. I think the real take-home point (if everything works as expected) should be that the output power matches the input. Given we have the flagging, we're going to have to have plots like @jburba's plots from 3.1 -- where we show the input/output power for many different cases. In this particular test, I'm not sure how useful it is to focus on any one systematic, since we've already verified that each step should work well. Obviously, if the test fails, we're going to have to dig further.

I am happy to help with putting the notebook together and doing the scaffolding.

This sounds like a grea plan.

A few thoughts:

The do scripts for the makeflow should live somewhere in hera-validation.

Actually, I think they should live in hera_opm. We'll make a new pipeline in the pipelines folder. I think that's a cleaner comparison, IMO.

For absolute calibration, we should use the GLEAM + brights foregrounds, with some level of noise, smoothed out to some maximum delay.

I don't think we strictly need to add noise to the abscal model. This feels like an unnecessary complication that we could skip. Our abscal model already has the lack of realism from not being CASA-calibrated data.

That said, if the EoR level is large, that might do weird things to the abscal if it's in the data and not in the model. Maybe it'll be fine... I'm not sure. A safer test would be to include the EoR in the abscal model (when the data has EoR in it).

Someone to perform analysis, from redcal through smooth_cal. @r-pascua has done this before, but is juggling multiple projects.

We can work together on this @r-pascua, though I'm fine taking the lead.

I think we don't have to inflate the original files. We can inflate when we add noise and systematics, right?

Agreed. Also, let's make certain that the noise level in the data matches the expected noise from the autocorrelations (see, e.g. this function: https://github.com/HERA-Team/hera_cal/blob/c704901d45104e8d61f5015afac5f222bf36cdcf/hera_cal/noise.py#L37 )

Thanks for the comments @steven-murray and @jsdillon! I have just a few responses to some of the points the two of you raised; you may assume that I agree and have nothing further to add to any points not addressed in this comment.

I'll update the original comment to reflect changes we agree should be made to the proposed plan of action. First change will address where the do scripts live—I'm happy to make this change without further discussion.

I personally think we should have a focused discussion on what we'll use for the abscal model, and who will create it, at the next telecon—I think a discussion led by @jsdillon, @jaguirre, and @nkern would be very productive on this end. Please push back on this point if you disagree or do not completely agree.

Note taken regarding inflating files when adding systematics. I'll be developing the file preparation script today (involving rephasing, relabeling of antennas, downselection to only include antennas present in IDR2.2 data, and chunking). Review by @steven-murray and @jsdillon would be appreciated.

HERA-Team / hera-validation