Calibration time lookup

jrs65 commented 2 years ago

I've added two routines to look up the calibration time: one from files (just pulled out of ch_pipeline) and one from the dataset IDs (new).

I need to test this a bit more, but I think it's mostly done.

ssiegelx commented 2 years ago

_I posted this comment in the wrong PR. Deleting it from the ch_pipeline PR and moving it here._

I've generated a new file that contains calibration times between 2018/09/02 and 2020/12/30. In the process, I fixed a problem with past files where the tref quantity was converted to a float32 during processing, which resulted in a difference of up to 60 seconds from the true transit time.
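For anyone wondering why float32 costs this much precision, here is a minimal sketch (not the code used to generate the file). Unix timestamps in this date range are around 1.6e9 seconds, where adjacent float32 values are 128 s apart, so rounding can move a time by up to 64 s. The timestamp below is an arbitrary illustrative value, not a real transit time.

```python
import numpy as np

# Arbitrary example Unix timestamp in late 2020 (illustrative only)
tref = 1603411200.0 + 61.0

# Round-trip through float32, mimicking the conversion in the old files
tref32 = float(np.float32(tref))

# Spacing between adjacent float32 values at this magnitude
spacing = float(np.spacing(np.float32(tref)))

print(f"float32 spacing at this epoch: {spacing} s")
print(f"round-trip error: {tref32 - tref} s")
```

Keeping tref as float64 throughout avoids the problem, since float64 resolves these timestamps to well below a microsecond.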

The file is currently saved on cedar at /scratch/ssiegel/chime/calibration_times/20180902_20201230_calibration_times.h5

Previously I saved these in the project space at /project/rpp-chime/chime/chime_processed/gain/ but it looks like I am no longer owner of that directory. Would you be able to return me as owner or grant me write privileges?

Here is the result of comparing the output of get_reference_times_file and get_reference_times_dataset_id for the periods where they overlap:

file_database_comparison_fix_float_fit_valid_expanded_v2.pdf

page 1: shows reftime, reftime_prev, interp_start, and interp_stop as a function of time. The file-based method is shown in black and the dataset-id-based method is shown in red.
page 2: the difference between the file-based method and dataset-id-based method, in units of hours, for times where both values are finite.
page 3: same as page 2 but zooming in on +/- 15 minutes and shown in units of minutes.
page 4: the difference between reftime and reftime_prev (top) and interp_stop and interp_start (bottom).
page 5: same as page 4 but zooming in on +/- 15 minutes around the expected value of each quantity.
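
The differences on pages 2 and 3 amount to something like the following sketch. It assumes both methods return arrays of Unix times in seconds on the same time axis; `difference_in_hours` is a hypothetical helper name, not a function in ch_util.

```python
import numpy as np

def difference_in_hours(file_based, dataset_based):
    """Difference (file - dataset id) in hours, NaN where either input is not finite."""
    file_based = np.asarray(file_based, dtype=np.float64)
    dataset_based = np.asarray(dataset_based, dtype=np.float64)

    # Only compare samples where both methods returned a finite value
    both = np.isfinite(file_based) & np.isfinite(dataset_based)

    diff = np.full(file_based.shape, np.nan)
    diff[both] = (file_based[both] - dataset_based[both]) / 3600.0
    return diff
```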

Below I summarize the discrepancies in the quantities being output.

reftime

missing files

There are two cases between 2020-10-22 and 2020-11-01 where the dataset-id-based method moves on to the next update, but the file-based method gets "stuck" on an old update.

The generation of the calibration time file involves searching the chimestack acquisitions for times where the gains change from one sample to the next, and then matching the gains at those times to the gains saved in the chimegain acquisitions. For some reason there are files missing from the 20201014T233944Z_chimestack_corr acquisition in the chime_online space. Some of these files contain gain transitions, and so those transitions are never registered during generation of the file. This is illustrated in the following figure that shows the start time of all available files in the chime_online space (blue) and compares to the gain update times (red):

20201014_acq_file_jumps.pdf
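
For reference, the first step of the search described above (finding samples where the gains change from one sample to the next) can be sketched roughly as follows. This is not the script used to generate the file. The `(nfreq, ninput, ntime)` layout and the name `find_gain_transitions` are assumptions for illustration.

```python
import numpy as np

def find_gain_transitions(gain, time):
    """Return the times of samples whose gains differ from the previous sample.

    gain : array of shape (nfreq, ninput, ntime)  (assumed layout)
    time : array of shape (ntime,)
    """
    gain = np.asarray(gain)
    time = np.asarray(time)

    # True where any frequency/input changed between sample i and sample i + 1
    changed = np.any(gain[..., 1:] != gain[..., :-1], axis=(0, 1))

    # Report the time of the first sample that carries the new gains
    return time[1:][changed]
```

The comparison only sees samples that are actually present on disk, so it can only be as complete as the set of files it is run on.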

Perhaps this is due to chimestack data with a dataflag being moved from the online space to nearline? I would have thought that, since this more recent data has not been processed by the pipeline, there would not be a dataflag indicating that it is bad.

Looking back through old slack messages, it appears the move of chimestack to nearline started on June 2, 2021. The calibration times from 2020-04-01 to 2020-10-23 were created on Nov 2, 2020, so they shouldn't be affected by this move if it is indeed responsible. The file-based calibration times after 2020-10-23 are affected by this problem. Luckily we have the dataset-id-based method that we can use, although that has separate issues detailed below.

acquisition restarts

There are several cases where the dataset-id-based method stops registering a reference time. In all cases this corresponds to invalid dataset ids due to correlator or kotekan restarts:

2020-11-04 21:32:00 to 2020-11-05 03:38:42
- kotekan was apparently restarted and the query returns an invalid dataset id (gain_20201104T213200.360047Z_timing). Note that this did not correspond to a correlator restart and is not covered by a bad_calibration_fpga_restart dataflag.
2020-11-07 05:40:00 to 2020-11-08 02:59:11
- kotekan was apparently restarted and the query returns an invalid dataset id (gain_20201107T054000.496272Z_timing). Note that this did not correspond to a correlator restart and is not covered by a bad_calibration_fpga_restart dataflag.
2020-11-15 01:22:14 to 2020-11-16 02:19:32
- There was a correlator / kotekan restart and the query returns an invalid dataset id (gain_20201114T200428.578117Z_timing). This is covered by a bad_calibration_fpga_restart dataflag.
2020-11-17 18:21:29 to 2020-11-17 19:08:37
- kotekan was apparently restarted and the query returns an invalid dataset id (gain_20201117T181841.190217Z_timing). This is covered by a bad_calibration_fpga_restart dataflag.
2020-11-26 23:09:33 to 2020-11-27 01:40:37
- kotekan was apparently restarted and the query returns an invalid dataset id (gain_20201126T230630.094332Z_timing). This is covered by a bad_calibration_fpga_restart dataflag.
2020-11-29 04:44:44 to 2020-11-30 04:57:22
- kotekan was apparently restarted and the query returns an invalid dataset id (gain_20201129T043826.610238Z_timing). Note that this did not correspond to a correlator restart and is not covered by a bad_calibration_fpga_restart dataflag.

The calibration broker always multiplies the timing-only updates by the most recent point-source-based gains. For future data, we could modify the calibration broker code so that it includes the name of the source that was used in every update_id.

For past data, we could create data flags after each acquisition restart, similar to how we create data flags for fpga restarts. Alternatively, we could change the code so that as it is moving forward through the times, if it encounters an invalid update it replaces it with the last valid update. This is probably a safe assumption for the more recent data for which the dataset-id-based method will be used. This would not work within the daily pipeline if the acquisition restart happened to occur near the boundary of a sidereal day. However, I suspect we may use this code for other stability analyses where larger spans of time are processed and for which this method should work.
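
A minimal sketch of the "replace an invalid update with the last valid update" idea, assuming the reference times are already in a 1D array alongside a boolean validity mask. `fill_invalid_forward` is a hypothetical name and this is not the existing ch_util code.

```python
import numpy as np

def fill_invalid_forward(reftime, valid):
    """Replace invalid entries with the most recent valid entry.

    Entries before the first valid one are left as NaN.
    """
    reftime = np.asarray(reftime, dtype=np.float64)
    valid = np.asarray(valid, dtype=bool)

    # Index of the most recent valid sample at each position (-1 if none yet)
    idx = np.where(valid, np.arange(valid.size), -1)
    last_valid = np.maximum.accumulate(idx)

    filled = np.full(reftime.shape, np.nan)
    has_prev = last_valid >= 0
    filled[has_prev] = reftime[last_valid[has_prev]]
    return filled
```

Samples before the first valid update in the span stay NaN, which is exactly the sidereal-day-boundary case: if the restart falls at the start of the span being processed, there is nothing to fill from.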

Change in conventions

The second page of the pdf file indicates that occasionally there are very brief intervals where the reference time in the file is one day larger than the reference time from the dataset id. The intervals here are actually single time samples that in all cases correspond to the start of a transition. The file-based method includes the first sample in the transition and sets reftime to the new update and reftime_prev to the old update. The dataset-id-based method does not include the first sample in the transition and sets reftime to the old update and reftime_prev to NaN.

In the end this difference in conventions does not matter. The first sample of the transition will be set to the old update in either case because of how the interpolation factor is defined.
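
To illustrate why the two conventions agree at that sample, here is a sketch that assumes a linear interpolation factor running from 0 at interp_start to 1 at interp_stop. That form, the helper name `update_weights`, and the example numbers are assumptions for illustration, not taken from the pipeline code.

```python
import numpy as np

def update_weights(t, reftime_prev, interp_start, interp_stop):
    """Return (weight on the reftime update, weight on the reftime_prev update)."""
    if np.isnan(reftime_prev):
        # No transition registered: the reftime update gets all of the weight
        return 1.0, 0.0
    # Assumed linear factor: 0 at interp_start, 1 at interp_stop
    coeff = float(np.clip((t - interp_start) / (interp_stop - interp_start), 0.0, 1.0))
    return coeff, 1.0 - coeff

# Hypothetical transition lasting 300 s, evaluated at its first sample
t0, t1 = 5000.0, 5300.0
old, new = 1000.0, 2000.0

# File-based convention: reftime = new, reftime_prev = old
w_new_file, w_old_file = update_weights(t0, old, t0, t1)

# Dataset-id-based convention: reftime = old, reftime_prev = NaN
w_old_dset, _ = update_weights(t0, np.nan, np.nan, np.nan)
```

In both conventions the old update carries all of the weight at the first sample of the transition.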

I think it would be good to use the same convention just to avoid confusion in the future. However, it would take me a while to understand the indexing in get_reference_times_file to the point that I could fix it to match the result of get_reference_times_dataset_id.

reftime_prev

previously discussed problems

The previously described problem with the file-based method due to missing chimestack files shows up in the reftime_prev quantity around 2020-11-01. The problem with the dataset-id-based method due to acquisition restarts shows up around 2020-11-26.

transitional updates not registering new gain ids

Before 2020-11-01, there are a large number of instances where the file-based method identifies a reftime_prev, but the dataset-id-based method does not. I have investigated one such instance on 2020-10-23. For some reason, the "gain ids" are not changing for transitional gain updates. The chimegain acquisition contains the transitional gain updates and the gain dataset in the chimestack acquisition is transitioning in the expected way, however the "gain id" is not changing during each transition. This problem does not show up after 2020-11-01, so perhaps this is some initial bug during rollout that was subsequently fixed?

interp_start

previously discussed problems

In cases where we are transitioning to a new update from a previous "invalid" update (due to correlator/acquisition restart), the dataset-id-based method uses the time of the most recent valid non-interpolated update as the interpolation start time.

I think this is a bug and we should instead use the time of the most recent valid OR invalid non-interpolated update.

interp_stop

problem identifying full transition with file-based method

The calibration time file covering 2020-04-01 to 2020-10-23 was created on Nov 2, 2020 and the calibration time file covering 2020-10-23 to 2020-12-31 was created only a few days ago. I seem to have lost the exact method that I was using for the former, and as a result I was unable to properly identify all gain changes in the gain dataset of the chimestack files for the latter. I am still trying to figure this out, but as of right now the interp_stop time ends up being roughly 2 minutes short compared to the dataset-id-based method and the expected transition duration (5 minutes).

ssiegelx commented 2 years ago

Here are the same set of plots after implementing the suggested changes:

file_database_comparison_fix_float_fit_valid_expanded_with_changes.pdf

Many of the discrepancies have been resolved. The two methods are still using different conventions for reftime, which results in the 1 hour difference at the start of each update.

There is still one case around 2020-11-27 where the database-based method gives the incorrect reftime and reftime_prev. In this case, there was a correlator restart, so the update_id was invalid, and the restart was preceded by a deployment, so the chimestack was not archived. As a result, the database-based method uses the reference time from the last valid update pre-deployment. However, that interval is covered by a bad_calibration_fpga_restart flag and would be excluded.

jrs65 commented 2 years ago

Should all be done now @ssiegelx. I've applied all your fixes. Thanks!

chime-experiment / ch_util