Closed jrs65 closed 2 years ago
_I posted this comment in the wrong PR. Deleting it from the ch_pipeline
PR and moving it here._
I've generated a new file that contains calibration times between 2018/09/02 and 2020/12/30. In the process, I fixed a problem with past files where the tref quantity was converted to a float32 during processing, which resulted in up to a 60 second difference from the true transit time.
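To illustrate the size of the float32 rounding error, here is a minimal sketch. It assumes tref is stored as a Unix timestamp in seconds, which is not stated above; the helper name to_float32 is made up for this example.

```python
import struct
from datetime import datetime, timezone


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Assumption: times are Unix timestamps in seconds.
t_start = datetime(2018, 9, 2, tzinfo=timezone.utc).timestamp()

# Scan a few minutes of one-second samples and record the rounding error.
errors = [abs(to_float32(t_start + dt) - (t_start + dt)) for dt in range(256)]
max_error = max(errors)
print(f"largest float32 rounding error: {max_error} s")
```

For timestamps in this range, adjacent float32 values are 128 s apart, so the rounding error can reach about a minute, consistent with the discrepancy described above.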
The file is currently saved on cedar at
/scratch/ssiegel/chime/calibration_times/20180902_20201230_calibration_times.h5
Previously I saved these in the project space at
/project/rpp-chime/chime/chime_processed/gain/
but it looks like I am no longer owner of that directory. Would you be able to return me as owner or grant me write privileges?
Here is the result of comparing the output of get_reference_times_file and get_reference_times_dataset_id for the periods where they overlap:
file_database_comparison_fix_float_fit_valid_expanded_v2.pdf
reftime, reftime_prev, interp_start, and interp_stop as a function of time. The file-based method is shown in black and the dataset-id-based method is shown in red.
reftime and reftime_prev (top) and interp_stop and interp_start (bottom).
Below I summarize the discrepancies in the quantities being output.
There are two cases between 2020-10-22 and 2020-11-01 where the dataset-id-based method moves on to the next update, but the file-based method gets "stuck" on an old update.
The generation of the calibration time file involves searching the chimestack acquisitions for times where the gains change from one sample to the next, and then matching the gains at those times to the gains saved in the chimegain acquisitions. For some reason there are files missing from the 20201014T233944Z_chimestack_corr acquisition in the chime_online space. Some of these files contain gain transitions, and so those transitions are never registered during generation of the file. This is illustrated in the following figure that shows the start time of all available files in the chime_online space (blue) and compares them to the gain update times (red):
Perhaps this is due to chimestack data with a dataflag being moved from the online space to nearline? I would have thought that, since this more recent data has not been processed by the pipeline, there would not be a dataflag indicating that it is bad.
Looking back through old Slack messages, it appears the move of chimestack to nearline started on June 2, 2021. The calibration times from 2020-04-01 to 2020-10-23 were created on Nov 2, 2020, so they shouldn't be affected by this move if it is indeed responsible. The file-based calibration times after 2020-10-23 are affected by this problem. Luckily we have the dataset-id-based method that we can use, although that has separate issues detailed below.
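The transition search described above can be sketched roughly as follows. This is not the code used to generate the file: the function names, the use of plain tuples for gains, and exact-equality matching against the saved updates are all assumptions made for illustration.

```python
def find_gain_transitions(times, gains):
    """Return (time, gain) for every sample whose gain differs from the previous sample."""
    return [
        (t, cur)
        for t, prev, cur in zip(times[1:], gains, gains[1:])
        if cur != prev
    ]


def match_updates(transitions, saved_updates):
    """Match each transition to a saved update by exact gain equality.

    saved_updates maps a (hypothetical) update id to its gain tuple.
    Transitions with no matching saved gain map to None.
    """
    lookup = {gain: update_id for update_id, gain in saved_updates.items()}
    return [(t, lookup.get(gain)) for t, gain in transitions]


# Toy data: five samples, with the gain changing at t=20 and t=40.
times = [0, 10, 20, 30, 40]
gains = [(1.0, 2.0), (1.0, 2.0), (1.5, 2.5), (1.5, 2.5), (2.0, 3.0)]
saved = {"update_a": (1.5, 2.5), "update_b": (2.0, 3.0)}

matched = match_updates(find_gain_transitions(times, gains), saved)
print(matched)
```

Because the search only sees samples that are present, a transition is only found at its true time if the samples on either side of it are available.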
There are several cases where the dataset-id-based method stops registering a reference time. In all cases this corresponds to invalid dataset ids due to correlator or kotekan restarts:
- gain_20201104T213200.360047Z_timing. Note that this did not correspond to a correlator restart and is not covered by a bad_calibration_fpga_restart dataflag.
- gain_20201107T054000.496272Z_timing. Note that this did not correspond to a correlator restart and is not covered by a bad_calibration_fpga_restart dataflag.
- gain_20201114T200428.578117Z_timing. This is covered by a bad_calibration_fpga_restart dataflag.
- gain_20201117T181841.190217Z_timing. This is covered by a bad_calibration_fpga_restart dataflag.
- gain_20201126T230630.094332Z_timing. This is covered by a bad_calibration_fpga_restart dataflag.
- 2020-11-29 4:44:44 to 2020-11-30 04:57:22 (gain_20201129T043826.610238Z_timing). Note that this did not correspond to a correlator restart and is not covered by a bad_calibration_fpga_restart dataflag.
The calibration broker always multiplies the timing-only updates by the most recent point-source-based gains. For future data, we could modify the calibration broker code so that it includes the name of the source that was used in every update_id.
For past data, we could create data flags after each acquisition restart, similar to how we create data flags for fpga restarts. Alternatively, we could change the code so that as it is moving forward through the times, if it encounters an invalid update it replaces it with the last valid update. This is probably a safe assumption for the more recent data for which the dataset-id-based method will be used. This would not work within the daily pipeline if the acquisition restart happened to occur near the boundary of a sidereal day. However, I suspect we may use this code for other stability analyses where larger spans of time are processed and for which this method should work.
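A minimal sketch of the "replace an invalid update with the last valid update" option, assuming invalid updates show up as NaN reference times in a time-ordered sequence. The function name is hypothetical and this is not the existing code.

```python
import math


def fill_invalid_updates(reftimes):
    """Moving forward in time, replace NaN entries with the last valid value seen."""
    filled = []
    last_valid = math.nan  # nothing to fall back on until a valid update is seen
    for value in reftimes:
        if not math.isnan(value):
            last_valid = value
        filled.append(last_valid)
    return filled


# NaN marks samples whose update was invalid (e.g. after an acquisition restart).
example = [math.nan, 100.0, 100.0, math.nan, math.nan, 200.0]
filled = fill_invalid_updates(example)
print(filled)
```

The leading NaN in the example stays NaN, which is the boundary problem mentioned above: if the span being processed starts inside an invalid interval, there is no earlier valid update to fall back on.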
The second page of the pdf file indicates that occasionally there are very brief intervals where the reference time in the file is one day larger than the reference time from the dataset id. The intervals here are actually single time samples that in all cases correspond to the start of a transition. The file-based method includes the first sample in the transition and sets reftime to the new update and reftime_prev to the old update. The dataset-id-based method does not include the first sample in the transition and sets reftime to the old update and reftime_prev to NaN.
In the end this difference in conventions does not matter. The first sample of the transition will be set to the old update in either case because of how the interpolation factor is defined.
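To make that concrete, here is a sketch under two assumptions that are not confirmed above: the interpolation factor is a linear ramp from 0 at interp_start to 1 at interp_stop, and the first sample of the transition falls exactly at interp_start. The function name and the numbers are made up for illustration.

```python
import math


def applied_update(t, reftime, reftime_prev, interp_start, interp_stop):
    """Return {reference time: weight} for the update(s) in effect at time t.

    Assumes a linear interpolation factor, clipped to [0, 1].
    """
    if math.isnan(reftime_prev):
        # No previous update registered: the current update is used as-is.
        return {reftime: 1.0}
    coeff = (t - interp_start) / (interp_stop - interp_start)
    coeff = min(max(coeff, 0.0), 1.0)
    weights = {reftime_prev: 1.0 - coeff, reftime: coeff}
    return {ref: w for ref, w in weights.items() if w > 0.0}


old, new = 1000.0, 87164.0  # made-up reference times of the old and new updates
t0 = 90000.0                # first sample of the transition (assumed == interp_start)

# File-based convention: reftime is the new update, reftime_prev is the old one.
file_based = applied_update(t0, new, old, t0, t0 + 300.0)

# Dataset-id-based convention: reftime is the old update, reftime_prev is NaN.
dataset_id_based = applied_update(t0, old, math.nan, math.nan, math.nan)

print(file_based, dataset_id_based)
```

Under these assumptions both conventions assign the full weight to the old update at the first sample, so the applied calibration is the same.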
I think it would be good to use the same convention just to avoid confusion in the future, but it would take me a while to understand the indexing in get_reference_times_file to the point that I could fix it to match the result of get_reference_times_dataset_id.
The previously described problem with the file-based method due to missing chimestack files shows up in the reftime_prev quantity around 2020-11-01. The problem with the dataset-id-based method due to acquisition restarts shows up around 2020-11-26.
Before 2020-11-01, there are a large number of instances where the file-based method identifies a reftime_prev, but the dataset-id-based method does not. I have investigated one such instance on 2020-10-23. For some reason, the "gain ids" are not changing for transitional gain updates. The chimegain acquisition contains the transitional gain updates and the gain dataset in the chimestack acquisition is transitioning in the expected way, but the "gain id" is not changing during each transition. This problem does not show up after 2020-11-01, so perhaps this is some initial bug during rollout that was subsequently fixed?
In cases where we are transitioning to a new update from a previous "invalid" update (due to correlator/acquisition restart), the dataset-id-based method uses the time of the most recent valid non-interpolated update as the interpolation start time.
I think this is a bug and we should instead use the time of the most recent valid OR invalid non-interpolated update.
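The difference between the current and proposed behaviour can be sketched as below. The tuple layout (time, is_valid, is_interpolated) and the function name are hypothetical; the real code works from dataset ids rather than a list like this.

```python
def interp_start_time(previous_updates, require_valid):
    """Pick the interpolation start time from updates preceding the new one.

    previous_updates: list of (time, is_valid, is_interpolated), sorted by time.
    require_valid=True mimics the current behaviour described above;
    require_valid=False is the proposed behaviour.
    Returns None if no suitable update exists.
    """
    for time, is_valid, is_interpolated in reversed(previous_updates):
        if is_interpolated:
            continue
        if require_valid and not is_valid:
            continue
        return time
    return None


history = [
    (100.0, True, False),   # valid, non-interpolated update
    (150.0, True, True),    # transitional (interpolated) update
    (200.0, False, False),  # invalid update, e.g. after a restart
]

current = interp_start_time(history, require_valid=True)
proposed = interp_start_time(history, require_valid=False)
print(current, proposed)
```

In this toy history the current behaviour reaches back to the update at 100.0, while the proposed behaviour starts the interpolation at the more recent invalid update at 200.0.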
The calibration time file covering 2020-04-01 to 2020-10-23 was created on Nov 2, 2020 and the calibration time file covering 2020-10-23 to 2020-12-31 was created only a few days ago. I seem to have lost the exact method that I was using for the former and as a result I was unable to properly identify all gain changes in the gain dataset of the chimestack files for the latter. I am still trying to figure this out, but as of right now the interp_stop time ends up being roughly 2 minutes short compared to the dataset-id-based method and the expected transition duration (5 minutes).
Here are the same set of plots after implementing the suggested changes:
file_database_comparison_fix_float_fit_valid_expanded_with_changes.pdf
Many of the discrepancies have been resolved. The two methods are still using different conventions for reftime, which results in the 1 hour difference at the start of each update.
There is still one case around 2020-11-27 where the database-based method gives the incorrect reftime and reftime_prev. In this case, there was a correlator restart, so the update_id was invalid, and before the restart was a deployment, so the chimestack was not archived. As a result, the database-based method uses the reference time from the last valid update pre-deployment. However, that interval is covered by a bad_calibration_fpga_restart flag and would be excluded.
Should all be done now @ssiegelx. I've applied all your fixes. Thanks!
I've added two routines to look up the calibration time: one from files (just pulled out of ch_pipeline) and one from the dataset IDs (new).
I need to test this a bit more, but I think it's mostly done.