e-merlin / eMERLIN_CASA_pipeline

This is the CASA eMERLIN pipeline to calibrate data from the e-MERLIN array. Please fork the repository before making any changes and read the Coding Practices page in the wiki. Please report problems with the pipeline in the issues tab.

Speed up calibration for unaveraged data #99

Open. jradcliffe5 opened this issue 6 years ago

jradcliffe5 commented 6 years ago

Hi @jmoldon,

I had a thought on a possible way to speed up calibration for unaveraged data. Could you derive the calibration tables on averaged data (which should make the solutions much quicker to compute) and then apply those solutions to the unaveraged data at the end? Of course the process would use extra disk space, since you would have to keep both the unaveraged and averaged data, and we would have to think about which steps (e.g. flagging) would need to be run on both data sets.
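For concreteness, here is a minimal sketch of the idea, written for CASA 6 style imports; all file, field and reference-antenna names are hypothetical placeholders, not the pipeline's actual defaults. The point is just: average, solve on the averaged copy, apply to the original.

```python
# Minimal sketch of the averaged-calibration idea. The casatasks imports are
# CASA 6 style; file/field/refant names below are hypothetical placeholders.
from casatasks import mstransform, gaincal, applycal

raw_ms = 'target_unaveraged.ms'   # full-resolution data (kept on disk)
avg_ms = 'target_averaged.ms'     # smaller averaged copy used only for solving

# 1. Make an averaged copy (this is the extra disk space mentioned above).
mstransform(vis=raw_ms, outputvis=avg_ms, datacolumn='data',
            chanaverage=True, chanbin=8,       # e.g. average 8 channels together
            timeaverage=True, timebin='8s')    # e.g. 8-second time averaging

# 2. Derive solutions on the averaged copy (fewer visibilities, so faster).
gaincal(vis=avg_ms, caltable='avg_phase.G',
        field='1331+305',                      # example calibrator field
        solint='int', refant='Mk2',            # example reference antenna
        gaintype='G', calmode='p')

# 3. Apply the same table to the unaveraged data at the end.
applycal(vis=raw_ms, gaintable=['avg_phase.G'],
         interp=['linear'], calwt=True)
```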

I'm sure @amsrichards will know more too!

jmoldon commented 6 years ago

Hi, these are my thoughts. It would probably speed things up; I think the VLA pipeline does something similar. Other benefits I see are:

On the other hand, the problems I see are:

So there are no fundamental problems apart from building robust logic for the process. I will think about it more, but I don't think I have time to implement a big change like this at the moment.

By the way, if you have some eMCP.log files, I would like to see them to identify bottlenecks compared to runs on other machines.

amsrichards commented 6 years ago

Hi... no, I don't really know! I think the first step would be an analysis of where our slowest steps are. The fact that we have fewer antennas but (usually) finer time sampling may make our case different from the VLA (or ALMA) one. In particular, if working with unaveraged data:

I have often used averaged ALMA continuum data to derive gain tables and applied them to un-averaged spectral line data, using two copies of the same data set, and that works fine. However, this is only with different amounts of spectral averaging, and I am careful to make sure that the channel averaging is the same for every spw (e.g. 512 input > 64 output), because CASA tends to do odd things with orphan channels. I have also noticed that when time-averaging in CASA you often get a mix of integration times, since input scans are often of variable length. mstransform is quite intelligent about not leaving orphan very short integrations: if you ask for 12-s intervals in a 27-s scan you end up with 3x9-s, whereas if the next scan is 36-s you get 3x12-s. This is handled fine by the weights, but it might cause problems in the case Javier mentioned, i.e. the second data set being flagged differently (which also applies to channels, I guess). In principle that is fixable with statwt, but you might end up with so many fixes that they eat up all the saved time...
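As a small illustration of the averaging and reweighting choices described above, here is a sketch with hypothetical file names and CASA 6 style imports; chanbin is chosen so each spw's channel count divides exactly (512 -> 64 means chanbin=8), which avoids the orphan-channel issue.

```python
# Sketch of even channel/time averaging followed by reweighting.
# The casatasks imports are CASA 6 style; file names are hypothetical.
from casatasks import mstransform, statwt

mstransform(vis='line_data.ms', outputvis='cont_avg.ms', datacolumn='data',
            chanaverage=True, chanbin=8,      # 512 input channels -> 64 output, no orphans
            timeaverage=True, timebin='12s')  # a 27-s scan becomes 3 x 9-s integrations

# If the averaged and unaveraged copies end up flagged differently, the
# weights can be recomputed afterwards, at some cost in extra run time.
statwt(vis='cont_avg.ms')
```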

Probably worth talking to the VLA/ALMA pipeline people (I am sure you have already)...