Closed: texadactyl closed this issue 3 years ago.
See the 2021-05-28 comment by @telegraphic in issue #201.

Following up on some `telescope_id` comments in #201, it should be noted that inferring `n_coarse_chan` from `telescope_id` is fraught with peril. The `telescope_id` field (even if it were an ideal indicator of the telescope) is independent of the number of coarse channels and the coarse channel bandwidth. Any use of `telescope_id` to infer `n_coarse_chan` relies on arbitrary conventions used by certain instruments at certain observatories. Such reliance on past convention makes the code prone to future failure when the conventions change or new telescopes come online. For example, coarse channels at MeerKAT can have one of three possible bandwidths, depending on which mode the telescope is operating in.
IMHO, it would be preferable to deprecate the use of `telescope_id` and move towards requiring the user/caller to specify some other datum that more directly relates to `n_coarse_chan`. `coarse_chan_bw` could be given or, alternatively, `n_fine_per_coarse` (aka `npc`). I guess the fact that `n_coarse_chan` is needed at all implies that there is something about the data that relates to the coarse channels (e.g. passband shape), so in theory it could be auto-detected, but that is a different can of worms.
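To make the suggestion concrete, here is a minimal sketch of deriving `n_coarse_chan` from explicit metadata rather than `telescope_id` conventions. The function names and the rounding tolerance are mine, not blimpy's API; the numbers in the comments are only examples.

```python
# Hedged sketch: compute n_coarse_chan from caller-supplied metadata.
# These helpers are illustrative, not part of blimpy.

def n_coarse_chan_from_bw(total_bw_mhz, coarse_chan_bw_mhz):
    """Coarse channel count from total and per-coarse-channel bandwidth."""
    n = total_bw_mhz / coarse_chan_bw_mhz
    if abs(n - round(n)) > 1e-6:
        raise ValueError("total bandwidth is not a whole number of coarse channels")
    return int(round(n))

def n_coarse_chan_from_fine(nchans, n_fine_per_coarse):
    """Coarse channel count from fine-channel counts (n_fine_per_coarse aka npc)."""
    if nchans % n_fine_per_coarse:
        raise ValueError("nchans is not divisible by n_fine_per_coarse")
    return nchans // n_fine_per_coarse
```

Either datum pins down `n_coarse_chan` exactly, with no dependence on which telescope or mode produced the data.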
Coarse channels have value to turbo_seti in terms of dividing up work for parallel processing (see `__split_h5()` in `data_handler.py` and `FindDoppler.search()` in `find_doppler.py`).
However, once the coarse channel count is dropped from Filterbank file and HDF5 file headers, does it still have usefulness in processing beyond dividing up work? I realise that the coarse channel number has meaning in the raw telescope data.
@telegraphic ?
@david-macmahon I re-read your comment. I understand the dilemma.
The central issue is that the "assembly line" once had the number of coarse channels (`OBSNCHAN` in the raw file header) and then threw it away in favor of the number of fine channels (`nchans` in the filterbank file header). Also thrown away was the `n_fine_per_coarse` value. Hence, there is no accurate way of obtaining or reviving the number of coarse channels.
This makes no sense at all to me since the number of coarse channels is required later on in the assembly line (search).
At some point, BL should do one of the following:
It's a shame that filterbank files lost the number-of-coarse-channels field (#1). I would favour this action, even though other tools would be impacted.
However, if a radio astronomer could suggest a way forward with #2, that would also address this issue.
Current filterbank header fields that are unused by blimpy and only mentioned in `io/sigproc.py`:

- `refdm` - reference dispersion measure (pc/cm**3)
- `period` - folding period (s)
- `nbeams` - total number of beams

The `nifs` header field is inconsistently handled and is not used for blocking/unblocking frequency lists to/from time-sample rows:
./guppi.py: fb_head["nifs"] = 1
./io/sigproc.py:# * nifs (int): number of seperate IF channels
./io/sigproc.py: 'nifs' : '<l',
./io/sigproc.py: n_ifs = h['nifs']
./io/fil_reader.py: self.n_beams_in_file = self.header['nifs'] #Placeholder for future development.
./io/fil_reader.py: n_ifs = self.header['nifs']
./io/hdf_reader.py: self.n_beams_in_file = self.header['nifs'] #Placeholder for future development.
./io/base_reader.py: selection_shape = (n_ints,int(self.header['nifs']),n_chan)
cc @telegraphic
Updating the Filterbank header to include new fields would essentially be creating a new file format. I am not keen to do that. We are moving towards using HDF5 files to hold our data. Not only does HDF5 support various forms of compression, it is also quite flexible. It is not as simple a format to use from a C programmer's (or shell/`dd` hacker's) perspective, but it is immensely richer than Filterbank. Instead of creating a new file format, we would be better off, IMHO, migrating ourselves and our users to HDF5.
The only reasons I can think of for why turbo_seti needs to know anything about coarse channels are:

1. Removing the "DC spike" at the centre of each coarse channel
2. Avoiding the rolled-off edges of each coarse channel
I'm not even sure how important these factors are. I understand that turbo_seti needs to partition the fine channels into more manageable subsets, but I don't understand why that has to be on coarse channel boundaries rather than any arbitrary number of fine channels. The two reasons I mentioned could be auto-detected if they are truly problematic enough to impede analysis.

These are actually two artifacts of the coarse channelization process. They are not desirable and are not present in all data. The so-called "DC spike" is the result of a "truncation vs rounding" shortcut (cheating) in the FPGA. The coarse channel edges are the result of using a critically sampled PFB rather than an oversampled PFB. The DC spike is not present in all our data (I think just GBT has this "feature"). The coarse channel edges are present in most of our datasets (since I think all our facilities use critically sampled PFBs), though the Parkes UWL receiver has very wide coarse channels (128 MHz), so it's less of an issue there.
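For concreteness, the DC-spike and band-edge mitigation could look roughly like the following. This is my own assumption of the approach, not turbo_seti's actual code; the 2% edge fraction is an arbitrary illustrative choice.

```python
import numpy as np

# Sketch: within each coarse channel, blank the centre bin (the "DC
# spike") and a small fraction of bins at each band edge (PFB roll-off).
# Illustrative only; edge_frac and the function itself are assumptions.

def mask_coarse_artifacts(spectrum, n_coarse_chan, edge_frac=0.02):
    """Return a boolean mask: True where fine-channel data are usable."""
    mask = np.ones(spectrum.size, dtype=bool)
    width = spectrum.size // n_coarse_chan
    n_edge = max(1, int(width * edge_frac))
    for i in range(n_coarse_chan):
        lo = i * width
        mask[lo + width // 2] = False                  # DC spike bin
        mask[lo:lo + n_edge] = False                   # lower band edge
        mask[lo + width - n_edge:lo + width] = False   # upper band edge
    return mask
```

As written, this needs `n_coarse_chan` only to locate the spike and edge positions, which is why lying about the value changes which bins are excluded.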
What happens if you lie to turbo_seti about the number of coarse channels? Does it somehow fail or produce wildly different results?
Unfortunately we can't update the filterbank file format as @david-macmahon notes. We can however add 'n_coarse_channel' to the HDF5 file, which we have not done to date, but should do.
The DC spike removal and avoiding edges are the two main reasons why the coarse channel concept exists.
My vote is to put our effort into adding n_coarse_channel to the HDF5 file, and warning users, if calc_n_coarse_chan is used, that it is unreliable and will be deprecated (eventually!).
@telegraphic
Unfortunately, we are left with an unreliable way of automatically calculating `n_coarse_chan` for the HDF5 header. I see little benefit in putting it into the header, since we can calculate the value the same way each time as long as it is consistent. However, we should add a warning message to the blimpy function. I'll get on that.
@david-macmahon
@telegraphic
Second and third thoughts (I wish we could do this discussion face to face with a whiteboard!).
You seem to be suggesting that the `n_coarse_chan` field should be written to the HDF5 header in the following circumstances:

- `n_coarse_chan` parameter supplied and converted internally to HDF5

Potential blimpy writing impacts (`write_to_hdf5` callers):
Blimpy readers too, e.g. `io/sigproc.py`, the various readers in `io`, and `Waterfall.info`.
Suppose we did all that. What about the human factor? Users/callers have to remember that if they ran `fil2h5` or `turboSETI` once before, then the next `turboSETI` execution or FindDoppler instantiation does not need the `n_coarse_chan` parameter. In non-MeerKAT settings, human behaviour suggests that they will use that parameter every time, whether manually, in a bash script, or in a Python program, just to be safe and avoid a long execution with the wrong value.
On the other hand, for MeerKAT, it is conceivable that we could add to the standard pipeline or procedure (discipline) whereby for each Filterbank file produced, we do the following as the last step in order to produce the desired HDF5 file:
fil2h5 --n_coarse_chan=N ... # requires some of the new blimpy code mentioned above
For the HDF5 user, each subsequent `turboSETI ...` execution or FindDoppler instantiation will take the `n_coarse_chan` value out of the HDF5 header, as it already has the option to do today in data_handler.py, as I mentioned on Slack. No new turbo_seti code required. No need to specify the number of coarse channels on the command line or as an argument to FindDoppler.
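The round trip could be as simple as the following sketch, using `h5py` directly. The attribute name `n_coarse_chans` is an assumption for illustration; blimpy's actual attribute naming may differ:

```python
import h5py

# Sketch of the proposed flow: fil2h5 stamps the value into the HDF5
# attributes once; later FindDoppler runs read it back, so the user
# never has to re-supply it. Names here are assumptions.

def write_n_coarse_chan(h5_path, n):
    with h5py.File(h5_path, "a") as f:       # "a" creates the file if absent
        f.attrs["n_coarse_chans"] = int(n)

def read_n_coarse_chan(h5_path):
    with h5py.File(h5_path, "r") as f:
        value = f.attrs.get("n_coarse_chans")
        return None if value is None else int(value)
```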
Side-effect: We will have 2 types of HDF5 headers in the wild.
cc @david-macmahon
I think part of my mental roadblock to including `n_coarse_chan` is that it enshrines the concept of "coarse channel" in the metadata, and it is actually derived from two other metadata points: "number of fine channels" (`nchans`) and "number of fine channels per coarse channel". This means that slicing/splicing the files in frequency space (e.g. to trim off unoccupied portions of the band edges or to combine adjacent spectra into one file) would require changing `n_coarse_chan` as well as `nchans`. For this reason I prefer the `n_fine_per_coarse` concept, but that name also enshrines the concept of "fine" and "coarse" in the name. My issue with that is that it seems very two-stage-channelization-centric. What happens if we do a 3-stage channelization?
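A toy calculation shows why the stored-vs-derived distinction matters when slicing; the numbers here are made up for illustration:

```python
# Toy illustration: a stored n_coarse_chan goes stale after trimming,
# whereas n_fine_per_coarse remains valid and the count can be recomputed.

nchans = 64 * 1024                 # toy fine-channel count
n_fine_per_coarse = 1024
n_coarse_chan = nchans // n_fine_per_coarse   # 64, derivable at any time

# Trim 8 coarse channels' worth of band edge from each side of the file:
trimmed_nchans = nchans - 2 * 8 * n_fine_per_coarse
new_n_coarse_chan = trimmed_nchans // n_fine_per_coarse   # recomputed: 48
```

A tool that updates only `nchans` while leaving a stored `n_coarse_chan` untouched would silently corrupt the metadata; deriving the count avoids that failure mode.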
One possibility to address my concerns would be to introduce generic channelization fields `nchans1` and `nchans2` (and so on) that define the channelization at each stage. For GBT 0000.fil type products, these fields would be `nchans1=512` and `nchans2=1048576`. The `nchans1` value is of little use without the total bandwidth of the observation (not just the given file), so perhaps only the final channelization stage (i.e. `nchans2` or maybe `nchans_final`) is meaningful.
Another complicating factor is what to do when the data are finely channelized, converted to power (aka "detected"), and then channel averaged to a lower spectral resolution. This is not likely to be used for the high spectral resolution product, but could be used for the other products to improve the channel isolation.
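The post-detection averaging step can be sketched as follows; this is only my reading of the idea, not an existing blimpy/turbo_seti routine:

```python
import numpy as np

# Sketch: finely channelize, detect (power), then average groups of
# adjacent fine channels down to a lower spectral resolution. Useful for
# the mid-resolution products to improve channel isolation.

def channel_average(power, navg):
    """Average every `navg` adjacent channels of a 1-D power spectrum."""
    usable = power.size - (power.size % navg)     # drop any remainder bins
    return power[:usable].reshape(-1, navg).mean(axis=1)
```

A file produced this way has yet another channel count that is neither `nchans1` nor `nchans2`, which is what makes the header bookkeeping awkward.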
@david-macmahon
As promised, an answer to "What happens if you lie to turbo_seti about the number of coarse channels? Does it somehow fail or produce wildly different results?"
In a nutshell, you get different results, depending on the number of coarse channels used. Wildly? You or Danny would be a better judge.
cc: @telegraphic
@david-macmahon @telegraphic
I just reran turbo_seti for 2 cases of Voyager data. The number of coarse channels used can make a big difference in a turbo_seti run.
Default to blimpy, which then calculates 1 coarse channel for Voyager:

find_doppler.0 INFO Top hit found! SNR 30.612128, Drift Rate -0.392226, index 739933
find_doppler.0 INFO Top hit found! SNR 245.707984, Drift Rate -0.373093, index 747929
find_doppler.0 INFO Top hit found! SNR 31.220652, Drift Rate -0.392226, index 756037
Use the true value (64) of the number of coarse channels for Voyager:

find_doppler.32 INFO Top hit found! SNR 250459.347795, Drift Rate -0.105231, index 0
find_doppler.32 INFO Top hit found! SNR 15718.665595, Drift Rate 3.979660, index 26
find_doppler.32 INFO Top hit found! SNR 15718.458163, Drift Rate -3.577867, index 16361
find_doppler.45 INFO Top hit found! SNR 153.731099, Drift Rate -0.392226, index 2653
find_doppler.45 INFO Top hit found! SNR 1237.702395, Drift Rate -0.373093, index 10649
find_doppler.46 INFO Top hit found! SNR 156.779181, Drift Rate -0.392226, index 2373
Those results seem to agree for the most part, but I'm not really sure how to compare their relative "correctness".
Another factor to consider with HDF5 files is that the `data` dataset is compressed and therefore "chunked" (an HDF5 term). I don't know whether turbo_seti loads the entire file into memory or accesses the data in subsets using HDF5's so-called "hyperslabs". If it's the latter, you will likely get better performance if the number of channels analyzed per "batch" is a multiple of the chunk size along the channel axis. For the `Voyager1.single_coarse.fine_res.h5` file, the chunk size in the channel/frequency axis is 16384:
$ h5dump -H -p -A 0 -d data Voyager1.single_coarse.fine_res.h5
HDF5 "Voyager1.single_coarse.fine_res.h5" {
DATASET "data" {
DATATYPE H5T_IEEE_F32LE
DATASPACE SIMPLE { ( 16, 1, 1048576 ) / ( 16, 1, 1048576 ) }
STORAGE_LAYOUT {
CHUNKED ( 1, 1, 16384 ) <============== Chunk size!!!
SIZE 50481355 (1.329:1 COMPRESSION)
}
FILTERS {
USER_DEFINED_FILTER {
FILTER_ID 32008
COMMENT bitshuffle; see https://github.com/kiyo-masui/bitshuffle
PARAMS { 0 3 4 0 2 }
}
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_ALLOC
VALUE 0
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_INCR
}
}
}
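A chunk-aligned read pattern could look like the sketch below. This is only the alignment idea, not turbo_seti's current behaviour; the function name and batch-size parameter are assumptions:

```python
import h5py

# Sketch: iterate over the data dataset in hyperslabs whose channel span
# is a whole multiple of the channel-axis chunk size, so each read maps
# to complete compressed chunks (e.g. chunks of (1, 1, 16384) above).

def iter_channel_batches(h5_path, batch_chans=16384):
    """Yield (start_chan, data_block) pairs with chunk-aligned batches."""
    with h5py.File(h5_path, "r") as f:
        dset = f["data"]
        n_chan = dset.shape[2]
        chunk = dset.chunks[2] if dset.chunks else batch_chans
        # round the requested batch size down to a multiple of the chunk size
        step = max(chunk, (batch_chans // chunk) * chunk)
        for lo in range(0, n_chan, step):
            yield lo, dset[:, :, lo:lo + step]   # hyperslab selection
```

Misaligned batches force the HDF5 library to decompress the same chunk more than once across adjacent reads, which is where the performance penalty comes from.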
@david-macmahon
Thanks!
HDF5 files are read/written in chunks by blimpy. Hyperslabs are not employed (hyperslab support probably didn't exist when the blimpy code was written; by whom, I'm not sure).
IMO, the current chunk-size selection methodology needs improvement. Danny agrees.
P.S. HDF5 support is alive in Julia. Hope it is in good health too.
@telegraphic @david-macmahon
I do not feel that we've made sufficient progress towards a resolution for turbo_seti, i.e. a high-level design for an accurate division of the full frequency bandwidth to use in `FindDoppler.search()` when there is no user specification.
Questions:
I put a warning in `blimpy:io:base_reader:calc_n_coarse_chan()`.
Actually, this routine is used as a fallback (default) when the number of coarse channels is unknown to the caller. As such, it is not likely to get better, since nobody so far has offered a high-level redesign. Therefore, I am closing this issue.
This function needs a review by a radio astronomer. It just caused issue #201.
Current logic: