UCBerkeleySETI / blimpy

Breakthrough Listen I/O Methods for Python
https://blimpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
48 stars 40 forks source link

blimpy:io:base_reader:calc_n_coarse_chan() needs a new design #202

Closed texadactyl closed 3 years ago

texadactyl commented 3 years ago

This function needs a review by a radio astronomer. It just caused issue #201

Current logic:

blimpy io/base_reader.py
calc_n_coarse_chan(coarse channel bandwidth)
===============================================================================
Total bandwidth = abs(f_stop - f_start), both fields are from the header).
The number of fine channels is specified in the header.
If the coarse channel bandwidth parameter is specified:
    Return the total bandwidth divided by the coarse channel bandwidth.
If the number of fine channels < 2^20 (common FFT length):
    If the telescope is GBT:
        Return the total bandwidth divided by 2.9296875.
    Else:
        Raise an error: "hires nchans not divisible by 2^20 and not GBT".
Note: At this point, the number of fine channels >= 2^20.
If the number of fine channels is evenly divisible by 2^20:
    Return the number of fine channels / 2^20.
Note: At this point, the number of fine channels is *not* evenly divisible by 2^20.
If GBT:
    Return he total bandwidth divided by 2.9296875.
Else:
    Raise an error: "not hires and not GBT".
texadactyl commented 3 years ago

See 2021-05-28 comment by @telegraphic in issue #201

david-macmahon commented 3 years ago

Following up on some telescope_id comments in #201, it should be noted that inferring n_coarse_chan from telescope_id is fraught with peril. The telescope_id field (even if it were an ideal indicator of the telescope) is independent from the number of coarse channels and the coarse channel bandwidth. Any use of telescope_id to infer n_coarse_chan is relying on arbitrary conventions used by certain instruments at certain observatories. Such reliance on past convention makes the code prone to future failure when the conventions change or new telescopes come online. For example, coarse channels at MeerKAT can have one of three possible bandwidths depending on which mode the telescope is operating in.

IMHO, it would be preferable to deprecate the use of telescope_id and move towards requiring the user/caller to specify some other datum that more dependently relates to n_coarse_chan. coarse_chan_bw could be given or, alternatively,n_fine_per_coarse (aka npc). I guess the fact that n_coarse_chan is needed at all implies that there is something about the data that relates to the coarse channels (e.g. passband shape) so in theory it could be auto-detected, but that is a different can of worms.

texadactyl commented 3 years ago

Coarse channels have value to turbo_seti in terms of dividing up work for parallel processing (see __split_h5() in data_handler.py and FindDoppler.search() in find_doppler.py).

However, once the coarse channel count is dropped from Filterbank file and HDF5 file headers, does it still have usefulness in processing beyond dividing up work? I realise that the coarse channel number has meaning in the raw telescope data.

@telegraphic ?

texadactyl commented 3 years ago

@david-macmahon Re-read your comment again. I understand the dilemma.

The central issue is that the "assembly line" once had the number of coarse channels (OBSNCHAN in the raw file header) and then threw it away in favor of the number of fine channels (nchans in the filterbank file header). Also thrown away was the n_fine_per_coarse value. Hence, there is no accurate manner of obtaining or reviving the number of coarse channels.

This makes no sense at all to me since the number of coarse channels is required later on in the assembly line (search).

At some point, BL should do one of the following:

  1. update the filterbank header to include the number of coarse channels, causing
  2. update turbo_seti data_handler.py to create a different partitioning of the collection of fine channel frequencies, based on other criteria (?) for the purpose of searching.

Its a shame that filterbank files lost the number of coarse channel field (#1). I would favour this action, even though other tools would be impacted.

However, if a radio astronomer could suggest a way forward with #2, that would also address this issue.

Current filterbank header fields that are unused by blimpy and only mentioned in io/sigproc.py:

The nifs header field is inconsistently handled and is not used for blocking/unblocking frequency lists to/from time-sample rows:

./guppi.py:        fb_head["nifs"] = 1
./io/sigproc.py:#   * nifs (int): number of seperate IF channels
./io/sigproc.py:    'nifs'         : '<l',
./io/sigproc.py:    n_ifs   = h['nifs']
./io/fil_reader.py:            self.n_beams_in_file = self.header['nifs'] #Placeholder for future development.
./io/fil_reader.py:        n_ifs   = self.header['nifs']
./io/hdf_reader.py:            self.n_beams_in_file = self.header['nifs'] #Placeholder for future development.
./io/base_reader.py:        selection_shape = (n_ints,int(self.header['nifs']),n_chan)

cc @telegraphic

david-macmahon commented 3 years ago

Updating the Filterbank header to include new fields would essentially be creating a new file format. I am not keen to do that. We are moving towards using HDF5 files to hold our data. Not only does HDF5 support various forms of compression, it is also quite flexible. It is not as simple a format to use from a C programer's (or shell/dd hacker's) perspective, but it is immensely richer than Filterbank. Instead of creating a new file format we would be better off, IMHO, migrating ourselves and our users to HDF5.

The only reasons I can think of for why turbo_seti needs to know anything about coarse channels are:

  1. "DC spike" removal
  2. Avoiding analysis of the coarse channel "edges"

I'm not even sure how important these factors are. I understand that turbo_seti needs to partition the fine channels into more manageable subsets, but I don't understand why that has to be on coarse channel boundaries rather than any arbitrary number of fine channels. The two reasons I mentioned could be auto-detected if they are truly problematic enough to impede analysis. These are actually two artifacts of the coarse channelization process. They are not desirable and are not present in all data. The so called "DC spike" is the result of a "truncation vs rounding" shortcut (cheating) in the FPGA. The coarse channel edges are the result of using a critically sampled PFB rather than on oversampled PFB. The DC spike is not present in all our data (I think just GBT has this "feature"). The coarse channel edges are present in most of our datasets (since I think all our facilities use critically sampled PFBs), though the Parkes UWL receiver has very wide coarse channels (128 MHz) so it's less of an issue there.

What happens if you lie to turbo_seti about the number of coarse channels? Does it somehow fail or produce wildly different results?

telegraphic commented 3 years ago

Unfortunately we can't update the filterbank file format as @david-macmahon notes. We can however add 'n_coarse_channel' to the HDF5 file, which we have not done to date, but should do.

The DC spike removal and avoiding edges are the two main reasons why the coarse channel concept exists.

My vote is to put our effort into adding n_coarse_channel to the HDF5 file, and warning users if calc_n_coarse_chan is used that it is unreliable all will be deprecated (eventually!).

texadactyl commented 3 years ago

@telegraphic

Unfortunately, we are left with an unreliable way of automatically calculating n_coarse_chan for the HDF5 header. I see little benefit in putting it into the header since we can calculate the value the same way each time as long as it is consistent. However, we should add a warning message to the blimpy function. On that.

@david-macmahon

texadactyl commented 3 years ago

@telegraphic

Second and third thoughts (I wish we could do this discussion face to face with a whiteboard!).

You seem to be suggesting that the n_coarse_chan field should be written to the HDF5 header in the following circumstances:

Potential blimpy writing impacts (write_to_hdf5 callers):

Blimpy readers too e.g. io/sigproc.py, the various readers in io, Waterfall.info.

Suppose we did all that. What about the human factor? Users/callers have to remember that if they ran fil2h5 or turboSETI once before, then the next turboSETI execution or FindDoppler instantiation does not need the n_coarse_chan parameter. In the non-MeerKAT settings, human behaviour suggests that they will use that parameter every time, either manually, in a bash script, or in a Python program - just to be safe in order to avoid a long execution with the wrong value.

On the other hand, for MeerKAT, it is conceivable that we could add to the standard pipeline or procedure (discipline) whereby for each Filterbank file produced, we do the following as the last step in order to produce the desired HDF5 file: fil2h5 --n_coarse_chan=N ... # requires some of the new blimpy code mentioned above

For the HDF5 user, each subsequent turboSETI ... execution or FindDoppler instantiation will take the n_coarse_chan value out of the HDF5 header as it already has the option today in data_handler.py as I mentioned on Slack. No new turbo_seti code required. No need to express the number of coarse channels on the command line or as an argument to FindDoppler.

Side-effect: We will have 2 types of HDF5 headers in the wild.

cc @david-macmahon

david-macmahon commented 3 years ago

I think part of my mental roadblock to including n_course_chan is that it enshrines the concept of "coarse channel" in the metadata and it is actually the product of two other metadata points "number of fine channels" (nchans) and "number of fine channels per coarse channel". This means that slicing/splicing the files in frequency space (e.g. to trim off unoccupied portions of the band edges or to combine adjacent spectra into one file) would require changing n_coarse_chan as well as nchans. For this reason I prefer the n_fine_per_coarse concept, but that name also enshrines the concept of "fine" and "coarse" in the name. My issue with that is that is that it seems very two-stage-channelization-centric. What happens if we do a 3 stage channelization?

One possibility to address my concerns would be to introduce generic channelization fields nchans1 and nchans2 (and so on) that define the channelization at each stage. For GBT 0000.fil type products, these fields would be nchans1=512 and nchans2=1048576. The nchans1 value is of little use without the total bandwidth of the observation (not just the given file), so perhaps only the final channelization stage (i.e. nchans2 or maybe nchans_final) is meaningful.

Another complicating factor is what to do when the data are finely channelized, converted to power (aka "detected"), and then channel averaged to a lower spectral resolution. This is not likely to be used for the high spectral resolution product, but could be used for the other products to improve the channel isolation.

texadactyl commented 3 years ago

@david-macmahon

As promised, an answer to "What happens if you lie to turbo_seti about the number of coarse channels? Does it somehow fail or produce wildly different results?"

In a nutshell, you get different results, depending on the number of coarse channels used. Wildly? You or Danny would be a better judge.

voyager_runs.zip

cc: @telegraphic

texadactyl commented 3 years ago

@david-macmahon @telegraphic

I just reran turbo_seti for 2 cases of Voyager data. The number of coarse channels used can make a big difference in a turbo_seti run.

Default to blimpy which then calculates 1 coarse channel for Voyager: find_doppler.0 INFO Top hit found! SNR 30.612128, Drift Rate -0.392226, index 739933 find_doppler.0 INFO Top hit found! SNR 245.707984, Drift Rate -0.373093, index 747929 find_doppler.0 INFO Top hit found! SNR 31.220652, Drift Rate -0.392226, index 756037

Use the true value (64) of the number of coarse channels for Voyager: find_doppler.32 INFO Top hit found! SNR 250459.347795, Drift Rate -0.105231, index 0 find_doppler.32 INFO Top hit found! SNR 15718.665595, Drift Rate 3.979660, index 26 find_doppler.32 INFO Top hit found! SNR 15718.458163, Drift Rate -3.577867, index 16361 find_doppler.45 INFO Top hit found! SNR 153.731099, Drift Rate -0.392226, index 2653 find_doppler.45 INFO Top hit found! SNR 1237.702395, Drift Rate -0.373093, index 10649 find_doppler.46 INFO Top hit found! SNR 156.779181, Drift Rate -0.392226, index 2373

david-macmahon commented 3 years ago

Those results seem to agree for the most part, but I'm not really sure how to compare their relative "correctness".

david-macmahon commented 3 years ago

Another factor to consider with HDF5 files is that the data dataset is compressed and therefore is "chunked" (an HDF5 term). I don't know whether turbo_seti loads the entire file into memory or accesses the data in subsets using HDF5's so called "hyperslabs". If it's the latter, you will likely get better performance if the number of channels analyzed per "batch" is a multiple of the chunk size along the channel axis. For the Voyager1.single_coarse.fine_res.h5 file, the chunk size in the channel/frequency axis is 16384:

$ h5dump -H -p -A 0 -d data Voyager1.single_coarse.fine_res.h5 
HDF5 "Voyager1.single_coarse.fine_res.h5" {
DATASET "data" {
   DATATYPE  H5T_IEEE_F32LE
   DATASPACE  SIMPLE { ( 16, 1, 1048576 ) / ( 16, 1, 1048576 ) }
   STORAGE_LAYOUT {
      CHUNKED ( 1, 1, 16384 )    <==============    Chuck size!!!
      SIZE 50481355 (1.329:1 COMPRESSION)
   }
   FILTERS {
      USER_DEFINED_FILTER {
         FILTER_ID 32008
         COMMENT bitshuffle; see https://github.com/kiyo-masui/bitshuffle
         PARAMS { 0 3 4 0 2 }
      }
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_ALLOC
      VALUE  0
   }
   ALLOCATION_TIME {
      H5D_ALLOC_TIME_INCR
   }
}
}
texadactyl commented 3 years ago

@david-macmahon

Thanks!

HDF5 files are read/written in chunks by blimpy. Hyperslabs are not employed (probably didn't exist when the blimpy code was written by who?).

IMO, the current chunk-size selection methodology needs improvement. Danny agrees.

P.S. HDF5 support is alive in Julia. Hope it is in good health too.

texadactyl commented 3 years ago

@telegraphic @david-macmahon

I do not feel that we've made sufficient progress towards a resolution for turbo_seti i.e. a high-level design for an accurate division of the full frequency bandwidth to use in a FindDoppler.search() when there is no user specification.

Questions:

I put a warning in blimpy:io:base_reader:calc_n_coarse_chan().

texadactyl commented 3 years ago

Actually, this routine is used as a fallback (default) when the number of coarse channels is unknown to the caller. As such, it is not likely to get better since nobody so far has offered a high-level redesign. Therefore, I am closing this issue.