Open pravirkr opened 3 years ago
Hey Pravir, finally found some time to respond to this!
I am pleased you are thinking about this. The FRB community needs a really solid software package that is well equipped for the future. So I think a refactor is well motivated, but I'm actually wondering whether it's better to start a completely new package, and port the useful code over to it?
I love the idea of having a profile, block and cube class. Can these support multiple beams and polarization too? I recently played around with xarray, which I conceptually like the idea of as a base data structure -- from which you could define a profile, cube and block.
However, there are three issues I've found for high-resolution data:
1) The 'coords' require full numpy arrays, not just a start value and step size. Here's a github discussion I started which nobody could answer.
2) I really think units should be built in, similar to astropy's unit arrays (although pint-xarray tries to add this).
3) No support for 1, 2 or 4-bit data. I really like how h5py and np.memmap() provide access to data, but the need to support low bitwidths means you can't use straightforward memory mapping.
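To make the low-bitwidth point concrete, here is a minimal sketch of why plain memory mapping fails and what the unpacking looks like (the bit ordering within a byte is an assumption; real formats differ):

```python
import numpy as np

def unpack_2bit(raw: np.ndarray) -> np.ndarray:
    """Unpack a uint8 array into 2-bit samples (4 per byte).

    A plain np.memmap view is impossible here: numpy has no 2-bit
    dtype, so each byte must be split with shifts and masks.
    """
    # Lowest 2 bits first (an assumption; real formats differ in bit order).
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    out = (raw[:, None] >> shifts) & 0b11
    return out.reshape(-1)

packed = np.array([0b11100100], dtype=np.uint8)  # encodes samples 0, 1, 2, 3
print(unpack_2bit(packed))  # [0 1 2 3]
```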
I got a bit carried away and tried to roll my own xarray-like DataArray class in a pet project called hyperseti; the method to read from a filterbank is here. This got much messier than I anticipated, so I'm not sure it's the best solution! I do however think the overall xarray concept is very attractive -- your data should have labels that describe it well (I would add units too!).
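For illustration, a labelled block along these lines might look like this (a sketch only: the dimension names and attrs-based units are my own choices, and note how the coords must be materialised as full arrays, which is the first issue above):

```python
import numpy as np
import xarray as xr

nchan, nsamp = 4, 8
data = np.random.randn(nchan, nsamp).astype(np.float32)

# The pain point: coords must be full arrays, even though
# start + step would fully describe them.
freqs = 1500.0 - 1.0 * np.arange(nchan)   # MHz, top-down band
times = 0.001 * np.arange(nsamp)          # seconds

block = xr.DataArray(
    data,
    dims=("freq", "time"),
    coords={"freq": freqs, "time": times},
    attrs={"freq_unit": "MHz", "time_unit": "s"},  # units bolted on via attrs
)
print(block.sel(freq=1498.0).shape)  # one channel: (8,)
```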
As well as profile/block/cube, should it support voltage-level products? Perhaps we could learn from baseband, which looks to have a task framework including dedispersion. (I'm not very familiar with these)
In terms of unit/coordinates/time handling, I think astropy does a decent job so would be a reasonable dependency. I like the idea of ALL arrays having units attached (i.e. as an astropy.Quantity array). I'm not as big of a fan of numba, but it is very reasonable to use it. I quite like pybind (in preference to cython), but perhaps only used sparingly (e.g. do we really need FFTW3?). cupy is easy for GPU acceleration, but not everyone has an NVIDIA GPU, so only an optional dependency if possible.
My opinion is that the sigproc header is too dated, and it doesn't support polarization, so I would suggest using a new and improved internal data structure. (Also, storing angles as a float in DDMMSS.ss is quite frankly insane.) With some thought we could come up with a more suitable list of keywords: e.g. I would prefer (f and df) to (fch1 and foff), and (t and dt) to (tstart and tsamp); or one could use the more general [var]_start and [var]_step, e.g. freq_start, time_start, to have consistency.
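As a sketch of what such a keyword scheme could look like (the names follow the *_start / *_step convention above and are illustrative, not a settled spec):

```python
from dataclasses import dataclass

# Hypothetical header using a consistent *_start / *_step convention,
# instead of sigproc's fch1/foff and tstart/tsamp.
@dataclass
class ObsHeader:
    freq_start: float   # first channel frequency (MHz)
    freq_step: float    # channel width (MHz, negative for top-down bands)
    time_start: float   # observation start (MJD)
    time_step: float    # sampling interval (s)

hdr = ObsHeader(freq_start=1500.0, freq_step=-1.0,
                time_start=59000.0, time_step=64e-6)
print(hdr.freq_start + 3 * hdr.freq_step)  # frequency of channel 3: 1497.0
```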
I guess the most important thing for adoption is that the package is easy to use for current use cases, and that it can support next-generation use cases too. The UWL and narrow-bandwidth FRBs make a good case study: sub-band searches will probably become more common. How easy would it be to pipe data into tensorflow? How about multiple beams? How about parallelization across nodes (e.g. dask), parallelization on CPUs (OpenMP) or GPU acceleration (cupy)?
It is easy to have 'feature creep' and make the task huge, so it will be important to set a scope and stick to it. What exactly is a 'single pulse toolbox'? And does this need to be high performance for FRB searches, or is usability more important?
Hi @telegraphic,
xarray seems to be the perfect replacement for subclassing ndarray. However, one of the issues I found is in the implementation. I like the existing structure of having methods added to the array itself. We can also pipe together methods, e.g.:

```python
tim = TimeSeries(data)
tim.downsample().pad().toDat()
```
I don't see a safe way to subclass xarray (the docs discourage subclassing). Instead, they suggest using accessors to extend new methods. I am not able to figure out a clean API design using accessors. Another approach would be to wrap the xarray inside the object; the array can then be accessed by tim.data.
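The wrapper approach can be sketched as follows (method names mirror the chaining example above; the downsample/pad bodies are placeholder assumptions, and a plain numpy array stands in for the wrapped xarray):

```python
import numpy as np

class TimeSeries:
    """Thin wrapper (not a subclass): the array lives in .data and
    each method returns a new TimeSeries, so calls can be chained."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float32)

    def downsample(self, factor: int = 2) -> "TimeSeries":
        # Average adjacent samples in groups of `factor`.
        n = (len(self.data) // factor) * factor
        return TimeSeries(self.data[:n].reshape(-1, factor).mean(axis=1))

    def pad(self, npad: int = 4) -> "TimeSeries":
        # Append npad zeros at the end.
        return TimeSeries(np.pad(self.data, (0, npad)))

tim = TimeSeries(np.arange(8.0))
out = tim.downsample(2).pad(4)
print(out.data)  # [0.5 2.5 4.5 6.5 0. 0. 0. 0.]
```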
For FFT, I think the better option is to use numpy's FFT and switch to pyFFTW (if available), thus removing FFTW3 as a dependency of this package.
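That optional-dependency pattern might look like this (a sketch; pyFFTW's interfaces.numpy_fft module is designed as a drop-in for numpy.fft):

```python
import numpy as np

# Prefer pyFFTW's numpy-compatible interface when installed,
# fall back to numpy.fft otherwise.
try:
    from pyfftw.interfaces import numpy_fft as fft_backend  # optional
except ImportError:
    from numpy import fft as fft_backend

x = np.random.randn(1024)
spec = fft_backend.rfft(x)
print(spec.shape)  # (513,)
```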
Re sigproc headers, yes, I agree. So, one idea could be to move all sigproc- and psrfits-related functions to the io module and have a common Header class with classmethods, which can be used as:

```python
hdr = Header.from_sigproc()
hdr = Header.from_psrfits()
```
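A minimal sketch of that Header design (the constructor names follow the example above; the parsing bodies are placeholders, not a real implementation):

```python
class Header:
    """Common metadata container; format-specific parsing lives in
    classmethod constructors."""

    def __init__(self, **keys):
        self.keys = keys

    @classmethod
    def from_sigproc(cls, filename=None):
        # Placeholder: would parse the binary sigproc header here.
        return cls(source="sigproc", filename=filename)

    @classmethod
    def from_psrfits(cls, filename=None):
        # Placeholder: would read PRIMARY/SUBINT keywords here.
        return cls(source="psrfits", filename=filename)

hdr = Header.from_sigproc("obs.fil")
print(hdr.keys["source"])  # sigproc
```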
We still use some of the sigproc cmd tools, e.g., header, seek, dedisperse, etc. Should we aim to provide these tools in this package? I got interested in the idea of updating sigproc to modern standards in sigproc2.
So, my intention behind a single-pulse toolbox is to have a reference place of robust methods to simulate and analyze single pulses/FRBs. For example: what is the S/N of a burst? There are different implementations in search packages, and to understand the S/N reported in the literature, we always have to refer to the package code. There should instead be one robust way to calculate S/N in this toolbox (e.g., a matched-filter approach @vivgastro). What is the optimized DM of the burst? Same argument again: we can have pdmp and DM_phase methods to get a robust estimation. How do we simulate a burst the right way? So, basically, I am looking for a python-based implementation of psrchive targeted towards FRBs, using the profile and block classes. We can have support for polarizations and robust TOA estimation. These numbers can then be reported in the literature and on TNS to have a uniform definition.
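As one example of what a 'reference' S/N method could look like, here is a minimal boxcar matched-filter sketch (one possible convention only; the baseline and noise estimates are deliberately naive and normalisation choices vary between packages):

```python
import numpy as np

def boxcar_snr(profile: np.ndarray, widths=(1, 2, 4, 8, 16)) -> float:
    """Maximise S/N over a set of boxcar widths: a simple
    matched-filter estimate for a 1D pulse profile."""
    prof = profile - np.median(profile)      # crude baseline removal
    std = np.std(profile)                    # noise estimate (assumes a weak pulse)
    best = 0.0
    for w in widths:
        kernel = np.ones(w) / np.sqrt(w)     # unit-norm boxcar (matched filter)
        snr = np.convolve(prof, kernel, mode="same").max() / std
        best = max(best, snr)
    return best

rng = np.random.default_rng(42)
prof = rng.normal(0.0, 1.0, 1024)
prof[500:508] += 5.0                          # injected 8-bin pulse
print(round(boxcar_snr(prof), 1))
```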
I now think this demands a separate package, which should have the block and profile classes and most of the psrchive methods. We can then also have a standard format to store these data (similar to .ar files).
It seems you came to the same conclusion about xarray! tim.data seems reasonable to me.
Using from_sigproc and from_psrfits for the header sounds good to me.
PyFFTW (or the cupy FFT) are easy replacements.
I love that sigproc2 is PravirK's updates to Evan Keane's fork of Michael Keith's release of Duncan Lorimer's original SIGPROC. Maybe some similar command line tools but with different names to avoid name collisions? e.g. spt_seek and spt_header.
Adding 'publication quality plots' to the list as you didn't explicitly mention it 📈
ok, I have moved the single-pulse toolbox idea (block/profile, psrchive, simulate, plots) to a different package burstpy for now.
As we expand the sigpyproc codebase with other formats, a restructuring is required (to v0.6 with #16). Based on this discussion, this is my roadmap for sigpyproc v1.0.
- PSRFITS search-mode support (using astropy): header in sigpyproc.io.psrfits and FitsReader in sigpyproc.readers (PSRFITS)
- sigproc command-line utilities (e.g., seek) and visualization tools.

@telegraphic Re PSRFITS support, there are already several implementations (yours, PulsarDataToolbox, etc.). Why re-write things again if the definitions are fixed? Instead, we can add one of the implementations as a dependency.
I like the baseband_tasks.io.psrfits framework (also baseband_tasks.io.hdf5), and it will be easy to adapt here; however, it supports only fold-mode. I think the search-mode addition will be easy to implement, so I will try for a PR there. All we need are robust wrappers around astropy.io.fits to read header keywords and data.
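A minimal sketch of such a wrapper round-trip with astropy.io.fits (the file here is a toy carrying two PSRFITS-style keywords, not a real archive):

```python
from astropy.io import fits

# Write a minimal PSRFITS-like header, then read the keywords back.
# OBS_MODE and TELESCOP are standard PSRFITS keyword names.
hdu = fits.PrimaryHDU()
hdu.header["OBS_MODE"] = "SEARCH"
hdu.header["TELESCOP"] = "Parkes"
hdu.writeto("toy.fits", overwrite=True)

with fits.open("toy.fits") as hdul:
    hdr = dict(hdul[0].header)
print(hdr["OBS_MODE"])  # SEARCH
```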
Hi, I am starting this thread to discuss plans for sigpyproc.
Current work
I am refactoring the code in the packaging branch based on PEP8 and the new type hints, adding more abstraction and making the dynamic header class more strictly structured. This is going to break the existing API (function names changed to lower case, etc.). Another addition would be to refactor some of the existing code into 3 classes: profile (for 1D pulse profile), block (for 2D freq-time spectrum) and cube (for folded data), similar to psrchive. Also, I will be adding robust S/N estimation (using the pdmp approach).

Future work
FRB simulator
I have plans to integrate @vivgastro Furby as a class inside sigpyproc (with some additional features and support for UWL-like frequency bands). This will complete sigpyproc as a Single-pulse toolbox in the sense that it can generate data/pulses as well as search, visualize and measure properties of those pulses.
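A toy version of what such a simulator does, just to fix ideas (parameter names are illustrative, not Furby's API): disperse a Gaussian pulse across frequency channels with the cold-plasma delay law.

```python
import numpy as np

def disperse_pulse(nchan=64, nsamp=512, dm=100.0, f_top=1500.0,
                   df=-1.0, tsamp=1e-3, t0=0.05, width=2e-3):
    """Toy dispersed Gaussian pulse: per-channel delay from the
    cold-plasma dispersion law (4.15 ms per DM unit at 1 GHz)."""
    freqs = f_top + df * np.arange(nchan)            # MHz
    delays = 4.15e-3 * dm * (freqs / 1e3) ** -2      # seconds
    delays -= delays[0]                              # relative to top channel
    t = tsamp * np.arange(nsamp)
    # Gaussian pulse in each channel, shifted by the dispersion delay.
    return np.exp(-0.5 * ((t[None, :] - t0 - delays[:, None]) / width) ** 2)

block = disperse_pulse()
print(block.shape)  # (64, 512)
```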
PSRFITS support
As @telegraphic suggested, it would be nice to support other formats (e.g., PSRFITS, HDF5). I think we can add support to read those formats into the existing sigpyproc framework. I am not sure if we should also have a unified header (e.g., all PSRFITS keywords) or a writer class, as all these formats (at least sigproc and PSRFITS) are completely different. Also, there are existing packages like yours working towards this. IMO we should keep the header keywords (~25) defined in the sigproc docs as the base of this package and read other format files into this framework. For example, we can have:
```python
from sigpyproc.Readers import FitsReader
```

with all the functionalities of FilReader.

Roadmap
Should we move towards an entirely python-based package? Most of the C++ code (running mean/median, FFT) can easily be accessed using NumPy and SciPy
. One issue might be speed and multi-threading, but that can be compensated for using Numba
. @telegraphic

We can revive the FRBs/sigproc project to have a fully modern C++ and sigpyproc-like object-oriented framework with proper documentation. The codebase there is very old and can be easily condensed using modern third-party libraries. @evanocathain
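On the NumPy/SciPy point: the C++ running-median baseline removal can be expressed with scipy.ndimage directly (a sketch, not benchmarked; a Numba-jitted loop would be the fallback if this proves too slow on long time series):

```python
import numpy as np
from scipy.ndimage import median_filter

def remove_baseline(tim: np.ndarray, window: int = 101) -> np.ndarray:
    """Subtract a running-median baseline from a 1D time series."""
    baseline = median_filter(tim, size=window, mode="nearest")
    return tim - baseline

x = np.arange(1000, dtype=np.float64)   # pure linear trend
flat = remove_baseline(x)
print(np.abs(flat[100:900]).max())      # 0.0 away from the edges
```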