FRBs / sigpyproc3

Python3 version of Ewan Barr's sigpyproc library
https://sigpyproc3.readthedocs.io
MIT License

Roadmap discussion #15

Open pravirkr opened 3 years ago

pravirkr commented 3 years ago

Hi, I am starting this thread to discuss plans for sigpyproc.

Current work

I am refactoring the code in the packaging branch based on PEP 8 and the new type hints, adding more abstraction and making the dynamic header class more strictly structured. This will break the existing API (function names changed to lower case, etc.). Another addition would be to refactor some of the existing code into three classes, similar to psrchive: profile (for a 1-D pulse profile), block (for a 2-D freq-time spectrum) and cube (for folded data). I will also add robust S/N estimation (using the pdmp approach).
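As a rough sketch of how the proposed profile and block classes might relate (class and method names here are illustrative, not the actual sigpyproc API), with a boxcar-matched S/N in the spirit of the pdmp approach:

```python
import numpy as np

class Profile:
    """1-D pulse profile (hypothetical sketch of the proposed class)."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float64)

    def snr(self, width):
        """Boxcar-matched S/N with a robust (MAD-based) noise estimate."""
        baseline = np.median(self.data)
        noise = 1.4826 * np.median(np.abs(self.data - baseline))
        # ones/sqrt(width) keeps the post-convolution noise variance unchanged
        kernel = np.ones(width) / np.sqrt(width)
        conv = np.convolve(self.data - baseline, kernel, mode="same")
        return conv.max() / noise

class Block:
    """2-D frequency-time spectrum that collapses to a Profile."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float64)  # (nchan, nsamp)

    def to_profile(self):
        return Profile(self.data.mean(axis=0))
```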

Future work

FRB simulator

I plan to integrate @vivgastro's Furby as a class inside sigpyproc (with some additional features and support for UWL-like frequency bands). This would complete sigpyproc as a single-pulse toolbox, in the sense that it could generate data/pulses as well as search, visualize and measure the properties of those pulses.

PSRFITS support

As @telegraphic suggested, it would be nice to support other formats (e.g., PSRFITS, HDF5). I think we can add support to read those formats into the existing sigpyproc framework. I am not sure whether we should also have a unified header (e.g., all PSRFITS keywords) or a writer class, as all these formats (at least sigproc and PSRFITS) are completely different. Also, there are existing packages, like yours, working towards this. IMO we should keep the header keywords (~25) defined in the sigproc docs as the base of this package and read other formats' files into this framework.

For example, we could have from sigpyproc.Readers import FitsReader with all the functionality of FilReader.
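A minimal sketch of what a shared reader interface could look like, so that a hypothetical FitsReader mirrors FilReader (the class and method names here are illustrative only, not the existing sigpyproc API):

```python
from abc import ABC, abstractmethod

class Reader(ABC):
    """Common interface so FitsReader can mirror FilReader (sketch only)."""

    @abstractmethod
    def read_block(self, start, nsamps):
        """Return a 2-D (nchan, nsamps) block starting at sample `start`."""

class FilReader(Reader):
    def __init__(self, filename):
        self.filename = filename

    def read_block(self, start, nsamps):
        raise NotImplementedError("parse sigproc header + data here")

class FitsReader(Reader):
    def __init__(self, filename):
        self.filename = filename

    def read_block(self, start, nsamps):
        raise NotImplementedError("read PSRFITS SUBINT rows here")
```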

Roadmap

Should we move towards an entirely python-based package? Most of the C++ code (running mean/median, FFT) can easily be replicated using NumPy and SciPy. One issue might be speed and multi-threading, but that can be compensated for using Numba. @telegraphic
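For the running median specifically, a pure-NumPy version is straightforward; this sketch uses a zero-copy sliding window view, and it is exactly the kind of hot loop where Numba could claw back the C++ speed:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def running_median(x, window):
    """Running median via a sliding window view.

    A pure-NumPy stand-in for the C++ implementation. The view itself is
    zero-copy, but np.median still reduces over O(n * window) elements --
    the part where a numba-jitted loop (or bottleneck) would win on speed.
    """
    pad = window // 2
    padded = np.pad(x, pad, mode="edge")
    view = sliding_window_view(padded, window)[: len(x)]
    return np.median(view, axis=1)
```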

We could revive the FRBs/sigproc project to provide a fully modern C++, sigpyproc-like object-oriented framework with proper documentation. The codebase there is very old and could easily be condensed using modern third-party libraries. @evanocathain

telegraphic commented 3 years ago

Hey Pravir, finally found some time to respond to this!

I am pleased you are thinking about this. The FRB community needs a really solid software package that is well equipped for the future. So I think a refactor is well motivated, but I'm actually wondering whether it would be better to start a completely new package and port the useful code over to it?

Data arrays

I love the idea of having profile, block and cube classes. Can these support multiple beams and polarization too? I recently played around with xarray, which I conceptually like the idea of as a base data structure -- from which you could define a profile, cube and block.

However, there are three issues I've found for high-resolution data: 1) the 'coords' require full numpy arrays, not just a start value and step size (here's a github discussion I started which nobody could answer); 2) I really think units should be built in, similar to astropy's unit arrays (although pint-xarray tries to add this); 3) no support for 1-, 2- or 4-bit data. I really like how h5py and np.memmap() provide access to data, but the need to support low bitwidths means you can't use straightforward memory mapping.
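On point 3, the low-bitwidth problem comes down to sample boundaries not being byte boundaries, so a memory map alone can't address individual samples. A small illustration of 2-bit unpacking (the MSB-first bit order here is an assumption for the sketch; real formats differ):

```python
import numpy as np

def unpack_2bit(raw):
    """Unpack 2-bit samples (4 per byte, MSB first) to uint8.

    This is why plain np.memmap is not enough for 1-, 2- and 4-bit
    data: each on-disk byte holds several samples, so an unpacking
    step has to sit between the mapped bytes and the user's array.
    """
    raw = np.asarray(raw, dtype=np.uint8)
    out = np.empty(raw.size * 4, dtype=np.uint8)
    out[0::4] = (raw >> 6) & 0x03
    out[1::4] = (raw >> 4) & 0x03
    out[2::4] = (raw >> 2) & 0x03
    out[3::4] = raw & 0x03
    return out
```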

I got a bit carried away and tried to roll my own xarray-like DataArray class in a pet project called hyperseti; the method to read from a filterbank is here. This got much messier than I anticipated, so I'm not sure it's the best solution! I do, however, think the overall xarray concept is very attractive -- your data should have labels that describe it well (I would add units too!).

As well as profile/block/cube, should it support voltage-level products? Perhaps we could learn from baseband, which looks to have a task framework including dedispersion. (I'm not very familiar with these)

Other Dependencies

In terms of unit/coordinate/time handling, I think astropy does a decent job, so it would be a reasonable dependency. I like the idea of ALL arrays having units attached (i.e., as astropy.Quantity arrays). I'm not as big a fan of numba, but it is very reasonable to use it. I quite like pybind (in preference to cython), but perhaps only used sparingly (e.g., do we really need FFTW3?). cupy is easy for GPU acceleration, but not everyone has an NVIDIA GPU, so it should only be an optional dependency if possible.
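The usual pattern for keeping cupy optional is an import-time backend switch; a minimal sketch (the `xp` alias is just a common convention, not anything these packages mandate):

```python
# Optional GPU backend: fall back to NumPy when cupy (or an NVIDIA
# GPU) is unavailable. Code written against `xp` runs on either.
try:
    import cupy as xp  # GPU path
    GPU_AVAILABLE = True
except ImportError:
    import numpy as xp  # CPU fallback, always available
    GPU_AVAILABLE = False

def normalise(data):
    """Zero-mean, unit-std normalisation on whichever backend is active."""
    data = xp.asarray(data, dtype=xp.float32)
    return (data - data.mean()) / data.std()
```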

Internal data structure

My opinion is that the sigproc header is too dated, and it doesn't support polarization, so I would suggest a new and improved internal data structure. (Also, storing angles as a float in DDMMSS.ss is quite frankly insane.) With some thought we could come up with a more suitable list of keywords: e.g., I would prefer (f and df) to (fch1 and foff), and (t and dt) to (tstart and tsamp); or one could use more general [var]_start and [var]_step names, e.g., freq_start and time_start, for consistency.
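To make the keyword-renaming idea concrete, here is a hypothetical mapping (internal names illustrative, not a settled scheme) plus a converter for the packed DDMMSS.ss angles:

```python
# Hypothetical mapping from sigproc header keywords to the proposed
# [var]_start / [var]_step scheme.
SIGPROC_TO_INTERNAL = {
    "fch1": "freq_start",
    "foff": "freq_step",
    "tstart": "time_start",
    "tsamp": "time_step",
    "src_raj": "ra",   # packed HHMMSS.ss float in sigproc
    "src_dej": "dec",  # packed DDMMSS.ss float in sigproc
}

def ddmmss_to_degrees(val):
    """Convert sigproc's packed DDMMSS.ss float to decimal degrees."""
    sign = -1.0 if val < 0 else 1.0
    val = abs(val)
    dd = int(val // 10000)          # degrees
    mm = int((val % 10000) // 100)  # arcminutes
    ss = val % 100                  # arcseconds
    return sign * (dd + mm / 60.0 + ss / 3600.0)
```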

Next-gen use cases and questions

I guess the most important thing for adoption is that the package is easy to use for current use cases, and that it can support next-generation use cases too. The UWL and narrow-bandwidth FRBs make a good case study: sub-band searches will probably become more common. How easy would it be to pipe data into tensorflow? How about multiple beams? How about parallelization across nodes (e.g., dask), parallelization on CPUs (OpenMP), or GPU acceleration (cupy)?

It is easy to have 'feature creep' and make the task huge, so it will be important to set a scope and stick to it. What exactly is a 'single-pulse toolbox'? And does this need to be high performance for FRB searches, or is usability more important?

pravirkr commented 3 years ago

Hi @telegraphic,

xarray seems to be the perfect replacement for subclassing ndarray. However, one of the issues I found is in the implementation. I like the existing structure of having methods added to the array itself. We can also pipe methods together, e.g.:

tim = TimeSeries(data)
tim.downsample().pad().toDat()

I don't see a safe way to subclass xarray (the docs discourage subclassing). Instead, they suggest using accessors to extend it with new methods, but I am not able to figure out a clean API design using accessors. Another approach would be to wrap the xarray inside the object; the array could then be accessed via tim.data.
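A minimal sketch of the wrapper approach (illustrative names; a real implementation would hold an xarray where this sketch holds a plain ndarray), showing that method chaining still works because each method returns a new wrapper:

```python
import numpy as np

class TimeSeries:
    """Wrapper approach: the array lives in `self.data`, methods chain."""

    def __init__(self, data, tsamp=1.0):
        self.data = np.asarray(data, dtype=np.float32)
        self.tsamp = tsamp  # sampling interval, seconds

    def downsample(self, factor=2):
        """Average adjacent samples; returns a new TimeSeries."""
        n = (self.data.size // factor) * factor
        out = self.data[:n].reshape(-1, factor).mean(axis=1)
        return TimeSeries(out, self.tsamp * factor)

    def pad(self, npad=0):
        """Zero-pad at the end; returns a new TimeSeries."""
        out = np.concatenate([self.data, np.zeros(npad, dtype=np.float32)])
        return TimeSeries(out, self.tsamp)
```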

For FFT, I think the better option would be to use the NumPy FFT and switch to pyFFTW (if available), thus removing FFTW3 as a dependency of this package.
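The backend switch could be as simple as an import-time fallback, since pyfftw's numpy_fft interface is drop-in compatible with numpy.fft:

```python
import numpy as np

try:
    # pyfftw mirrors the numpy.fft API, so it can be swapped in wholesale
    from pyfftw.interfaces import numpy_fft as fft_backend
except ImportError:
    from numpy import fft as fft_backend  # always available

def power_spectrum(tim):
    """Power spectrum of a real-valued time series via rfft."""
    spec = fft_backend.rfft(np.asarray(tim, dtype=np.float64))
    return np.abs(spec) ** 2
```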

Re sigproc headers, yes, I agree. So, one idea could be to move all sigproc- and psrfits-related functions to the io module and have a common Header class with classmethods, which could be used as:

hdr = Header.from_sigproc()
hdr = Header.from_psrfits()
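A hypothetical sketch of such a Header class (the parsing bodies are placeholders, not the real implementation):

```python
class Header:
    """Format-agnostic header with classmethod constructors (sketch)."""

    def __init__(self, **keywords):
        self.keywords = dict(keywords)

    @classmethod
    def from_sigproc(cls, filename=None, **kw):
        # would parse the binary sigproc header here
        return cls(format="sigproc", filename=filename, **kw)

    @classmethod
    def from_psrfits(cls, filename=None, **kw):
        # would read PRIMARY/SUBINT HDU keywords via astropy.io.fits here
        return cls(format="psrfits", filename=filename, **kw)
```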

We still use some of the sigproc command-line tools, e.g., header, seek, dedisperse, etc. Should we aim to provide these tools in this package? I got interested in the idea of updating sigproc to modern standards in sigproc2.

Single-pulse toolbox

So, my intention behind a single-pulse toolbox is to have a reference place for robust methods to simulate and analyze single pulses/FRBs. For example: what is the S/N of a burst? There are different implementations across search packages, and to understand the S/N reported in the literature we always have to refer to the package code. There should be one robust way to calculate S/N in this toolbox (e.g., the matched-filter approach, @vivgastro). What is the optimized DM of the burst? Same argument again: we could have pdmp and DM_phase methods to get a robust estimate. How do we simulate a burst the right way? So, basically, I am looking for a python-based implementation of psrchive targeted towards FRBs, using the profile and block classes. We can have support for polarizations and robust TOA estimation. These numbers could then be reported in the literature and on TNS with a uniform definition.
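As an illustration of the 'optimized DM' part, here is a crude brute-force DM search: dedisperse at each trial DM and keep the one that maximizes the dedispersed profile peak. This is only a stand-in sketch, far simpler than pdmp or DM_phase; it assumes frequencies in MHz, tsamp in seconds, and the usual cold-plasma dispersion constant of about 4.148808e3 s MHz^2 pc^-1 cm^3.

```python
import numpy as np

def dedisperse(block, dm, freqs, tsamp):
    """Shift out the cold-plasma delay per channel, then sum over channels.

    block: (nchan, nsamp) array; freqs: channel centres in MHz;
    tsamp: sampling time in seconds. Uses np.roll, so samples wrap.
    """
    delays = 4.148808e3 * dm * (freqs**-2 - freqs.max() ** -2)  # seconds
    shifts = np.round(delays / tsamp).astype(int)
    out = np.zeros_like(block)
    for i, s in enumerate(shifts):
        out[i] = np.roll(block[i], -s)
    return out.sum(axis=0)

def best_dm(block, trial_dms, freqs, tsamp):
    """Brute-force DM optimisation: maximise the dedispersed peak."""
    peaks = [dedisperse(block, dm, freqs, tsamp).max() for dm in trial_dms]
    return trial_dms[int(np.argmax(peaks))]
```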

I now think this demands a separate package, which should have the block and profile classes and most of the psrchive methods. We could then also have a standard format to store these data (similar to .ar files).

telegraphic commented 3 years ago

It seems you came to the same conclusion about xarray! tim.data seems reasonable to me.

Using from_sigproc and from_psrfits for the header sounds good to me.

PyFFTW (or the cupy FFT) are easy replacements.

I love that sigproc2 is PravirK's update to Evan Keane's fork of Michael Keith's release of Duncan Lorimer's original SIGPROC. Maybe provide similar command-line tools but with different names to avoid name collisions? e.g., spt_seek and spt_header.

Adding 'publication quality plots' to the list as you didn't explicitly mention it 📈

pravirkr commented 3 years ago

OK, I have moved the single-pulse toolbox idea (block/profile, psrchive, simulation, plots) to a different package, burstpy, for now.

As we expand the sigpyproc codebase with other formats, a restructuring is required (to v0.6 with #16). Based on this discussion, this is my roadmap for sigpyproc v1.0.

pravirkr commented 3 years ago

@telegraphic Re PSRFITS support, there are already several implementations (yours, PulsarDataToolbox, etc.). Why rewrite things again if the definitions are fixed? Instead, we could add one of the implementations as a dependency.

I like the baseband_tasks.io.psrfits framework (also baseband_tasks.io.hdf5), and it will be easy to adapt here; however, it supports only fold mode. I think the search-mode addition will be easy to implement, so I will try for a PR there. All we need are robust wrappers around astropy.io.fits to read header keywords and data.
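A minimal sketch of such a wrapper over the primary-HDU keywords (the internal names on the right are illustrative; the function accepts any mapping with PSRFITS-style keys, e.g. an astropy.io.fits Header, so the sketch avoids a hard astropy dependency):

```python
# Map raw PSRFITS primary-HDU keywords onto internal names.
PSRFITS_KEYS = {
    "OBSFREQ": "freq_centre",  # MHz
    "OBSBW": "bandwidth",      # MHz
    "OBSNCHAN": "nchans",
    "SRC_NAME": "source",
}

def extract_psrfits_keywords(primary_header):
    """Pull the subset of PSRFITS keywords we care about, skipping
    anything absent rather than raising on missing keys."""
    return {
        internal: primary_header[key]
        for key, internal in PSRFITS_KEYS.items()
        if key in primary_header
    }
```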