GreenBankObservatory / dysh

https://dysh.readthedocs.io
Other
9 stars 3 forks source link

Switch to fitsio? #419

Open astrofle opened 3 hours ago

astrofle commented 3 hours ago

Feature description dysh is slow (as of 0.4.0). We need to make dysh faster.

Could switching to fitsio help with some of that? Also, fitsio supports reading row ranges -- which astropy doesn't. This could be useful for an online mode.

A quick test shows fitsio may be faster at reading data (taken from a notebook):

import numpy as np
import pandas as pd

import fitsio
from astropy.io import fits
from dysh.util import get_project_testdata

filename = get_project_testdata() / "AGBT20B_014_03.raw.vegas/AGBT20B_014_03.raw.vegas.A6.fits"

%%timeit -n 10
hdu = fits.open(filename)
df = pd.DataFrame(np.lib.recfunctions.drop_fields(hdu[1].data, "DATA"))
    36.1 ms ± 5.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
data = fitsio.read(filename)
df = pd.DataFrame(np.lib.recfunctions.drop_fields(data, "DATA"))
    13.4 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

That is almost a third of the time.

teuben commented 3 hours ago

every little bit helps, but we need to see this in the "bigger" picture. If we have a large getps() - and we do have a benchmark. In issue #279 I did some work, totally unfinished, but telling, about a benchmark. LIke with all good software, we should define a benchmark that a user can run, and verify the result, e.g. in the output of the statistics of the spectrum to many digits. I've advocated this "qac" technique as an option to the Spectrum.stats() function.

I'm actually surprised, I've naively thought fits.open() would use fitsio.