esheldon / fitsio

A python package for FITS input/output wrapping cfitsio
GNU General Public License v2.0

Can I use fitsio to loop quickly over 20k small FITS files? #335

Open Nestak2 opened 2 years ago

Nestak2 commented 2 years ago

Hi, I need to extract information from a few columns in ~20k different FITS files. Each file is relatively small, ~0.2 MB. So far I have been doing this with a loop and astropy, like this:

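import numpy as np
from astropy.io import fits

data = []
for file_name in fits_files_list:
    # memmap=False reads the HDU data into memory instead of memory-mapping it
    with fits.open(file_name, memmap=False) as hdulist:
        # HDU 1 stores log10(wavelength) and flux; HDU 2 stores z
        lam = np.around(10**hdulist[1].data['loglam'], 4)
        flux = np.around(hdulist[1].data['flux'], 4)
        z = np.around(hdulist[2].data['z'], 4)
    data.append([lam, flux, z])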

For the 20k FITS files this takes ~2.5 hours, and from time to time I need to loop through the files again for other reasons. To cut that time down, I tried fitsio like this:

import fitsio
import numpy as np

data = []
for file_name in fits_files_list[:300]:  # first 300 files only
    # the context manager closes each file once its columns are read
    with fitsio.FITS(file_name) as hdulist:
        lam = np.around(10**hdulist[1]['loglam'][:], 4)
        flux = np.around(hdulist[1]['flux'][:], 4)
        z = np.around(hdulist[2]['z'][:], 4)
    data.append([lam, flux, z])

Unfortunately, this gives me little or no time improvement. So my questions are: Can I speed up the loop with fitsio? Do you know of other packages that would help? Can I change my algorithm to make it run faster, e.g. by somehow vectorizing the loop? Or is there software that can quickly stack 20k FITS files into a single FITS file (TOPCAT has no function that does this for more than 2 files)? Thanks for any ideas and comments!

esheldon commented 2 years ago

It might be good to profile this, to see if it is limited by reading from disk.
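One rough way to check this, as a sketch only: time how long it takes to read the raw bytes of a sample of files with no FITS parsing at all, then time the full per-file loader on the same sample. The helper names `time_raw_reads` and `time_loader` are made up for illustration, and `loader` stands for whatever function wraps the body of the loop above. Because the operating system usually caches files after the first read, the second timing mostly reflects parsing cost rather than disk cost, so the comparison is only indicative.

```python
import time
from pathlib import Path


def time_raw_reads(paths):
    """Read every file's bytes without parsing them.

    Returns (elapsed_seconds, total_bytes). On a cold cache this
    approximates the pure disk/network cost of the loop.
    """
    start = time.perf_counter()
    total_bytes = 0
    for path in paths:
        total_bytes += len(Path(path).read_bytes())
    return time.perf_counter() - start, total_bytes


def time_loader(paths, loader):
    """Call the full per-file loader on each path and return elapsed seconds.

    `loader` is a hypothetical callable taking one file name, for example
    a function wrapping the fitsio or astropy loop body.
    """
    start = time.perf_counter()
    for path in paths:
        loader(path)
    return time.perf_counter() - start
```

For example, with `sample = fits_files_list[:300]`, call `time_raw_reads(sample)` first and then `time_loader(sample, loader)`. If the raw read takes most of the total, the loop is probably limited by the disk and switching FITS libraries is unlikely to help much.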

If it is read-limited, then the best way to speed it up would be to run multiple jobs on different machines and combine the results afterward.
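A minimal sketch of that split-and-combine pattern, assuming every job can see the same file list and a shared output directory. The names `chunk_for_job`, `run_job`, `combine`, and the `part_*.pkl` file naming are all hypothetical, and `read_one` stands for a function that extracts the columns from a single file. How the jobs get launched (a batch scheduler, several shells, and so on) is left out.

```python
import pickle
from pathlib import Path


def chunk_for_job(files, job_index, n_jobs):
    """Return the files that job `job_index` (0-based) should process."""
    if not 0 <= job_index < n_jobs:
        raise ValueError("job_index must be in [0, n_jobs)")
    # Striding gives every job a near-equal share with no overlap.
    return files[job_index::n_jobs]


def run_job(files, job_index, n_jobs, read_one, out_dir):
    """Process this job's share of the files and save the results to disk."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Sort first so that all jobs agree on the same ordering.
    my_files = chunk_for_job(sorted(files), job_index, n_jobs)
    results = [(name, read_one(name)) for name in my_files]
    out_path = out_dir / f"part_{job_index:04d}.pkl"
    with open(out_path, "wb") as fh:
        pickle.dump(results, fh)
    return out_path


def combine(out_dir):
    """Load every part file and concatenate the (name, result) pairs."""
    combined = []
    for part in sorted(Path(out_dir).glob("part_*.pkl")):
        with open(part, "rb") as fh:
            combined.extend(pickle.load(fh))
    return combined
```

Each machine would call `run_job` with its own `job_index`, and a final step calls `combine` once all part files exist. Keeping the file name next to each result makes it possible to check afterward that no file was skipped or processed twice.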