astrogo / fitsio

fitsio is a pure-Go package to read and write `FITS` files
BSD 3-Clause "New" or "Revised" License
53 stars 24 forks source link

Is it possible to read only the Header Unit and not the Data Unit? #30

Open airnandez opened 6 years ago

airnandez commented 6 years ago

I'm evaluating astrogo/fitsio for scanning the header units of hundreds of thousands of FITS files. I'm not interested in the contents of the data units.

I noticed that the method DecodeHDU reads not only the Header Unit but also its associated Data Unit.

This behavior is inconvenient for my use case. The structure of each one of the 800+K FITS files I need to scan is basically:

Is there any way to tell astrogo/fitsio not to consume (i.e. read from disk) the data unit (i.e. 18MB in this particular case) but only the header unit (i.e. 14 KB in this particular case)? Given the number and the structure of the files I need to scan, that would greatly help.

sbinet commented 6 years ago

right now it doesn't seem to be advisable with the current decoder (ie: fitsio.streamDecoder).

to implement this, we'd need to implement the fitsio.seekDecoder that can go back to where the data (associated with the header) was when somebody actually requests the data. one also needs to slightly modify the imageHDU and other HDUs to handle the case where data is "lazily" loaded (or loaded just-in-time.)

I'll try to implement this on my way back to Clermont-Fd (tuesday night). unless you beat me to it :)

sbinet commented 6 years ago

I have just pushed a branch lazy-image-loader that tries to address this, at least for the Image HDU. could you try it out and report back?

I think I must try harder to provide a better API/interface for this though...

airnandez commented 6 years ago

I tested branch lazy-image-loader and the result for me is the same: DecodeHDU() still requires the underlying io.Reader to read the full file contents.

I tested that by using the Reader below:

type UnbufferedReader struct {
    f     *os.File
    count int64
}

// NewUnbufferedReader returns a new UnbufferedReader object
func NewUnbufferedReader(f *os.File) *UnbufferedReader {
    return &UnbufferedReader{
        f: f,
    }
}

// Read implements io.Reader
func (ur *UnbufferedReader) Read(p []byte) (int, error) {
    count, err := ur.f.Read(p)
    if err == nil {
        ur.count += int64(count)
    }
    return count, err
}

// Close implements io.Closer
func (ur *UnbufferedReader) Close() error {
    fmt.Printf("Close: [%s] total bytes read: %d\n", ur.f.Name(), ur.count)
    return ur.f.Close()
}

I call fitsio.Open() as follows:

...
file, err := os.Open(path)
if err != nil {
     return err
}

f, err := fitsio.Open(NewUnbufferedReader(file))

Passing to fitsio.Open an analogous io.ReadSeeker also implies reading the underlying file entirely.

sbinet commented 6 years ago

Ok... It occurred to me I only implemented it for image hdu. Do you have other hdu in there?

(In any case, I'll try with your reader impls. On Monday. I am on holidays, starting now.)

airnandez commented 6 years ago

The file I'm testing with is actually an image and fitsio correctly detects that, which I know by printing the type of the header. There is a single HDU in that file.

Below is an excerpt of the FITS cards of my test file:

SIMPLE = true
BITPIX = 16
NAXIS = 2
NAXIS1 = 2144
NAXIS2 = 4241
EXTEND = false
BUNIT = ADU
BSCALE = 1
BZERO = 32768
BLANK = -32768
TIMESYS = UTC
...
sbinet commented 6 years ago

ok. I've found the underlying issue. the DecodeHDU method does correctly detect the io.Seeker and does correctly,lazily decode the Data unit. and after the first HDU, the number of bytes read does correspond to the number of bytes that makes up a header (14400). it's when the decoder tries to go to the next HDU: the attempt at parsing the header iterates over all the data...

let me fix that.

sbinet commented 6 years ago

ok. done.

I am still not very happy with the way I have implemented the lazy loading. I am pondering on whether I shouldn't split the implementation of the various HDU into a HU and a DU part (HU is already factored out as Header. we could do the same for the DU part) so the actual loading of the DU part could happen when doing du := hdu.Data().(*fitsio.Table)

airnandez commented 6 years ago

I just tested with branch lazy-image-loader. When passing an io.Seeker as the argument to fitsio.Open(), DecodeHDU() correctly consumes the part of the FITS file which contains the header unit and does not consume the data unit.

However, when using an io.Reader the whole HDU is consumed even if I'm only interested in retrieving the header. I suspect this is intentional.

In both cases, this different behavior deserves to be documented, so that as a user I understand the tradeoffs of using io.Seeker vs. io.Reader for consuming a FITS file.

airnandez commented 6 years ago

Regarding the API, it would be clearer indeed to separate the Header part from the Data part so that one can retrieve them separately. Something like:

f := fitsio.Open(...)
hdr := f.HDU(0).Header()

// actually load the data unit of this HDU
data := f.HDU(0).Data()
sbinet commented 6 years ago

Wrt the API: yes, that's what I was thinking as well.

Wrt the doc: yep, that also needs to be properly documented.

Wrt the io.Reader it's indeed the intended behavior: an io.Reader can only consume blocks of data, no seeking can be achieved.