Handle data length inconsistencies in BOSS spec files

dcunning11235 commented 9 years ago

Some spec files (see spec-3587-55182-xxxx.fits for 0002, 0038, 0084, 0094, 0102 as examples) are missing the first (have spot-checked a handful only, could be others) column of spectrum data, relative to the other spec files for their own plate/mjd. These example spec files (seem) to all have 4618 or 4619 wlen bins.

Also, the corresponding spPlate file has 4631 columns/bins, so there has been some loss in spPlate --> spec, generally.

Overall, spPlate files have some variable number of columns, seemingly in the range of 4620's - 4630's. Spot checking a handful of these in the plate ID range 3500-6000 shows that all their corresponding spec files have (a) some bins been dropped/fewer columns from spPlate --> spec; and (b) the spec files for the same plate vary in their column count (6000-56102 has some more examples with larger differences .)

I think it is definitely worthwhile to try to present spec files for a given plate have a consistent number of columns.

It would be nice to have spec files match up with their spPlate file (though arguably if you're using both you should just switch to one or the other.)

I wonder if it is worthwhile to also attempt to normalize spPlates at some level. spPlate should stay as is, but having a slightly higher level wrapper, perhaps a "PlateSet" that could present plates with their wlen buckets aligned (as closely as possible) with masked/empty values inserted/appended as needed. Basically, to handle some of the (what will be) boilerplate for comparing spectra across plates/mjds.

dkirkby commented 9 years ago

The script that builds the spec and spec-lite files strips leading and trailing ivar=0 bins. Is that consistent with what you are finding, i.e., do the extra bins in spPlate have ivar=0?

I am more puzzled by the varying number of rows (I think of 'flux', 'ivar', etc as the columns) in spPlate files. The number of rows within a plate should always be the same (since spectra are stored as a 2D array of wavelength records), so I guess you are seeing different dimensions in different plates?

dcunning11235 commented 9 years ago

Yes, re: ivar=0; that does look like what is happening.

And yes, different dims in different plates, by small-ish amounts, e.g.

Plate	row count	wlen first pixel
spPlate-3586-55181.fits	4638	3.5517
spPlate-3587-55182.fits	4631	3.5522
spPlate-3924-55332.fits	4639	3.5533
spPlate-6000-56102.fits	4626	3.5546

Checked against headers in FITS files,this is what they list too.

dkirkby commented 9 years ago

Its annoying that the BOSS data has these small inconsistencies, but the philosophy of this package is mostly to provide the data "as-is" with thin wrapper classes around each type of FITS file.

With this in mind, I suggest that we add an attribute pixel_offset that is calculated in the constructors of SpecFile and PlateFile such that flux[i + pixel_offset] always refers to the same wavelength bin. I believe this is all that is needed to combine and compare arrays with different offsets.

dcunning11235 commented 9 years ago

Alright, let me figure out what the lowest wlen is; seems like 3.5 is safe, but since the increment is 10^-4 a better estimate would be nice.

Hard coding this does need a safe value for any of the surveys/catalogs. I imagine it doesn't change dramatically, but it is another test point.

dmargala commented 9 years ago

I believe the fiducial starting wavelength is 3500.26 ang. I have implemented this sort of stuff in other places but I am busy for the next few days. On Jul 2, 2015 5:24 PM, "dcunning11235" notifications@github.com wrote:

Alright, let me figure out what the lowest wlen is; seems like 3.5 is safe, but since the increment is 10^-4 a better estimate would be nice.

Hard coding this does need a safe value for any of the surveys/catalogs. I imagine it doesn't change dramatically, but it is another test point.

— Reply to this email directly or view it on GitHub https://github.com/dkirkby/bossdata/issues/67#issuecomment-118201428.

dmargala commented 9 years ago

This might need a little cleaning up.

def get_fiducial_pixel_index_offset(coeff0, coeff1=1e-4, fid_lambda=3500.26):
        """
        Returns the pixel index offset from the start of the BOSS co-add fiducial wavelength grid.
        If more than 1% away from a fiducial pixel index, returns the exact offset (in pixels).

        Args:
            coeff0 (float): central wavelength (log10) of first pixel
            coeff1 (float, optional): log10 spacing between pixel centers

        Returns:
            pixel index offset from the start of the fiducial wavelength grid.
        """
        delta = (np.log10(fid_lambda)-coeff0)/coeff1
        # round to nearest pixel index
        offset = np.floor(delta+0.5).astype(int)
        # if more than %1 away from pixel index, return exact offset (in pixels)
        if np.any(np.fabs(delta-offset) > 0.01):
            return -delta
        # otherwise return nearest pixel index
        return -offset

dcunning11235 commented 9 years ago

@dkirkby Finally looked at this some more and pixel_offset would work for PlateFile, but I think would require altering the SpecFile.get_valid_data method as this is where spec data is actually handled. Also, within a single spec file the exposures can have slightly different starting wavelengths, enough to end up in different bins: 3586-55181-0001, exposure 5 has staring wavelength (masked or unmasked) at 3561.5444 while all other exposures start with 3561.7224-3562.4185, putting the first wlen for exposure 5 in bin 75 and the first wlen for the rest in bin 76; so I don't think this can be an instance var for SpecFile.

Rather than write into the classes at all, what about this: Add the above (modified slightly) method from @dmargala to spec.py (module level, not class) along with a second method that takes spectra masked arrays and returns a new ma shifted, padded, and masked to the correct offset?

EDIT: like so: https://github.com/dkirkby/bossdata/compare/%2367

dkirkby commented 9 years ago

It looks like you are copying the entire data array just to avoid keeping track of each spectrum's offset. I was hoping that we could provide the same convenience without actually copying everything, e.g., with a method that returns a view rather than a copy. It would be useful to have a concrete use case for comparing these approaches so we can measure the copy overhead and figure out what the non-copying API would look like. The simplest use case I can think of is adding two combined spectra.

dkirkby commented 9 years ago

I just added the following to the spec module on my branch for #86:

def get_fiducial_pixel_index(wavelength, coeff=1e-4, log10lam0=np.log10(3500.26)):
        """
        Convert a wavelength to a fiducial pixel index.

        Args:
            wavelength(float): Input wavelength in Angstroms.
            coef(float): Linear coefficient used for the conversion.
            log10lam0(float): log10 of the wavelength (in Angstroms) corresponding
                to pixel index zero.

        Returns:
            int: Index relative to the fiducial wavelength grid. If the input wavelength
                is not exactly on the fiducial grid, the result will be rounded in
                log10(wavelength).
        """
        return np.round((np.log10(wavelength) - log10lam0)/coeff).astype(int)

Did someone already figure out the maximum pixel index that would cover any SpecFile or PlateFile spectrum?

dmargala commented 9 years ago

I usually use 3500.26*10**(1e-4*np.arange(4800)) to get a fiducial wavelength grid that will fit all observations.

dkirkby / bossdata

Handle data length inconsistencies in BOSS spec files #67