Header metadata and DB schema definitions

DinoBektesevic commented 3 years ago

The biggest blocker to moving forward in building out the query and image metadata handling is the fact that there are no explicit decisions on what header metadata we want to, or even can, keep.

I think it's safe to assume that if we use the Astropy WCS class we can reliably define a standard interface to standard WCS information for a wide array of instruments. The remaining keywords are present for some instruments and absent for others.

Even a provisional list of header keywords we want to track and standardize on is a start, even if those keywords are not always present for all instruments. Without at least that I can't map any header to a table schema, and I certainly have a much harder time designing a query interface.

mrawls commented 3 years ago

Critical metadata: observer location (lon and lat), sky position (ultimately WCS, but center RA and Dec plus a pixel scale is probably acceptable), date and time, exposure duration, band and/or filter, and whether the exposure is raw or processed in some way.

DinoBektesevic commented 3 years ago

As mentioned in the draft PR https://github.com/dirac-institute/trailblazer/pull/10 - standardizing the keywords we want to store from the header (in searchable format) sets our database schema and query interface, and is currently a high priority issue as it directly affects the core goals of the project.

The astro_metadata_translator enables us to translate and standardize all header keywords for the following instruments:

SDSS
DECam
HSC
MegaPrime
Subaru
SuprimeCam

and also enables us to standardize the following keywords (or at least most of them) for a "generic" instrument:
DATE-OBS
INSTRUME
TELESCOP
OBSGEO-[X,Y,Z]

We were hoping to use the Astropy WCS class to standardize our projection and keywords across all the different official WCS standards, but ran into some difficulties.

If that can be made to work we will have about 80% of the keys mentioned by Meredith (exposure duration and band/filter might suffer).
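As a rough illustration of the "generic" case, here is a minimal sketch that pulls only the keywords listed above out of a header-like mapping. It deliberately does not use astro_metadata_translator; `standardize_generic`, `missing_keys` and the output field names are hypothetical, and it assumes DATE-OBS holds an ISO 8601 string.

```python
from datetime import datetime

# Keywords the thread lists as standardizable for a "generic" instrument.
GENERIC_KEYS = ("DATE-OBS", "INSTRUME", "TELESCOP",
                "OBSGEO-X", "OBSGEO-Y", "OBSGEO-Z")


def standardize_generic(header):
    """Pull the generic keywords out of a header-like mapping.

    Hypothetical helper: missing keywords become None instead of raising,
    since not every instrument writes every keyword.
    """
    date_obs = header.get("DATE-OBS")
    obsgeo = tuple(header.get(f"OBSGEO-{axis}") for axis in "XYZ")
    return {
        # Assumption: DATE-OBS is an ISO 8601 string such as 2019-05-01T03:04:05.
        "datetime_begin": datetime.fromisoformat(date_obs) if date_obs else None,
        "instrument": header.get("INSTRUME"),
        "telescope": header.get("TELESCOP"),
        # Keep the geocentric triple only if all three components are present.
        "obsgeo": obsgeo if all(v is not None for v in obsgeo) else None,
    }


def missing_keys(header):
    """List the generic keywords that this header does not provide."""
    return [key for key in GENERIC_KEYS if key not in header]
```

Calling `missing_keys` on a sample of headers from each instrument would be one cheap way to see which of the critical fields can realistically be made mandatory.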

DinoBektesevic commented 3 years ago

I think https://github.com/dirac-institute/trailblazer/pull/10 draft is starting to look pretty good.

There's still some work to do:

tests of what is in there right now
debug with various files (not just SDSS)
add an upload processor for .tar.bz and the variants
dealing with errors (my knee-jerk reaction is to upload them somewhere else and use them later for debugging)
logging

The general transition to boto3 and RDS is still required, but for that I'd really like to have a DBConfig class done first, so I think I'll focus on that PR now and implement the comments I got. If people like this so far, can we move it to proper PR status?

DinoBektesevic commented 3 years ago

I feel like the latest iteration of PR https://github.com/dirac-institute/trailblazer/pull/10 warrants an update on the issue here to make it explicitly recorded in non-pythonic language as well. Bolded are table names, italicized are non-nullable values.

We currently have the following schema:

UploadInfo

id - auto ID
created - timestamp of the upload date and time
ip - originating IP

Each upload can consist of many individual FITS files.

Metadata

id - auto ID
upload - foreign key, upload ID
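processor_name (non-nullable)
standardizer_name (non-nullable)
obs_lon (non-nullable)
obs_lat (non-nullable)
obs_height (non-nullable)
datetime_begin (non-nullable)
datetime_end (non-nullable)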
instrument
telescope
science_program
exposure_duration
filter_name

Each FITS file will get one Metadata entry in the DB, so the relationship between Metadata and UploadInfo is many-to-one.

Wcs

id - auto ID
metadata - foreign key, metadata ID
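wcs_radius (non-nullable)
wcs_center_x (non-nullable)
wcs_center_y (non-nullable)
wcs_center_z (non-nullable)
wcs_corner_x (non-nullable)
wcs_corner_y (non-nullable)
wcs_corner_z (non-nullable)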

Each individual Metadata entry can have multiple Wcs entries, which makes the Wcs-to-Metadata relationship many-to-one again. This is because some FITS files come as multi-extension FITS files where each extension is a particular CCD in the entire focal plane. The observer location, observing time, instrument, filter etc. are then shared properties of all these WCS entries, but each CCD will have its own unique WCS information.

Thumbnails

id - auto ID
wcs - foreign key, wcs id
large - location of large image
small - location of small image

Some WCS entries will share the same final thumbnail, like the whole focal plane thumbnails we create for DECam, but in principle this is not a general rule. FITS files can contain any number of related and unrelated HDUs that can each be its own thumbnail. In fact the latter case probably occurs more often than the DECam case does. This is why I set the relationship from thumbnails to WCSs as one-to-one and I unroll the DECam thumbnails into individual entries, even though that duplicates some data.
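For readers who prefer to see the relationships spelled out, the four tables above can be sketched as SQLite DDL. This is only an illustration and not the project's actual model code: the snake_case table names, the column types and the `make_db` helper are assumptions, and column names follow the snake_case spelling used later in the thread (processor_name, standardizer_name).

```python
import sqlite3

SCHEMA = """
CREATE TABLE upload_info (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    created TEXT,   -- timestamp of the upload date and time
    ip      TEXT    -- originating IP
);
CREATE TABLE metadata (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    upload            INTEGER REFERENCES upload_info(id),  -- many-to-one
    processor_name    TEXT NOT NULL,
    standardizer_name TEXT NOT NULL,
    obs_lon           REAL NOT NULL,
    obs_lat           REAL NOT NULL,
    obs_height        REAL NOT NULL,
    datetime_begin    TEXT NOT NULL,
    datetime_end      TEXT NOT NULL,
    instrument        TEXT,
    telescope         TEXT,
    science_program   TEXT,
    exposure_duration REAL,
    filter_name       TEXT
);
CREATE TABLE wcs (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    metadata     INTEGER REFERENCES metadata(id),  -- many-to-one
    wcs_radius   REAL NOT NULL,
    wcs_center_x REAL NOT NULL,
    wcs_center_y REAL NOT NULL,
    wcs_center_z REAL NOT NULL,
    wcs_corner_x REAL NOT NULL,
    wcs_corner_y REAL NOT NULL,
    wcs_corner_z REAL NOT NULL
);
CREATE TABLE thumbnails (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    wcs   INTEGER UNIQUE REFERENCES wcs(id),  -- UNIQUE makes this one-to-one
    large TEXT,  -- location of large image
    small TEXT   -- location of small image
);
"""


def make_db():
    """Create an in-memory database with the sketched schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn
```

The UNIQUE constraint on thumbnails.wcs is what forces the DECam focal plane thumbnail to be unrolled into one row per Wcs entry, at the cost of repeating the same image locations.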

Confusing bits

Thinking about the schema now, I'm unsure what the difference between instrument and telescope is. The package astro_metadata_translator distinguishes between the two as the actual imaging instrument (e.g. the camera name) and the telescope (e.g. the SDSS telescope), but that package also does not have requirements as general as those laid out for trailblazer. I am slightly concerned that most of the time we won't be able to discern between the two and the FITS files will have either the telescope (header key TELESCOP I believe) or the instrument (header key INSTRUME I believe). Perhaps standardizing on a combination of instrument+telescope under some other name, like observer, is best, and then we can keep individual instrument and telescope entries separate but nullable in cases in which they are not applicable. Perhaps we can check for additional keys like OBSERVER to construct this observer entry. I don't actually know what other identifying info people put into headers, but if there are some common non-standard ones we can add them.

The second thing I'm not sure of is whether we should make datetime_start and datetime_end mandatory, or datetime_start and exposure mandatory. Either way there will be cases when one or the other has to be calculated from available data (i.e. we have datetime_start and exposure, or we have start and end but no exposure), and the question is whether we should be doing this math or not. If not, we should not make datetime_end mandatory, but we should probably still insist on having at least the start date and time.

I think the schema might change in the future, but with these last two questions answered and implemented we should close this issue.

mrawls commented 3 years ago

Thank you for laying all this out for clarity in plain text! Check out the comment above @Jasminewatt2100.

To address your confusing bits: indeed, Rubin/LSST is picky about distinguishing telescope (e.g., the Blanco 4-m) from instrument (e.g., a specific camera, like DECam). I think I'd like to keep the fields separate, but we could require just one if the translator + standardizer can't find both. The header I was just looking at also has SITEID and SITE as keywords, which might be useful. I don't want to get in a situation where we have a bunch of, say, "Kitt Peak" observations which are actually from 3 different telescope+instrument combos though.

Regarding time, I'm inclined to keep all three fields but only require two (any two), if that's not too tricky to implement. Thank you again for all your work on this important PR!
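The "any two of three" rule could look something like the sketch below. `complete_times` is a hypothetical helper, exposure is assumed to be in seconds, and treating end minus begin as the exposure duration is itself an assumption that may not hold for every instrument.

```python
from datetime import datetime, timedelta


def complete_times(begin=None, end=None, exposure=None):
    """Return (begin, end, exposure) given any two of the three.

    begin and end are datetimes; exposure is a duration in seconds.
    Raises ValueError when fewer than two values are supplied.
    If all three are supplied they are returned unchanged (no consistency check).
    """
    given = sum(value is not None for value in (begin, end, exposure))
    if given < 2:
        raise ValueError("need at least two of begin, end, exposure")
    if begin is None:
        begin = end - timedelta(seconds=exposure)
    elif end is None:
        end = begin + timedelta(seconds=exposure)
    elif exposure is None:
        exposure = (end - begin).total_seconds()
    return begin, end, exposure
```

With this shape the database can still store all three columns as non-nullable, because the missing one is always derived before the row is written.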

DinoBektesevic commented 3 years ago

> I don't want to get in a situation where we have a bunch of, say, "Kitt Peak" observations which are actually from 3 different telescope+instrument combos though.

Ah, that's a fair point. If we had a field like that it would be a complete mess of values. I just wanted to clarify that my idea was not to drop the separate instrument and telescope columns. Instead I want to keep instrument and telescope and add a new field.

Really what I want is two things, something like an observer field and something like a specialized_data_id field.

What I want is a way to record whatever generic, community-accepted ID exists for the data. For example, for SDSS it's dataSchema-filter-run-camcol-field.fits, such as frame-i-000094-5-0001.fits, and for MOA the format is sort of RUN-FIELD-FILTER-CHIP, which is how the files are named as well: A3671-C2018_F4-R-3.fit. NOAO Science Archive has a whole scheme by which they arrive at the names of the files in their data archive. None of this necessarily identifies the instrument, telescope or science_program, and yet for some users who are more intimately familiar with the data source it provides all the required information to trace back and extract additional metadata, by string manipulation if by nothing else.
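The "by string manipulation" part can be made concrete with a small sketch that splits the two example names above into their components. The regular expressions are inferred from those two examples only and are assumptions; real SDSS or MOA file names may have variants these patterns miss, and `parse_data_id` is a hypothetical name.

```python
import re

# dataSchema-filter-run-camcol-field.fits, e.g. frame-i-000094-5-0001.fits
SDSS_NAME = re.compile(
    r"(?P<schema>[A-Za-z]+)-(?P<filter>[a-z])-(?P<run>\d+)"
    r"-(?P<camcol>\d)-(?P<field>\d+)\.fits"
)
# RUN-FIELD-FILTER-CHIP, e.g. A3671-C2018_F4-R-3.fit
MOA_NAME = re.compile(
    r"(?P<run>[^-]+)-(?P<field>[^-]+)-(?P<filter>[^-]+)-(?P<chip>\d+)\.fit"
)


def parse_data_id(filename):
    """Split a community-style file name into named parts, or return None.

    All parts are returned as strings; no numeric conversion is attempted.
    """
    for source, pattern in (("SDSS", SDSS_NAME), ("MOA", MOA_NAME)):
        match = pattern.fullmatch(filename)
        if match:
            return {"source": source, **match.groupdict()}
    return None
```

Storing the raw ID string and parsing it on demand, rather than storing the parsed parts, keeps the column generic across data sources.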

So it doesn't have to be an observer kind of field, but something from which someone would be able to ask something like get "observer" if SDSS in processor_name or SDSS in standardizer_name and get back a mish-mash like SDSS Sloan Digital Sky Survey 2.5m telescope SDSS Imager Stripe 84 frame-.... whateverConstructedExtraID. For Sloan we can set the individual instrument, telescope etc. as well. But, for example, for MOA there is no good ID for science_program, even though there is a way to seemingly uniquely identify the exact image in MOA-speak.

For observer, consider the stretch goal, where we want to allow anyone to submit a FITS file, perhaps even one made in their back yard. If that FITS file contained OBSERVER or OBSVAT or ORIGIN we would be very lucky indeed. But if it doesn't, and we just say observer=independent if not any(keys), then we record at least one more bit of information than we do now.

So yeah, it's a trashbox, but it's at least a trashbox we can guarantee to exist every time, and for which we can say can contain a unique identifier of the observer or source in the format that observer or community have standardized on - or generic. Or something similar if that makes sense.
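A minimal sketch of that fallback, assuming the header behaves like a dict. The keyword names are the ones mentioned above, while the order in which they are checked and the `resolve_observer` name are assumptions made for illustration.

```python
# Keywords named in the thread; the priority order is an assumption.
OBSERVER_KEYS = ("OBSERVER", "OBSVAT", "ORIGIN")


def resolve_observer(header, default="independent"):
    """Return the first non-empty observer-like keyword, else the default."""
    for key in OBSERVER_KEYS:
        value = header.get(key)
        if value is not None and str(value).strip():
            return str(value).strip()
    return default
```

Because the function always returns a string, the observer column can be declared non-nullable regardless of what the uploaded header contains.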

mrawls commented 3 years ago

> it's at least a trashbox we can guarantee to exist every time, and for which we can say can contain a unique identifier of the observer or source in the format that observer or community have standardized on

😂 got it - sounds like a plan!

dirac-institute / trailblazer

Header metadata and DB schema definitions #7