NetCDF Metadata in models.py

njmattes commented 8 years ago

These are merely questions or observations; I'm not sure of the best way to handle most of this.

1. dataset_name: Would it be better to use an automatically incremented integer for the primary_key? E.g., what if two files have the same name? Perhaps an edge case, but very possible.
2. Spatial resolution: should we store this explicitly in a column? Or infer it from the latitude and longitude?
3. Temporal resolution: same question as above.
4. (Is it fair to assume that each gridded dataset has a spatial and a temporal resolution?)
5. If we're inferring the resolutions, we could make those @properties of the NetCDF_Meta class.
6. Or we could store that sort of 'derived'(?) metadata in another table, as we discussed Monday. Something like a NetCDF_CleanMeta, in which variable names are standardized and stored. In this table we might also store things such as the start and end date of the dataset (if we assume that users may want to search for datasets based on that info).
7. dataset_name in NetCDF_Data should (I think) be a foreign key to the primary key of NetCDF_Meta. So, if NetCDF_Meta had an auto-incremented primary key id, NetCDF_Data might have netcdf_meta_id = Column(Integer, ForeignKey('netcdf_meta.id')).
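
To make the last point concrete, here is a minimal sketch of how the two models could look with a surrogate key. It is an illustration under assumptions, not the actual models.py: the column names lat_resolution and lon_resolution, the relationship names, and the demo() helper are hypothetical, the raster column of NetCDF_Data is omitted, and an in-memory SQLite database stands in for Postgres.

```python
from sqlalchemy import Column, Float, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class NetCDF_Meta(Base):
    __tablename__ = "netcdf_meta"

    # Surrogate key: two files with the same name no longer collide.
    id = Column(Integer, primary_key=True, autoincrement=True)
    dataset_name = Column(String, nullable=False)
    # Hypothetical column names for the stored spatial resolution.
    lat_resolution = Column(Float)
    lon_resolution = Column(Float)

    data = relationship("NetCDF_Data", back_populates="meta")


class NetCDF_Data(Base):
    __tablename__ = "netcdf_data"

    id = Column(Integer, primary_key=True, autoincrement=True)
    # Foreign key to the metadata row this data row was ingested from.
    netcdf_meta_id = Column(Integer, ForeignKey("netcdf_meta.id"), nullable=False)
    # The raster column is left out of this sketch.

    meta = relationship("NetCDF_Meta", back_populates="data")


def demo():
    """Insert two same-named datasets and one data row (hypothetical helper)."""
    engine = create_engine("sqlite://")  # in-memory stand-in for Postgres
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        a = NetCDF_Meta(dataset_name="same_name.nc")
        b = NetCDF_Meta(dataset_name="same_name.nc")
        session.add_all([a, b])
        session.flush()  # assigns the auto-incremented ids
        row = NetCDF_Data(netcdf_meta_id=b.id)
        session.add(row)
        session.commit()
        return a.id, b.id, row.meta.dataset_name
```

With this layout a join between the two tables goes through netcdf_meta_id rather than through the file name.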

ghost commented 8 years ago

1. i did originally plan to have this additional ID as the primary key. the reason i didn't is that, although it's easy to generate this ID for the NetCDF_Meta table, it's a little more work (not too much, though), because of #7, to also have this ID in NetCDF_Data, where we need it in order to do joins
2. actually, when i ingest using raster2pgsql, it already stores the spatial resolution for me in an extra PostGIS-managed table called raster_columns, which knows about all the columns of type raster in the entire database. that brings us to the problem that, in our case, different rows in NetCDF_Data come from different datasets and thus have different resolutions. so you are right, we need to store this resolution somewhere else. let's add it as an additional column in NetCDF_Meta
3. you are right of course. when i used raster2pgsql to ingest a netcdf's data, it only had a notion of bands, i.e. band 1, 2, 3, ... (these are the time frames, which raster2pgsql found automatically, so it has some cleverness), but it has no idea whether a step of 1 actually means a day, week, month, year, etc. so yes, i will add another column to NetCDF_Meta for the temporal resolution
4. yes
5. that's how i'm planning to do it, see above
6. how about i solve it with NetCDF_Meta first and later, after having seen some dirtier netcdf files, abstract stuff out into NetCDF_CleanMeta?
7. yes, you're absolutely right, let me do that, as mentioned before

let me add here:

lat / long intervals might not be uniform, and the same goes for time. for the time being, however, i'm assuming uniform spacing of lats, longs, and times, and these uniform spacings are the resolutions i'm storing as additional columns in netcdf_meta
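
the uniform-spacing assumption can be checked at ingest time before a resolution is stored. a minimal sketch, assuming the coordinate values have already been read from the file into a plain list of numbers; the function name uniform_step and the tolerance are hypothetical choices, not part of the existing code:

```python
import math


def uniform_step(values, rel_tol=1e-6):
    """Return the common spacing of `values`, or None if it is not uniform.

    `values` is a sequence of coordinate values (lat, lon, or numeric time).
    Fewer than two values means there is no spacing to report.
    """
    if len(values) < 2:
        return None
    steps = [b - a for a, b in zip(values, values[1:])]
    first = steps[0]
    for step in steps:
        # Tolerate small floating-point noise between consecutive steps.
        if not math.isclose(step, first, rel_tol=rel_tol, abs_tol=1e-12):
            return None
    return first


# Usage: a regular half-degree grid versus an irregular axis.
regular = uniform_step([0.0, 0.5, 1.0, 1.5])
irregular = uniform_step([0.0, 1.0, 3.0])
```

a None result would signal a dataset that breaks the assumption, so the resolution columns could be left empty for it instead of storing a misleading value. the same check applies to a time axis as long as it is stored as numbers (for example, offsets from a reference date).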

ghost commented 8 years ago

TODO for myself:

extract spatial + temporal resolution from netcdf and add it to NetCDF_Meta as new columns, of course also add it to the corresponding SQLAlchemy object
add an autoincrement primary key id column to NetCDF_Meta and also add a column netcdf_meta_id to NetCDF_Data that's a foreign key to netcdf_meta.id

njmattes commented 8 years ago

Yes, you're totally right that we should stick to fixing NetCDF_Meta now, and wait for NetCDF_CleanMeta until we actually need it.

Another question: are units stored in NetCDF_Meta already? Are those in vars_attrs perhaps?

ghost commented 8 years ago

yes, vars_attrs contains the variables' attributes in the order of the variables, with each key immediately followed by its value, i.e.

var_1_key_1, var_1_val_1, var_1_key_2, var_1_val_2, var_1_key_3, var_1_val_3, .... , var_2_key_1, var_2_val_1, var_2_key_2, var_2_val_2, var_2_key_3, var_2_val_3, .... , ....

and the key-value pair(s) for units would be among them. however, i have to point out that the metadata stored in netcdf_meta as it is right now does not allow us to locate the value for a units key very efficiently.

to get the value of a units key for a particular variable we would have to do the following with the current metadata

1. display vars_names to the user, who picks a specific var among them
2. loop over vars_names to find the index of that var, let's call it var_index
3. sum up all entries in vars_attrs_nums up to and excluding index var_index to get, let's call it, var_attr_index, which is the index in vars_attrs where the key-value attribute pairs of var start. also save vars_attrs_nums[var_index] as, say, var_num_attrs
4. loop over vars_attrs starting at var_attr_index and going to at most var_attr_index + var_num_attrs - 1, search for an entry that equals units, and finally return the immediately following entry (which is the value of the units key-value pair for the variable var the user was interested in)
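
as a rough client-side illustration of the lookup in the last three steps, here is a sketch in plain Python. it assumes, as the steps are written, that each entry of vars_attrs_nums counts the flat entries (keys and values together) that a variable occupies in vars_attrs; if it counts key-value pairs instead, the offsets would have to be doubled. the function name find_var_attr and the sample data are hypothetical. unlike the description above, it only inspects key positions, so a value that happens to equal "units" is never mistaken for a key.

```python
def find_var_attr(vars_names, vars_attrs_nums, vars_attrs, var_name, key="units"):
    """Return the value stored for `key` on variable `var_name`, or None."""
    # Position of the variable; raises ValueError if the variable is unknown.
    var_index = vars_names.index(var_name)
    # Offset where this variable's key/value entries start.
    start = sum(vars_attrs_nums[:var_index])
    # One past the last entry belonging to this variable.
    end = start + vars_attrs_nums[var_index]
    # Keys sit at every second position, each followed by its value.
    for i in range(start, end - 1, 2):
        if vars_attrs[i] == key:
            return vars_attrs[i + 1]
    return None


# Made-up metadata for three variables, in the flat layout described above.
vars_names = ["lat", "lon", "tas"]
vars_attrs_nums = [2, 2, 4]
vars_attrs = [
    "units", "degrees_north",
    "units", "degrees_east",
    "long_name", "air temperature", "units", "K",
]

tas_units = find_var_attr(vars_names, vars_attrs_nums, vars_attrs, "tas")
```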

this might look cumbersome, however, we can of course implement steps 2-4 as a stored procedure in order to avoid sending multiple requests to Postgres. because of that, and also because we usually don't have that many variables (and less importantly that many key-value attributes for a fixed variable) the above procedure shouldn't take too long.

in any case we can of course always optimize later. but at least we would be able to avoid sending multiple requests by using a stored procedure to implement the above and we would likely be using similar stored procedures to implement other convenient metadata searches.

RDCEP / EDE

NetCDF Metadata in models.py #6