Open BecCowley opened 6 months ago
Can we transform the metadata into data? That is, treat metadata as data: in the same way we treat LATITUDE and LONGITUDE, we would also turn things like probe type, instrument type, recorder type, institute, etc. into data.
https://gmd.copernicus.org/preprints/gmd-2021-138/gmd-2021-138.pdf A paper on how large NetCDF files are converted to Parquet format. They handle the global attributes by including an additional file.
For a bit more explanation on this: the metadata currently written in the Parquet sidecar is the metadata of the dataset, not the metadata of the original input NetCDF files. This creates an "issue" (very similar to the one we already had with the data stored in PostgreSQL) where specific NetCDF metadata is lost. In other words, the Parquet format is "lossy" compared to the original NetCDF files. For example, with the glider data, how do we store this kind of information:
"PLATFORM": {
"type": "string",
"trans_system_id": "Irridium",
"positioning_system": "GPS",
"platform_type": "Slocum G2",
"platform_maker": "Teledyne Webb Research",
"firmware_version_navigation": 7.1,
"firmware_version_science": 7.1,
"glider_serial_no": "416",
"battery_type": "Alkaline",
"glider_owner": "CSIRO",
"operating_institution": "ANFOG",
"long_name": "platform informations"
},
"DEPLOYMENT": {
"type": "string",
"deployment_start_date": "2015-10-21-T05:00:02Z",
"deployment_start_latitude": -18.9373,
"deployment_start_longitude": 146.881,
"deployment_start_technician": "Gregor, Rob",
"deployment_end_date": "2015-10-27-T01:56:23Z",
"deployment_end_latitude": -19.2358,
"deployment_end_longitude": 147.5188,
"deployment_end_status": "recovered",
"deployment_pilot": "pilot, CSIRO",
"long_name": "deployment informations"
},
"SENSOR1": {
"type": "string",
"sensor_type": "CTD",
"sensor_maker": "Seabird",
"sensor_model": "GPCTD",
"sensor_serial_no": "9117",
"sensor_calibration_date": "2013-09-17",
"sensor_parameters": "TEMP, CNDC, PRES, PSAL",
"long_name": "sensor1 informations"
},
"SENSOR2": {
"type": "string",
"sensor_type": "ECO Puck",
"sensor_maker": "Wetlabs",
"sensor_model": "FLBBCDSLC",
"sensor_serial_no": "3345",
"sensor_calibration_date": "2013-10-07",
"sensor_parameters": "CPHL, CDOM, VBSC",
"long_name": "sensor2 informations"
},
...
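To make the question concrete, here is a minimal sketch of two ways this kind of nested information could be carried into a tabular product: flattened into one value per `VARIABLE.attribute` name (which could then become columns), or serialised as a single JSON string. This is only an illustration under assumptions, not how the current code works: `flatten_attrs` and the `.` separator are hypothetical, and the dictionary is a short excerpt of the attributes listed above.

```python
import json

# Short excerpt of the per-file attributes listed above (not the full set).
variable_attrs = {
    "PLATFORM": {"platform_type": "Slocum G2", "glider_serial_no": "416"},
    "SENSOR1": {"sensor_type": "CTD", "sensor_serial_no": "9117"},
}


def flatten_attrs(attrs, sep="."):
    """Flatten {variable: {attribute: value}} into {"variable<sep>attribute": value}.

    Hypothetical helper: each flattened key could become one column
    whose value is repeated on every row coming from that file.
    """
    flat = {}
    for var_name, var_attrs in attrs.items():
        for attr_name, value in var_attrs.items():
            flat[f"{var_name}{sep}{attr_name}"] = value
    return flat


# Option 1: one entry per attribute, ready to be written as columns.
flat = flatten_attrs(variable_attrs)

# Option 2: keep the nesting intact by storing it as one JSON string.
as_json = json.dumps(variable_attrs, sort_keys=True)
```

Flattening keeps each attribute individually queryable, while the JSON string preserves the original structure at the cost of needing to be parsed by the user.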
@lbesnard I think what @BecCowley is talking about is adding some global attributes from the original NetCDF files into columns in the Parquet product. I'm pretty sure you talked about this as being possible with your code, you just have to configure it to do it, right?
@BecCowley you just need to specify which global attributes we should be adding.
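As a rough sketch of what "specifying which global attributes to add" could look like, the snippet below copies a chosen list of global attributes onto every data row of a file, so they end up as ordinary columns. Everything here is an assumption for illustration: `promote_global_attributes`, the attribute names `institution` and `platform_code`, and all values are hypothetical placeholders, and the real configuration may look quite different.

```python
def promote_global_attributes(rows, global_attrs, keep):
    """Return new rows with the selected global attributes added as columns.

    rows         -- list of dicts, one per observation
    global_attrs -- dict of global attributes read from one input file
    keep         -- names of the global attributes to promote
    """
    missing = [name for name in keep if name not in global_attrs]
    if missing:
        # Fail loudly rather than silently writing an incomplete product.
        raise KeyError(f"global attributes not found in file: {missing}")
    promoted = {name: global_attrs[name] for name in keep}
    # The promoted values are repeated on every row; on a name clash the
    # promoted attribute overwrites the existing row value.
    return [{**row, **promoted} for row in rows]


# Placeholder values only, not taken from a real file.
global_attrs = {"institution": "EXAMPLE", "platform_code": "X1", "title": "demo"}
rows = [
    {"LATITUDE": -18.9, "LONGITUDE": 146.9, "TEMP": 25.1},
    {"LATITUDE": -19.0, "LONGITUDE": 147.0, "TEMP": 24.8},
]

table = promote_global_attributes(rows, global_attrs, ["institution", "platform_code"])
```

Repeating a constant on every row looks wasteful, but columnar formats typically encode repeated values compactly, so the main cost is in deciding which attributes are worth promoting.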
Global attributes from NetCDF files don't get carried through to the Parquet format. What are the implications for the user of losing this metadata in the Parquet format?