aodn / IMOS-hackathon

Code emerging from the 2024 AODN Hackathon
GNU General Public License v3.0

Explore metadata limitations with Parquet format #26

Open BecCowley opened 6 months ago

BecCowley commented 6 months ago

Global attributes from NetCDF files don't get carried through to the Parquet format. What are the implications for the user of losing this metadata?

BecCowley commented 6 months ago

Can we transform the metadata into data? I mean, treat metadata as data: in the same way that we treat LATITUDE and LONGITUDE, we could also turn things like probe type, instrument type, recorder type, institute etc. into data.
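
A minimal sketch of the idea, using plain Python dicts to stand in for the NetCDF global attributes and the table rows. The attribute names (`institution`, `instrument_type`) and the helper `promote_attributes` are hypothetical and not taken from the AODN code; the point is only that the chosen attributes get repeated on every row so they survive as ordinary columns.

```python
def promote_attributes(rows, global_attrs, keep):
    """Return new rows with the selected global attributes added as columns."""
    # Attributes missing from the file become None so every row has the same columns
    extra = {name: global_attrs.get(name) for name in keep}
    return [{**row, **extra} for row in rows]


# Hypothetical global attributes read from one NetCDF file
global_attrs = {"institution": "CSIRO", "instrument_type": "XBT", "history": "..."}

# Hypothetical data rows from the same file
rows = [
    {"LATITUDE": -18.9, "LONGITUDE": 146.9, "TEMP": 27.1},
    {"LATITUDE": -19.2, "LONGITUDE": 147.5, "TEMP": 26.8},
]

table = promote_attributes(rows, global_attrs, keep=["institution", "instrument_type"])
```

The trade-off is repetition: each value is stored once per row, although columnar encodings usually compress constant columns well. Attributes not listed in `keep` (here `history`) are still dropped.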

BecCowley commented 6 months ago

https://gmd.copernicus.org/preprints/gmd-2021-138/gmd-2021-138.pdf A paper on how large NetCDF files are converted to Parquet format. They handle the global attributes by including an additional file.
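
As a rough illustration of the "additional file" approach (this is not the paper's implementation, just a sketch under assumptions), the global attributes of each source file could be written to a JSON sidecar next to the Parquet file. The helper `write_sidecar`, the `_attrs.json` naming and the example file names are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_sidecar(parquet_path, attrs_by_source):
    """Write per-source-file global attributes to a JSON file beside the Parquet file."""
    parquet_path = Path(parquet_path)
    # e.g. dataset.parquet -> dataset_attrs.json (naming is an assumption)
    sidecar = parquet_path.with_name(parquet_path.stem + "_attrs.json")
    sidecar.write_text(json.dumps(attrs_by_source, indent=2, sort_keys=True))
    return sidecar


# Hypothetical global attributes, keyed by the original NetCDF file name
attrs_by_source = {
    "file_a.nc": {"institution": "CSIRO", "title": "hypothetical deployment A"},
    "file_b.nc": {"institution": "ANFOG", "title": "hypothetical deployment B"},
}

with tempfile.TemporaryDirectory() as tmp:
    sidecar = write_sidecar(Path(tmp) / "dataset.parquet", attrs_by_source)
    # Reading the sidecar back recovers the attributes unchanged
    restored = json.loads(sidecar.read_text())
```

Keying by source file name keeps the metadata lossless per file, but the user then has to join it back to the rows themselves, for example through a file name column in the Parquet data.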

lbesnard commented 6 months ago

For a bit more explanation on this: the metadata currently written in the Parquet sidecar is the metadata of the dataset, not the metadata of the original input NetCDF files. This creates an "issue" (very similar to what we had anyway with the data stored in PostgreSQL) where file-specific NetCDF metadata is lost. In other words, the Parquet format is "lossy" compared to the original NetCDF files. For example, with the glider data, how would we store this kind of information?

    "PLATFORM": {
      "type": "string",
      "trans_system_id": "Irridium",
      "positioning_system": "GPS",
      "platform_type": "Slocum G2",
      "platform_maker": "Teledyne Webb Research",
      "firmware_version_navigation": 7.1,
      "firmware_version_science": 7.1,
      "glider_serial_no": "416",
      "battery_type": "Alkaline",
      "glider_owner": "CSIRO",
      "operating_institution": "ANFOG",
      "long_name": "platform informations"
    },
    "DEPLOYMENT": {
      "type": "string",
      "deployment_start_date": "2015-10-21-T05:00:02Z",
      "deployment_start_latitude": -18.9373,
      "deployment_start_longitude": 146.881,
      "deployment_start_technician": "Gregor, Rob",
      "deployment_end_date": "2015-10-27-T01:56:23Z",
      "deployment_end_latitude": -19.2358,
      "deployment_end_longitude": 147.5188,
      "deployment_end_status": "recovered",
      "deployment_pilot": "pilot, CSIRO",
      "long_name": "deployment informations"
    },
    "SENSOR1": {
      "type": "string",
      "sensor_type": "CTD",
      "sensor_maker": "Seabird",
      "sensor_model": "GPCTD",
      "sensor_serial_no": "9117",
      "sensor_calibration_date": "2013-09-17",
      "sensor_parameters": "TEMP, CNDC, PRES, PSAL",
      "long_name": "sensor1 informations"
    },
    "SENSOR2": {
      "type": "string",
      "sensor_type": "ECO Puck",
      "sensor_maker": "Wetlabs",
      "sensor_model": "FLBBCDSLC",
      "sensor_serial_no": "3345",
      "sensor_calibration_date": "2013-10-07",
      "sensor_parameters": "CPHL, CDOM, VBSC",
      "long_name": "sensor2 informations"
    },
...
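
One possible answer, sketched here with plain dicts, is to flatten the nested variable attributes into prefixed names that could then be stored as columns or as key/value metadata. The helper `flatten_attributes` and the `VARIABLE_attribute` naming are assumptions for illustration, not something the current code does.

```python
def flatten_attributes(nested, sep="_"):
    """Flatten {"VAR": {"attr": value}} into {"VAR<sep>attr": value}."""
    flat = {}
    for variable, attrs in nested.items():
        for name, value in attrs.items():
            flat[f"{variable}{sep}{name}"] = value
    return flat


# A small excerpt of the glider metadata shown above
nested = {
    "PLATFORM": {"platform_type": "Slocum G2", "glider_serial_no": "416"},
    "SENSOR1": {"sensor_type": "CTD", "sensor_model": "GPCTD"},
}

columns = flatten_attributes(nested)
```

With an underscore separator the flattened names cannot always be split back unambiguously, so a separator that never appears in variable names may be a safer choice if the mapping has to be reversible.
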
mhidas commented 6 months ago

@lbesnard I think what @BecCowley is talking about is adding some global attributes from the original NetCDF files into columns in the Parquet product. I'm pretty sure you talked about this as being possible with your code, you just have to configure it to do it, right?

mhidas commented 6 months ago

https://github.com/aodn/aodn_cloud_optimised/blob/main/README_add_new_dataset.md#global-attributes-as-variables

@BecCowley you just need to specify which global attributes we should be adding.