ERDDAP / erddap

ERDDAP is a scientific data server that gives users a simple, consistent way to download subsets of gridded and tabular scientific datasets in common file formats and make graphs and maps. ERDDAP is a Free and Open Source (Apache and Apache-like) Java Servlet from NOAA NMFS SWFSC Environmental Research Division (ERD).
Creative Commons Zero v1.0 Universal

Support single griddap dataset with 2d and 1d variables. #178

Open ChrisJohnNOAA opened 3 months ago

ChrisJohnNOAA commented 3 months ago

This needs some investigation into whether and how to implement the feature. Original message below:

Pinging @ChrisJohnNOAA: is the capability that @jklymak describes something ERDDAP can support, or could support in future?

As I understand, the desired behavior is to have a single griddap dataset which serves 2-D gridded variables like temperature(profile_num, depth_bin) as well as 1-D variables like lat(profile_num), without broadcasting these 1-D variables to 2-D.

From reading the docs @rmendels highlighted ("In EDDGrid datasets, all data variables MUST use (share) all of the axis variables."), it seems this is not currently supported. So you would have to create two different datasets on your ERDDAP server to achieve this.

Originally posted by @callumrollo in https://github.com/ERDDAP/erddap/discussions/177#discussioncomment-10180612
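
To make the restriction concrete, here is a minimal Python sketch (not ERDDAP code) that checks a set of variables against the rule quoted above. The function name `variables_breaking_axis_rule` and the dictionary layout are hypothetical and chosen only for illustration; the variable and dimension names come from the example in the message.

```python
def variables_breaking_axis_rule(axis_names, data_variables):
    """Return the names of data variables that do not use every axis.

    axis_names: tuple of axis (dimension) names, in order.
    data_variables: dict mapping variable name -> tuple of dimension names.
    """
    return sorted(
        name
        for name, dims in data_variables.items()
        if tuple(dims) != tuple(axis_names)
    )


# The example from the message: one 2-D variable and one 1-D variable.
axes = ("profile_num", "depth_bin")
variables = {
    "temperature": ("profile_num", "depth_bin"),
    "lat": ("profile_num",),
}

offenders = variables_breaking_axis_rule(axes, variables)
print(offenders)  # ['lat']
```

Under this reading, `lat` is the variable that would have to be broadcast to 2-D or moved to a second dataset.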

jklymak commented 3 months ago

Thanks for considering @ChrisJohnNOAA

As noted elsewhere in the discussion, the NCEI standard has this format as well: https://www.ncei.noaa.gov/data/oceans/ncei/formats/netcdf/v2.0/index.html. For example, https://www.ncei.noaa.gov/thredds-ocean/catalog/example/v2.0/catalog.html?dataset=example/v2.0/NCEI_trajectoryProfile_template_v2.0_2016-09-22_181838.014029.nc has a structure like:

Dimensions:      (trajectory: 1, obs: 10, z: 4)
Coordinates:
  * trajectory   (trajectory) int32 -2147483647
    time         (trajectory, obs) object ...
    lat          (trajectory, obs) float64 ...
    lon          (trajectory, obs) float64 ...
  * z            (z) float64 1.0 2.0 3.0 4.0
Dimensions without coordinates: obs
Data variables:
    sal          (trajectory, obs, z) float64 ...
    temp         (trajectory, obs, z) float64 ...

Also H.6.2 at https://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/aphs06.html

This is a super useful way to organize data sets so hopefully it's not too hard to implement.
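
As a rough illustration of why this layout does not fit a single EDDGrid dataset, the following Python sketch groups the variables from the listing above by the dimensions they use. The helper `group_by_dimensions` is hypothetical; the variable names and dimensions are copied from the listing.

```python
from collections import defaultdict


def group_by_dimensions(variables):
    """Group variable names by the tuple of dimensions they are defined on."""
    groups = defaultdict(list)
    for name, dims in variables.items():
        groups[tuple(dims)].append(name)
    return {dims: sorted(names) for dims, names in groups.items()}


# Variables and dimensions from the NCEI trajectoryProfile template listing.
ncei_template = {
    "trajectory": ("trajectory",),
    "time": ("trajectory", "obs"),
    "lat": ("trajectory", "obs"),
    "lon": ("trajectory", "obs"),
    "z": ("z",),
    "sal": ("trajectory", "obs", "z"),
    "temp": ("trajectory", "obs", "z"),
}

groups = group_by_dimensions(ncei_template)
for dims, names in groups.items():
    print(dims, names)
```

The per-observation coordinates (`time`, `lat`, `lon`) and the profile variables (`sal`, `temp`) land in different groups, which is the mismatch discussed in this issue.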

jenseva commented 3 months ago

Thanks for considering this!

Copying over from the discussion here:

I believe Callum is correct, this is a present limitation in ERDDAP.

Here is an excerpt from Bob in 2017: _If a variable in the source file is e.g., lat and lon values that use different dimensions than the main data variables and that convert the projection x,y locations into lat and lon, then in ERDDAP they need to be in a separate dataset.

See https://coastwatch.pfeg.noaa.gov/erddap/download/setupDatasetsXml.html#dataStructures and the subsequent few paragraphs.

I know this sounds goofy and severely limiting. It is the one major situation where using ERDDAP isn't the best choice. But it only affects a small percentage of the total data files in NOAA (that is small consolation for you where it affects perhaps 100% of your data files for polarwatch). There is a solution -- a modification to ERDDAP that would support this but doing it would be a massive effort on my part (a couple of months with no distractions) so I haven't had time to do it.

There was a reason for doing it this way: it is this slightly-simpler-than-netcdf data model that allows ERDDAP to read data from many file types and write data to many file types. So there is great benefit, but it comes at a cost. Few people/groups/datasets pay the cost, but you are. Sorry._

The solution we had to implement at PolarWatch meant there were two datasets, which was a bit of a hack and difficult for users.

It would be great to see this feature added to ERDDAP! I agree there is value in having this type of synthesized glider data accessible via griddap over tabledap. The list of benefits is quite long.

Best, Jenn

jcermauwedu commented 2 months ago

@callumrollo

@jklymak pointed out this post to me. Please try using EDDTableFromMultidimNcFiles. It is a bit messy, but I did manage to get the active acoustic echograms into ERDDAP combined with the typical glider environmental data (temperature, salinity, ...). I am currently also trying to walk over the NGDAC netCDF-2.0 solution to OG-1.0 using the same dataset. See: https://acoustics.fish.washington.edu/erddap/files/unit_507_20240512T0000/

It is still a work in progress, but a companion dataset will appear that will be the OG cross walked version. Grab me via email or join the conversation on UG2 Slack #data.

The pattern I am attempting to utilize should work for trajectory and profile files.

I can make example datasets and XML configuration files available as well. Just let me know.

jklymak commented 2 months ago

@jcermauwedu It's hard to see what you mean here from the linked ERDDAP files. They are just the usual trajectory files, are they not? What files does EDDTableFromMultidimNcFiles produce?

rmendels commented 2 months ago

@jcermauwedu @jklymak The ERDDAP access is at https://acoustics.fish.washington.edu/erddap/tabledap/index.html?page=1&itemsPerPage=1000. I have slowly been working out the same approach; it is how ERDDAP handles some of the discrete geometry datasets. A point of note in the installation instructions:

"When you look at the dataset's metadata in ERDDAP™, the DSG dataset appears to be in ERDDAP's internal format (a giant, database-like table). It isn't in one of the DSG formats (e.g., the dimensions and metadata aren't right), but the information needed to treat the dataset as a DSG dataset is in the metadata (for example, cdm_data_type=TimeSeries and cdm_timeseries_variables=aCsvListOfStationRelatedVarables in the global metadata and cf_role=timeseries_id for some variable). If a user requests a subset of the dataset in a .ncCF (an .nc file in DSG's Contiguous Ragged Array file format) or .ncCFMA file (a .nc file in DSG's Multidimensional Array file format), that file will be a valid CF DSG file. WARNING: However, if the dataset was set up incorrectly (so that the promises made by the metadata aren't true), then the response file will be technically valid but will be incorrect in some way."

There are several other gotchas in using this. Most importantly, your definition of the CDM data type may not be ERDDAP's; read the docs starting at https://erddap.github.io/setupDatasetsXml.html#cdm_data_type. Each type has certain other required metadata that tell ERDDAP which data plays a given role.
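
To illustrate the kind of consistency the metadata has to satisfy, here is a small Python sketch (not part of ERDDAP) that reports missing companion metadata for a given `cdm_data_type`. The `TimeSeries` entry follows the documentation quoted above. The `TrajectoryProfile` entry is an assumption that should be checked against the linked docs, and the function name, the table, and the example attributes are all hypothetical.

```python
# Required companions per cdm_data_type. The TimeSeries entry follows the
# documentation quoted above; the TrajectoryProfile entry is an assumption
# to verify against the ERDDAP docs.
REQUIREMENTS = {
    "TimeSeries": {
        "global_attributes": ["cdm_timeseries_variables"],
        "cf_roles": ["timeseries_id"],
    },
    "TrajectoryProfile": {
        "global_attributes": ["cdm_trajectory_variables", "cdm_profile_variables"],
        "cf_roles": ["trajectory_id", "profile_id"],
    },
}


def missing_dsg_metadata(global_attrs, variable_attrs):
    """List the required metadata that is absent for the declared cdm_data_type."""
    cdm_type = global_attrs.get("cdm_data_type")
    required = REQUIREMENTS.get(cdm_type)
    if required is None:
        return [f"no rule recorded for cdm_data_type={cdm_type!r}"]
    problems = []
    for attr in required["global_attributes"]:
        if not global_attrs.get(attr):
            problems.append(f"missing global attribute {attr}")
    present_roles = {attrs.get("cf_role") for attrs in variable_attrs.values()}
    for role in required["cf_roles"]:
        if role not in present_roles:
            problems.append(f"no variable with cf_role={role}")
    return problems


# Hypothetical, deliberately incomplete metadata for demonstration.
example_globals = {
    "cdm_data_type": "TrajectoryProfile",
    "cdm_trajectory_variables": "trajectory",
}
example_variables = {
    "trajectory": {"cf_role": "trajectory_id"},
    "profile_id": {},
}

problems = missing_dsg_metadata(example_globals, example_variables)
for problem in problems:
    print(problem)
```

A check like this only catches missing declarations; it cannot tell whether the promises made by the metadata are actually true of the data.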

jcermauwedu commented 2 months ago

Thanks for your comments. I will read more into the cdm_data_types. All this is in an attempt to get active acoustic data into the NGDAC so that it subsequently becomes available via the ERDDAP service.

Looking at a reference file (Rutgers) deployment: https://gliders.ioos.us/erddap/info/ru32-20200111T1444-delayed/index.html. The cdm_data_type is TrajectoryProfile. So, we have stuck to this type for now. The NGDAC expects a series of profiles. Once the deployment is finished, I believe it can also take a series of profiles in a single trajectory. Some of the delayed deployments still upload a series of individual profiles.

I am still in the middle of creating a version of the NGDAC netCDF-2.0 specification that fully passes the IOOS Compliance Checker and an OG-1.0 format version of the same dataset. These can be referenced now at: https://acoustics.fish.washington.edu/erddap/tabledap/index.html?page=1&itemsPerPage=1000

The graph for those datasets now defaults to the same echogram. I only have a single profile walked over from the v2 to v2_OG (OG-1.0). The netCDF file is mostly compliant except for some time specifications that I do not necessarily agree with, and I have opened an issue about them at the OG GitHub.

Once I get these settled and fully walked over, I need to send samples to Leila@NGDAC (leila.baghdad-brahim@tetratech.com) for review.

In a nutshell, the format specifies the time and depth coordinate dimensions. The echogram has 20 bins per sample/ping for each time coordinate. So, our first attempt was to use time(time, bin) and depth(time, bin). But this creates a lot of wasted space, even with netCDF's handling of missing values and sparse data.

Our next attempt is to create an independent set of time and depth coordinate dimensions, with the prefix echogram_ added to the dimension names. This creates a completely independent set of axes and cleanly separates the typical environmental data (temperature and salinity) from the active acoustic data. It also allows efficient storage of both sets of information and lets us maintain a single set of profiles or a single trajectory file.
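
As a back-of-the-envelope illustration of the storage argument, the Python sketch below compares the number of cells in the backscatter array under the two layouts. It assumes one possible reading of the first attempt, namely that the shared time axis is the union of the environmental sample times and the ping times with no coincident timestamps. The sample and ping counts are hypothetical; only the 20 bins per ping comes from the description above.

```python
BINS_PER_PING = 20  # from the description above


def cells_shared_axis(n_env_samples, n_pings, bins=BINS_PER_PING):
    """Cells in sv(time, bin) when time is the union of both sets of timestamps."""
    return (n_env_samples + n_pings) * bins


def cells_independent_axes(n_pings, bins=BINS_PER_PING):
    """Cells in echogram_sv(echogram_time, echogram_bin)."""
    return n_pings * bins


n_env_samples = 10_000  # hypothetical number of CTD samples
n_pings = 500           # hypothetical number of echosounder pings

shared = cells_shared_axis(n_env_samples, n_pings)
independent = cells_independent_axes(n_pings)
fill_fraction = 1 - independent / shared

print(shared, independent, round(fill_fraction, 3))
```

With these made-up counts, most of the shared-axis array would hold fill values, which is the waste the independent echogram_ dimensions avoid.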

v2:

        double echogram_sv(echogram_time, echogram_bin) ;
                echogram_sv:_FillValue = NaN ;
                echogram_sv:units = "1" ;
                echogram_sv:long_name = "Volume backscattering strength" ;
                echogram_sv:colorBarMinimum = -80. ;
                echogram_sv:colorBarMaximum = -30. ;
                echogram_sv:colorBarPalette = "EK80" ;
                echogram_sv:comment = "dimensionless units (dB re 1 m-1)" ;
                echogram_sv:ioos_category = "Other" ;
                echogram_sv:standard_name = "acoustic_volume_backscattering_strength_in_sea_water" ;
                echogram_sv:platform = "platform" ;
                echogram_sv:observation_type = "measured" ;
                echogram_sv:coordinates = "echogram_time echogram_depth echogram_lon echogram_lat" ;

Unfortunately OG-1.0 also requires us to define separate coordinates beyond N_MEASUREMENTS.

v2_OG:

        double ECHOGRAM_SV(ECHOGRAM_N_MEASUREMENTS, ECHOGRAM_N_BINS) ;
                ECHOGRAM_SV:_FillValue = NaN ;
                ECHOGRAM_SV:units = "1" ;
                ECHOGRAM_SV:long_name = "Volume backscattering strength" ;
                ECHOGRAM_SV:colorBarMinimum = -80. ;
                ECHOGRAM_SV:colorBarMaximum = -30. ;
                ECHOGRAM_SV:colorBarPalette = "EK80" ;
                ECHOGRAM_SV:comment = "dimensionless units (dB re 1 m-1)" ;
                ECHOGRAM_SV:ioos_category = "Other" ;
                ECHOGRAM_SV:standard_name = "acoustic_volume_backscattering_strength_in_sea_water" ;
                ECHOGRAM_SV:platform = "platform" ;
                ECHOGRAM_SV:observation_type = "measured" ;
                ECHOGRAM_SV:coordinates = "lat_uv lon_uv time_uv" ;

Still hammering on this, but I will be happy to share example datasets and the XML configuration files for ERDDAP that enable these to work. It seems like I need to take a deep dive into the example data in ERDDAP in reference to the cdm_data_type.

What is important here is that 2-D glider datasets are accumulating. This includes ADCP data, also 2-D in nature, which is now being collected on glider platforms. These are not the fixed mooring platforms of GCOOS (https://erddap.gcoos.org/erddap/info/wmo_42385/index.html), which I was asked to look at as a reference to help us form a data model for the active acoustic data.