emontgomery-usgs opened this issue 6 years ago
@kwilcox, got no response to this. Does guidance exist for someone wanting to generate portal-compliant CF? Thanks!
@kwilcox I'm guessing the answer is that the dataset must:

- pass with `CF 1.6` selected as the compliance check
- have `CMG_Portal` in the `Project` global attribute, for example, as shown here: https://gamone.whoi.edu/thredds/dodsC/coawst_4/use/fmrc/coawst_4_use_best.ncd.html. Or is that only true for model output?

If it passes the compliance checker tests for a DSG file it will work in the portal. I am not scanning anything but the CSW server. Each individual file will need to make it into the CSW catalog before it is automatically added to the portal.
@kwilcox, just to clarify, the DSG file should be tested with the "CF 1.6" check, right?
Does it also need the "ACDD 1.1" or "ACDD 1.3" check?
If you want the files to be easily discoverable you should pass as many ACDD checks as possible, but it will have no effect on the portal.
Other requirements outside of CF are:

- `global:original_folder` - The project name (`WFAL_2016`, `DAUPHIN`, etc.)
- `global:MOORING` - The unique mooring ID
- `global:naming_authority` - `gov.usgs.cmgp`
- `global:id` - The unique identifier. Must include the mooring and the sensor package. Historically set to the filename since that is already identifiable and recognizable (`9651hwlb-a`, `10831dw-a`, `9431rb-a`, etc.).
- The `feature_type_instance` variable (or whatever variable has `cf_role: timeseries_id`) should be equal to `global:id`.
- `positive: up`. This isn't required, but the station netCDF files are standardized like this right now.

Updating for new harvesting method. These are in addition to satisfying as many CF/ACDD conventions as possible.
- `global:experiment_id` or `global:original_folder` - The name of the experiment (`WFAL_2016`, `DAUPHIN`)
- `global:MOORING` - The unique mooring ID; interpreted as a string internally.
- `global:metadata_link` - External link to a reference to this experiment/station (DOI link preferred if available).
- `global:id` - A unique identifier for this file. Historically set to the filename since that is already identifiable and recognizable (`9651hwlb-a`, `10831dw-a`, `9431rb-a`, etc.). This is used in the portal as a discriminant to distinguish variables of the same type being measured differently at the same mooring.
- The combination of `experiment_id` and `MOORING` should produce a unique feature type for the portal (station page).
- `global:cmgportal_ignore` - If this attribute is present, the file is ignored during ingestion.
- `var:coordinates` - Variables without a `coordinates` attribute are ignored. This should adhere to CF.
- `var:standard_name` - Variables without a `standard_name` are ignored.
- `var:units` - Variables without a `units` attribute are ignored.
- `var:vertical_datum` - If the variable's data needs a `vertical_datum` associated with it to be interpreted correctly, it should go here. The global `crs` is ignored for vertical datums at the variable level.
- `var:discriminant` - Overrides the `global:id` attribute and is used for the discriminant.
- `var:cmgportal_ignore` - If this attribute is present, the variable is ignored during ingestion.

@kwilcox This seems reasonable, and should work on most of our data.
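To catch problems before publishing, the requirements could be encoded in a small pre-flight check. This is only a sketch, assuming the global and per-variable attributes have been read into plain dicts (e.g. via the netCDF4 library); the function and constant names here are mine, not the portal's:

```python
# Hypothetical pre-flight check of the portal ingestion requirements above.
# Assumes attributes are already in plain dicts, not the portal's real code.

REQUIRED_GLOBALS = ("MOORING", "metadata_link", "id")

def check_globals(attrs):
    """Return a list of problems with the global attributes."""
    problems = []
    # Either experiment_id or original_folder must name the experiment.
    if "experiment_id" not in attrs and "original_folder" not in attrs:
        problems.append("need global experiment_id or original_folder")
    for name in REQUIRED_GLOBALS:
        if name not in attrs:
            problems.append(f"missing global {name}")
    return problems

def ingestible_variables(variables):
    """Filter a {name: attrs} mapping down to variables the portal would keep."""
    keep = {}
    for name, attrs in variables.items():
        # Presence of cmgportal_ignore skips the variable outright.
        if "cmgportal_ignore" in attrs:
            continue
        # coordinates, standard_name, and units are all required.
        if all(k in attrs for k in ("coordinates", "standard_name", "units")):
            keep[name] = attrs
    return keep
```

Running `check_globals` on a file's global attributes and `ingestible_variables` on its variables would show what the portal is likely to drop before the file ever reaches the CSW.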
We do not currently have any instances of cmgportal_ignore, but could use it to not ingest burst data types where time is 2d (which I believe would break everything). Does this global att require a value of a certain type? Should we choose 1/0 or a char y/n?
I'm only checking if the `cmgportal_ignore` attribute is present in the globals or variable attributes and never reading the value. The type and value to use is up to you.
Variables with interesting dimensionality (2d time) are already "ignored" automatically because they don't fall nicely into the visualization buckets of the other data. It would be optional to set the ignore attribute on those. I was thinking to use this on datasets that should not be published in the portal yet (undergoing QC still?) or variables we "know" shouldn't be displayed (`Tx_1211`) so we can avoid hard-coding those types of things into the conversion scripts.
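In other words, the check is presence-only, so any value works. A minimal sketch of that logic, assuming the attributes are available as a plain dict:

```python
def is_ignored(attrs):
    """Presence of cmgportal_ignore alone triggers the skip.

    The value is never read, so 1, 0, "y", or "" all behave the same.
    """
    return "cmgportal_ignore" in attrs
```

Note that even a falsy value like `0` still counts as present, so pick whatever convention is easiest to write.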
OK, we'll make our own rules :>.
I had missed that it could be a global or variable att. The capacity to skip an entire file or just a variable will be powerful!
@kwilcox, @dnowacki-usgs is working on getting data from other centers into CF-compliant NetCDF files and then into ERDDAP.
Could we just harvest the ISO records from ERDDAP into pycsw, or do we need the ISO records created by nciso, crawling a THREDDS catalog that contains the NetCDF files because they have the DAP links?
Just in case it's important: we noticed that the thredds-crawled ISO time series records on gamone:/opt/docker/pycsw/store/iso_records/ts were last updated November 15, 2018
Wait a minute. @kwilcox you don't get the sensor data that displays in the portal from the CSW, do you?
We do not, but we should be. Have the metadata changes above all been implemented and are available from the CSW?
So how about we do an experiment: set `Project=CMG_Portal` and the above metadata attributes, and if that works, we can add the datasets from @emontgomery-usgs that currently are not making it to the portal.
@rsignell-usgs, @kwilcox: is `Project=CMG_Portal` still a required attr? It's not in Kyle's comment from 2019-04-09. Can we get clarification on that?
Here's the catalog page for four datasets that (I think) satisfy all the attrs described in the above-linked comment that would be good for a test!
The requirements above were for a dataset to be valid for ingestion. To identify the datasets from a CSW we need another key. I'm mobile and can't confirm the `Project=CMG_Portal`, but it would be the same for the modeling datasets we are already filtering on from the CSW.
OK. I just added `project=CMG_Portal` as was done in the dataset Rich linked.
@dnowacki-usgs: I looked at one of the files in the catalog and noticed the dimension variables (time, z) have `_FillValue` attributes. FYI, it displays OK in Panoply, though upside down due to z-positive-down in the file. I will be interested to see how this works out in the portal; maybe having the `_FillValue` in dimensions is no longer an issue.
Good catch, Ellyn. According to CF, coordinate variables can't have _FillValue (still 😃). Maybe it would have worked in the portal, but I updated the files anyway.
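For future files, a quick scan could catch this before publishing. A sketch, assuming each variable's attributes have been read into a dict (e.g. via the netCDF4 library) and that the coordinate variable names are known:

```python
def fillvalue_violations(variables, coord_names=("time", "z", "lat", "lon")):
    """Flag coordinate variables that carry a _FillValue attribute.

    CF requires coordinate variables to be complete (no missing values),
    so _FillValue is not allowed on them.
    """
    return [name for name in coord_names
            if name in variables and "_FillValue" in variables[name]]
```

Data variables may of course still have `_FillValue`; only the coordinate variables are checked here.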
Yeehaw! Though I was hoping to be proved wrong.
@rsignell-usgs Anything else magic needed for these to be ingested by the CSW? Querying the CSW for these datasets doesn't bring up any results...
@dnowacki-usgs we need to add that THREDDS catalog to the list of catalogs being crawled. It's like step 3 for adding model data except that instead of editing the NcML list of catalogs we want to edit the [non-NcML list of catalogs](https://github.com/USGS-CMG/usgs-cmg-portal/blob/master/catalog_harvest/get_daily_iso.py#L33-L35).
Can you submit a PR to add this catalog to the list? https://geoport.usgs.esipfed.org/thredds/catalog/sand/usgs/users/dnowacki/doi-F7VD6XBF/catalog.html
You should be able to do that easily by just clicking the pen icon and editing, since it's just a one line change.
OK, PR submitted and merged, and the ISO metadata file has been generated in `/opt/docker/pycsw/force/iso_records/F7VD6XBF`. But these still don't show up when querying the CSW. Are there other steps that need to be done?
@kwilcox We have added four new timeseries data sets to the CSW, and they have `CMG_Portal` set. Can you confirm that you are able to harvest these new timeseries data to the portal?
Example title: Ocean Currents and Pressure Time Series at the Upper Florida Keys: Crocker Reef, FL: Site Aqua
Example identifier: `gov.USGS.coastal:UFK14Aqua1571aqc-trm`
@kwilcox would love to get this sorted this week. Seems like we are very close... could you let me know if you are able to harvest from the CSW?
Taking a look at this now. It would be really great if we could distinguish the time-series datasets from other things in the CSW. @rsignell-usgs is the cdm_data_type or featureType exposed anywhere in the CSW server?
This isn't blocking me, I'll have these datasets done by the end of today, they all look good. But in the future, when the CSW has potentially thousands of datasets, it would be nice to filter on "this is a time-series station". Right now I'm testing that the geographic bounds are a point rather than a bbox. Without that I would need to hit each individual DAP endpoint to check the featureType.
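That point-vs-bbox heuristic is simple to state. A sketch, assuming the record's bounds come back as a (west, south, east, north) tuple:

```python
def looks_like_station(bounds):
    """Treat a degenerate bounding box (zero extent) as a fixed station.

    Gridded/model output covers an area, so its bbox has nonzero extent;
    a moored time series collapses to a single point.
    """
    west, south, east, north = bounds
    return west == east and south == north
```

This is only a proxy; the authoritative answer would still come from the file's `featureType`, which is why exposing it in the CSW record would help.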
Also to note, we are losing the ability to quickly process a single experiment or mooring if using the CSW method. There isn't enough information in the response CSW records to query for a project or individual mooring (unless you know the CSW ID for every dataset that makes up a mooring - and this would mean reading every single DAP endpoint).
@rsignell-usgs How can we get additional `identifiers` into the CSW record so we can track the experiment and mooring in each record? These are stored as global attributes right now... does the mapping between global attributes and identifiers need to happen in ncISO, or can pycsw be configured to pull identifiers from specific locations in the ISO?
I thought I could solve this by requesting the full ISO response from your CSW server, but when I change the response namespace from `csw` (the default CSW response) to `gmd` (which returns ISO records) I only get 5 results and none are the time-series datasets.
@dnowacki-usgs The `time` variables for `UFK14S2` and `UFK14S1` need the `standard_name: time` attribute added.
https://geoport.usgs.esipfed.org/thredds/dodsC/sand/usgs/users/dnowacki/doi-F7VD6XBF/UFK14ArgE306aqd-trm.nc.html
https://geoport.usgs.esipfed.org/thredds/dodsC/sand/usgs/users/dnowacki/doi-F7VD6XBF/UFK14ArgE1495aqd-trm.nc.html
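A quick pre-flight check for the time coordinate could have caught this. A sketch, assuming the variable's attributes are read into a dict; the exact units string is only an example:

```python
import re

# CF time units look like "<unit> since <reference datetime>".
TIME_UNITS = re.compile(r"^(seconds|minutes|hours|days)\s+since\s+.+",
                        re.IGNORECASE)

def check_time_variable(attrs):
    """Return a list of problems with a time coordinate's attributes."""
    problems = []
    if attrs.get("standard_name") != "time":
        problems.append("standard_name must be 'time'")
    if not TIME_UNITS.match(attrs.get("units", "")):
        problems.append("units must be CF time units, "
                        "e.g. 'seconds since 1970-01-01T00:00:00Z'")
    return problems
```

An empty return value means the time coordinate should be interpretable by the portal and by CF tooling generally.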
@kwilcox: yikes, sorry. That was a failure to completely test the files on my part after some recent changes. Should be all good now.
@dnowacki-usgs Thanks, the 4 datasets look good and went right in. This is staging: curtain plots don't work correctly, so select Time Series for those to see the data for now.
@kwilcox very cool, thanks! Glad to have achieved this milestone. I noticed that 100677 shows the doi.org link in the "Web site" field on the portal, but the other three link to the THREDDS catalog page on geoport. The doi would be preferable for all, and all have this url set in the `metadata_link` attr, so I'm not sure why some are different.
What other steps are necessary to get it from staging to production?
Roger, will use `metadata_link` going forward. I switched to the THREDDS URL after processing the first station, resulting in the discrepancy.
Thanks @kwilcox.
Back in April you said the curtain plot bug was close to being fixed, which is apparently holding up our Grand Bay and Western Gulf of Maine datasets, in addition to this one from Florida. We'll be demoing the portal at the end of the week and we really want to have the new datasets in there... we can always add the staging URL but when people visit on their own we want them to see the same thing we are demoing.
We are close, curtain plots are working on this dev site: http://v2-launch.dev.axiomdatascience.com/?staging=true&portal_id=35&sensor_version=v2#metadata/100675/station/data. Will let you know as progress is made.
@kwilcox
One of our scientists wants to generate portal-compliant CF directly, bypassing the EPIC version. Fine with me, as long as you're not doing your own conversion from our EPIC to ingest new data. Please confirm that you're currently importing to the portal from files in http://geoport.whoi.edu/thredds/dodsC/silt/usgs/Projects/stellwagen/CF-1.6/catalog.html.
With this in mind, is there a document saying which features in CF are mandatory for data to play nice in the portal? I have 2 datasets released by other centers that are nominally CF compliant, but I suspect are missing some critical elements (like `standard_name`, `coordinates`, and `featureType` attributes, and possibly how time and depth are referenced). Knowing the necessary elements will be helpful for writing translators (if needed), and for writing portal-compliant CF files from the get-go.
Thanks!