emontgomery-usgs opened this issue 6 years ago
@kwilcox, got no response to this. Does guidance exist for someone wanting to generate portal-compliant CF? Thanks!
@kwilcox I'm guessing the answer is that the dataset must:

- pass with `CF 1.6` selected as the compliance check
- have `CMG_Portal` in the `Project` global attribute, for example, as shown here: https://gamone.whoi.edu/thredds/dodsC/coawst_4/use/fmrc/coawst_4_use_best.ncd.html. Or is that only true for model output?

If it passes the compliance checker tests for a DSG file it will work in the portal. I am not scanning anything but the CSW server. Each individual file will need to make it into the CSW catalog before it is automatically added to the portal.
@kwilcox, just to clarify, the DSG file should be tested with the "CF 1.6" check, right?
Does it also need the "ACDD 1.1" or "ACDD 1.3" check?
If you want the files to be easily discoverable you should pass as many ACDD checks as possible, but it will have no effect on the portal.
Other requirements outside of CF are:

- `global:original_folder` - The project name (`WFAL_2016`, `DAUPHIN`, etc.)
- `global:MOORING` - The unique mooring ID
- `global:naming_authority` - `gov.usgs.cmgp`
- `global:id` - The unique identifier. Must include the mooring and the sensor package. Historically set to the filename since that is already identifiable and recognizable (`9651hwlb-a`, `10831dw-a`, `9431rb-a`, etc.).
- The `feature_type_instance` variable (or whatever variable has `cf_role: timeseries_id`) should be equal to `global:id`.
- `positive: up`. This isn't required, but the station netCDF files are standardized like this right now.

Updating for new harvesting method. These are in addition to satisfying as many CF/ACDD conventions as possible.
- `global:experiment_id` or `global:original_folder` - The name of the experiment (`WFAL_2016`, `DAUPHIN`)
- `global:MOORING` - The unique mooring ID; interpreted as a string internally.
- `global:metadata_link` - External link to a reference to this experiment/station (DOI link preferred if available).
- `global:id` - A unique identifier for this file. Historically set to the filename since that is already identifiable and recognizable (`9651hwlb-a`, `10831dw-a`, `9431rb-a`, etc.). This is used in the portal as a discriminant to distinguish variables of the same type being measured differently at the same mooring.
- The combination of `experiment_id` and `MOORING` should produce a unique feature type for the portal (station page).
- `global:cmgportal_ignore` - If this attribute is present, the file is ignored during ingestion.
- `var:coordinates` - Variables without a `coordinates` attribute are ignored. This should adhere to CF.
- `var:standard_name` - Variables without a `standard_name` are ignored.
- `var:units` - Variables without a `units` attribute are ignored.
- `var:vertical_datum` - If the variable's data needs a `vertical_datum` associated with it to be interpreted correctly, it should go here. The global `crs` is ignored for vertical datums at the variable level.
- `var:discriminant` - Overrides the `global:id` attribute and is used for the discriminant.
- `var:cmgportal_ignore` - If this attribute is present, the variable is ignored during ingestion.

@kwilcox This seems reasonable, and should work on most of our data.
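To catch problems before publishing, the requirements could be encoded in a small pre-flight check. This is only a sketch, assuming the global and per-variable attributes have been read into plain dicts (e.g. via the netCDF4 library); the function and constant names here are mine, not the portal's:

```python
# Hypothetical pre-flight check of the portal ingestion requirements above.
# Assumes attributes are already in plain dicts, not the portal's real code.

REQUIRED_GLOBALS = ("MOORING", "metadata_link", "id")

def check_globals(attrs):
    """Return a list of problems with the global attributes."""
    problems = []
    # Either experiment_id or original_folder must name the experiment.
    if "experiment_id" not in attrs and "original_folder" not in attrs:
        problems.append("need global experiment_id or original_folder")
    for name in REQUIRED_GLOBALS:
        if name not in attrs:
            problems.append(f"missing global {name}")
    return problems

def ingestible_variables(variables):
    """Filter a {name: attrs} mapping down to variables the portal would keep."""
    keep = {}
    for name, attrs in variables.items():
        # Presence of cmgportal_ignore skips the variable outright.
        if "cmgportal_ignore" in attrs:
            continue
        # coordinates, standard_name, and units are all required.
        if all(k in attrs for k in ("coordinates", "standard_name", "units")):
            keep[name] = attrs
    return keep
```

Running `check_globals` on a file's global attributes and `ingestible_variables` on its variables would show what the portal is likely to drop before the file ever reaches the CSW.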
We do not currently have any instances of cmgportal_ignore, but could use it to not ingest burst data types where time is 2d (which I believe would break everything). Does this global att require a value of a certain type? Should we choose 1/0 or a char y/n?
I'm only checking if the `cmgportal_ignore` attribute is present in the globals or variable attributes and never reading the value. The type and value to use is up to you.
Variables with interesting dimensionality (2d time) are already "ignored" automatically because they don't fall nicely into the visualization buckets of the other data. It would be optional to set the ignore attribute on those. I was thinking to use this on datasets that should not be published in the portal yet (undergoing QC still?) or variables we "know" shouldn't be displayed (`Tx_1211`) so we can avoid hard-coding those types of things into the conversion scripts.
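In other words, the check is presence-only, so any value works. A minimal sketch of that logic, assuming the attributes are available as a plain dict:

```python
def is_ignored(attrs):
    """Presence of cmgportal_ignore alone triggers the skip.

    The value is never read, so 1, 0, "y", or "" all behave the same.
    """
    return "cmgportal_ignore" in attrs
```

Note that even a falsy value like `0` still counts as present, so pick whatever convention is easiest to write.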
OK, we'll make our own rules :>.
I had missed that it could be a global or variable att. The capacity to skip an entire file or just a variable will be powerful!
@kwilcox, @dnowacki-usgs is working on getting data from other centers into CF-compliant NetCDF files and then into ERDDAP.
Could we just harvest the ISO records from ERDDAP into pycsw, or do we need the ISO records created by nciso, crawling a THREDDS catalog that contains the NetCDF files because they have the DAP links?
Just in case it's important: we noticed that the thredds-crawled ISO time series records on gamone:/opt/docker/pycsw/store/iso_records/ts were last updated November 15, 2018
Wait a minute. @kwilcox you don't get the sensor data that displays in the portal from the CSW, do you?
We do not, but we should be. Have the metadata changes above all been implemented and are available from the CSW?
So how about we do an experiment: set `Project=CMG_Portal` and the above metadata attributes, and if that works, we can add the datasets from @emontgomery-usgs that currently are not making it to the portal.
@rsignell-usgs, @kwilcox: is `Project=CMG_Portal` still a required attr? It's not in Kyle's comment from 2019-04-09. Can we get clarification on that?
Here's the catalog page for four datasets that (I think) satisfy all the attrs described in the above-linked comment that would be good for a test!
The requirements above were for a dataset to be valid for ingestion. To identify the datasets from a CSW we need another key. I'm mobile and can't confirm the `Project=CMG_Portal`, but it would be the same for the modeling datasets we are already filtering on from the CSW.
OK. I just added `project=CMG_Portal` as was done in the dataset Rich linked.
@dnowacki-usgs: I looked at one of the files in the catalog and noticed the dimension variables (time, z) have `_FillValue` attributes. FYI, it displays OK in Panoply, though upside down due to z-positive-down in the file. I will be interested to see how this works out in the portal; maybe having the `_FillValue` in dimensions is no longer an issue.
Good catch, Ellyn. According to CF, coordinate variables can't have _FillValue (still 😃). Maybe it would have worked in the portal, but I updated the files anyway.
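For future files, a quick scan could catch this before publishing. A sketch, assuming each variable's attributes have been read into a dict (e.g. via the netCDF4 library) and that the coordinate variable names are known:

```python
def fillvalue_violations(variables, coord_names=("time", "z", "lat", "lon")):
    """Flag coordinate variables that carry a _FillValue attribute.

    CF requires coordinate variables to be complete (no missing values),
    so _FillValue is not allowed on them.
    """
    return [name for name in coord_names
            if name in variables and "_FillValue" in variables[name]]
```

Data variables may of course still have `_FillValue`; only the coordinate variables are checked here.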
Yeehaw! Though I was hoping to be proved wrong.
@rsignell-usgs Anything else magic needed for these to be ingested by the CSW? Querying the CSW for these datasets doesn't bring up any results...
@dnowacki-usgs we need to add that THREDDS catalog to the list of catalogs being crawled. It's like step 3 for adding model data except that instead of editing the NcML list of catalogs we want to edit the [non-NcML list of catalogs](https://github.com/USGS-CMG/usgs-cmg-portal/blob/master/catalog_harvest/get_daily_iso.py#L33-L35).
Can you submit a PR to add this catalog to the list? https://geoport.usgs.esipfed.org/thredds/catalog/sand/usgs/users/dnowacki/doi-F7VD6XBF/catalog.html
You should be able to do that easily by just clicking the pen icon and editing, since it's just a one line change.
OK, PR submitted and merged, and the ISO metadata file has been generated in `/opt/docker/pycsw/force/iso_records/F7VD6XBF`. But these still don't show up when querying the CSW. Are there other steps that need to be done?
@kwilcox We have added four new timeseries data sets to the CSW, and they have `CMG_Portal` set. Can you confirm that you are able to harvest these new timeseries data to the portal?
Example title: Ocean Currents and Pressure Time Series at the Upper Florida Keys: Crocker Reef, FL: Site Aqua
Example identifier: `gov.USGS.coastal:UFK14Aqua1571aqc-trm`
@kwilcox would love to get this sorted this week. Seems like we are very close... could you let me know if you are able to harvest from the CSW?
Taking a look at this now. It would be really great if we could distinguish the time-series datasets from other things in the CSW. @rsignell-usgs is the cdm_data_type or featureType exposed anywhere in the CSW server?
This isn't blocking me, I'll have these datasets done by the end of today, they all look good. But in the future, when the CSW has potentially thousands of datasets, it would be nice to filter on "this is a time-series station". Right now I'm testing that the geographic bounds are a point rather than a bbox. Without that I would need to hit each individual DAP endpoint to check the featureType.
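That point-vs-bbox heuristic is simple to state. A sketch, assuming the record's bounds come back as a (west, south, east, north) tuple:

```python
def looks_like_station(bounds):
    """Treat a degenerate bounding box (zero extent) as a fixed station.

    Gridded/model output covers an area, so its bbox has nonzero extent;
    a moored time series collapses to a single point.
    """
    west, south, east, north = bounds
    return west == east and south == north
```

This is only a proxy; the authoritative answer would still come from the file's `featureType`, which is why exposing it in the CSW record would help.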
Also to note, we are losing the ability to quickly process a single experiment or mooring if using the CSW method. There isn't enough information in the response CSW records to query for a project or individual mooring (unless you know the CSW ID for every dataset that makes up a mooring - and this would mean reading every single DAP endpoint).
@rsignell-usgs How can we get additional `identifiers` into the CSW record so we can track the experiment and mooring in each record? These are stored as global attributes right now... does the mapping between global attributes and identifiers need to happen in ncISO, or can pycsw be configured to pull identifiers from specific locations in the ISO?
I thought I could solve this by requesting the full ISO response from your CSW server, but when I change the response namespace from `csw` (the default CSW response) to `gmd` (which returns ISO records) I only get 5 results and none are the time-series datasets.
@dnowacki-usgs The `time` variables for `UFK14S2` and `UFK14S1` need the `standard_name: time` attribute added.
https://geoport.usgs.esipfed.org/thredds/dodsC/sand/usgs/users/dnowacki/doi-F7VD6XBF/UFK14ArgE306aqd-trm.nc.html
https://geoport.usgs.esipfed.org/thredds/dodsC/sand/usgs/users/dnowacki/doi-F7VD6XBF/UFK14ArgE1495aqd-trm.nc.html
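A quick pre-flight check for the time coordinate could have caught this. A sketch, assuming the variable's attributes are read into a dict; the exact units string is only an example:

```python
import re

# CF time units look like "<unit> since <reference datetime>".
TIME_UNITS = re.compile(r"^(seconds|minutes|hours|days)\s+since\s+.+",
                        re.IGNORECASE)

def check_time_variable(attrs):
    """Return a list of problems with a time coordinate's attributes."""
    problems = []
    if attrs.get("standard_name") != "time":
        problems.append("standard_name must be 'time'")
    if not TIME_UNITS.match(attrs.get("units", "")):
        problems.append("units must be CF time units, "
                        "e.g. 'seconds since 1970-01-01T00:00:00Z'")
    return problems
```

An empty return value means the time coordinate should be interpretable by the portal and by CF tooling generally.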
@kwilcox: yikes, sorry. That was a failure to completely test the files on my part after some recent changes. Should be all good now.
@dnowacki-usgs Thanks, the 4 datasets look good and went right in. This is staging: curtain plots don't work correctly, so select Time Series for those to see the data for now.
@kwilcox very cool, thanks! Glad to have achieved this milestone. I noticed that 100677 shows the doi.org link in the "Web site" field on the portal, but the other three link to the THREDDS catalog page on geoport. The doi would be preferable for all, and all have this url set in the `metadata_link` attr, so I'm not sure why some are different.
What other steps are necessary to get it from staging to production?
Roger, will use `metadata_link` going forward. I switched to the THREDDS URL after processing the first station, resulting in the discrepancy.
Thanks @kwilcox.
Back in April you said the curtain plot bug was close to being fixed, which is apparently holding up our Grand Bay and Western Gulf of Maine datasets, in addition to this one from Florida. We'll be demoing the portal at the end of the week and we really want to have the new datasets in there... we can always add the staging URL but when people visit on their own we want them to see the same thing we are demoing.
We are close, curtain plots are working on this dev site: http://v2-launch.dev.axiomdatascience.com/?staging=true&portal_id=35&sensor_version=v2#metadata/100675/station/data. Will let you know as progress is made.
@kwilcox
One of our scientists wants to generate portal-compliant CF directly, bypassing the EPIC version. Fine with me, as long as you're not doing your own conversion from our EPIC to ingest new data. Please confirm that you're currently importing to the portal from files in http://geoport.whoi.edu/thredds/dodsC/silt/usgs/Projects/stellwagen/CF-1.6/catalog.html.
With this in mind, is there a document saying which features in CF are mandatory for data to play nice in the portal? I have 2 datasets released by other centers that are nominally CF compliant, but I suspect are missing some critical elements (like `standard_name`, `coordinates`, and `featureType` attributes, and possibly how time and depth are referenced). Knowing the necessary elements will be helpful for writing translators (if needed), and for writing portal-compliant CF files from the get-go.
Thanks!