SECOORA / skill_score

Prototypes for the SECOORA skill score
MIT License
7 stars 1 forks source link

Finding data outside the defined bbox #197

Closed ocefpaf closed 9 years ago

ocefpaf commented 9 years ago

The notebooks uses the bbox below.

bbox = [-87.40, 24.25, -74.70, 36.70]

and filters the data search with this fes define filter:

filter_list = [fes.And([fes.BBox(bbox), begin, end, or_filt, not_filt])]

However, several DAP URLs returned are outside the search box. See https://github.com/ioos/secoora/blob/master/notebooks/timeSeries/sst/2014-07-07/log.txt#L245-L284.

or more specifically http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/028p1/028p1_d16.nc.html

These stations metadata seems to be correct though.

geospatial_lat_min: 33.847797
geospatial_lat_max: 33.861347
geospatial_lat_units: degrees_north
geospatial_lat_resolution: 1.0E-5
geospatial_lon_min: -118.64079
geospatial_lon_max: -118.62448
geospatial_lon_units: degrees_east

For now I am filtering these stations out with:

dap_urls = [url for url in dap_urls if not is_station(url)]

where,

from netCDF4 import Dataset
def is_station(url):
    nc = Dataset(url)
    station = False
    if hasattr(nc, 'cdm_data_type'):
        if nc.cdm_data_type.lower() == 'station':
            station = True
    return station

That was the fastest route I found to circumvent this issue. Avoiding xray and iris objects by querying for the cdm_data_type directly with the raw netCDF4 interface, but it is still slow very slow :unamused:

It would be nice to find the source of the problem and fix it. @rsignell-usgs any ideas?

rsignell-usgs commented 9 years ago

@ocefpaf, indeed, the longitude/latitude range metadata in the OPeNDAP dataset are correct, yet the metadata in the ISO record are incorrect, with east longitude and south latitude set to 0.0:

                     <gmd:westBoundLongitude>
                        <gco:Decimal>-118.64013671875</gco:Decimal>
                     </gmd:westBoundLongitude>
                     <gmd:eastBoundLongitude>
                        <gco:Decimal>0.0</gco:Decimal>
                     </gmd:eastBoundLongitude>
                     <gmd:southBoundLatitude>
                        <gco:Decimal>0.0</gco:Decimal>
                     </gmd:southBoundLatitude>
                     <gmd:northBoundLatitude>
                        <gco:Decimal>33.86117172241211</gco:Decimal>
                     </gmd:northBoundLatitude>

I checked the bounds on one pacioos time series records ISO record and the bounds seem fine.

PACIOOS is at TDS Version 4.3.20 - 20131125.1409. which is identical to CDIP, but perhaps PACIOOS has a different ncISO version?

@pacioos, do you have any insight on this?

or @lukecampbell / @kknee, I see you somehow got these bounding boxes from east coast CDIP correct: https://github.com/ioos/catalog/issues/129

pacioos commented 9 years ago

As documented at the bottom of our ncISO records, PacIOOS is using ncISO Version 2.3. I've noticed ncISO get the bounds wrong before, but for us these are rare cases when the dataset is of type RADIAL (i.e. HFR).

Note that ncISO looks in more than one place for the longitude/latitude range metadata. In addition to looking at CF/NetCDF attributes (e.g. geospatial_lat_min, etc.) it is also looking at the THREDDS catalog metadata (geospatialCoverage), so I would make sure you have it defined in both places to see if that fixes the problem.

Cheers, John

On Tue, Aug 11, 2015 at 2:38 AM, Rich Signell notifications@github.com wrote:

@ocefpaf https://github.com/ocefpaf, indeed, the longitude/latitude range metadata in the OPeNDAP dataset are correct, yet the metadata in the ISO record http://thredds.cdip.ucsd.edu/thredds/iso/cdip/archive/028p1/028p1_d16.nc?catalog=http%3A%2F%2Fthredds.cdip.ucsd.edu%2Fthredds%2Fcatalog%2Fcdip%2Farchive%2F028p1%2Fcatalog.html&dataset=CDIP_Archive%2F028p1%2F028p1_d16.nc are incorrect, with east longitude and south latitude set to 0.0:

                 <gmd:westBoundLongitude>
                    <gco:Decimal>-118.64013671875</gco:Decimal>
                 </gmd:westBoundLongitude>
                 <gmd:eastBoundLongitude>
                    <gco:Decimal>0.0</gco:Decimal>
                 </gmd:eastBoundLongitude>
                 <gmd:southBoundLatitude>
                    <gco:Decimal>0.0</gco:Decimal>
                 </gmd:southBoundLatitude>
                 <gmd:northBoundLatitude>
                    <gco:Decimal>33.86117172241211</gco:Decimal>
                 </gmd:northBoundLatitude>

I checked the bounds on one pacioos time series records ISO record http://oos.soest.hawaii.edu/thredds/iso/hioos/nss/ns01agg?catalog=http%3A%2F%2Foos.soest.hawaii.edu%2Fthredds%2Fidd%2Fnss_hioos.html&dataset=NS01agg and the bounds seem fine.

PACIOOS is at TDS Version 4.3.20 - 20131125.1409. which is identical to CDIP, but perhaps PACIOOS has a different ncISO version?

@pacioos https://github.com/pacioos, do you have any insight on this?

or @lukecampbell/@kknee https://github.com/kknee, I see you somehow got these bounding boxes from east coast CDIP correct: ioos/catalog#129 https://github.com/ioos/catalog/issues/129

— Reply to this email directly or view it on GitHub https://github.com/ioos/secoora/issues/197#issuecomment-129858443.

rsignell-usgs commented 9 years ago

@pacioos, CDIP is also using ncISO Version 2.3, so I guess the only difference left is that you duplicated the bounds in the THREDDS metadata and also with attributes inside the dataset, yet CDIP they only specified the bounds with attributes inside the dataset.

But that should be okay, so it sounds like we have a bug in ncISO for when you only specify the attributes.

But do you understand the xsl enough to see why geospatial_lat_min and geospatial_lon_max would get set to 0 (or perhaps thats the default value?), when geospatial_lat_max and geospatial_lat_max are apparently okay?

https://github.com/Unidata/threddsIso/blob/master/src/main/resources/xsl/nciso/UnidataDD2MI.xsl

@geoneubie, perhaps you see the problem?

rsignell-usgs commented 9 years ago

@amilan17 or @shane-axiom, how about you guys?
Can you spot the issue in the XSL?

amilan17 commented 9 years ago

The NcML has the lat and lon described twice.

First as described above

  <attribute name="geospatial_lat_min" type="float" value="33.847797"/>
  <attribute name="geospatial_lat_max" type="float" value="33.861347"/>
  <attribute name="geospatial_lon_min" type="float" value="-118.64079"/>
  <attribute name="geospatial_lon_max" type="float" value="-118.62448"/>

Secondly in a CFMetadata group

<group name="CFMetadata">
    <attribute name="geospatial_lon_min" value="-118.64013671875" type="float"/>
    <attribute name="geospatial_lat_min" value="0.0" type="float"/>
    <attribute name="geospatial_lon_max" value="0.0" type="float"/>
    <attribute name="geospatial_lat_max" value="33.86117172241211" type="float"/>
    <attribute name="geospatial_lon_resolution" value="0.0"/>
    <attribute name="geospatial_lat_resolution" value="0.0"/>
  </group>

As you can see the CF Metadata attributes are wrong and this is what the XSL is looking at translating first. I'm not sure why these attributes are not correct in the NcML....

http://thredds.cdip.ucsd.edu/thredds/ncml/cdip/archive/028p1/028p1_d16.nc?catalog=http%3A%2F%2Fthredds.cdip.ucsd.edu%2Fthredds%2Fcatalog%2Fcdip%2Farchive%2F028p1%2Fcatalog.html&dataset=CDIP_Archive%2F028p1%2F028p1_d16.nc

rsignell-usgs commented 9 years ago

Anna,

Very interesting. The NcML service is of course also produced by ncISO, so if the problem is not in the XSL, then in must be in the ncISO java code itself.

Because CDIP is specifying these attributes correctly in the original data as evidenced by the global attributes here: http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/028p1/028p1_d16.nc.html Note there is only one correct specification of the limits.

Note also there is no specification of geospatial bounds in the THREDDS metadata or else we would see the limits here at the dataset level view:

http://thredds.cdip.ucsd.edu/thredds/catalog/cdip/archive/028p1/catalog.html?dataset=CDIP_Archive/028p1/028p1_d16.nc

But that should be OKAY!

The NcISO precedence rules https://geo-ide.noaa.gov/wiki/index.php?title=NetCDF_Attribute_Convention_for_Dataset_Discovery#Determining_an_Order_of_Precedence say that UDDC should take precedence over TDS attributes, and since they are not even supplied, everything should be fine.

So I conclude there must be a coding error somewhere in the ncISO java code that this use case has uncovered.

I looked a bit here: https://github.com/Unidata/threddsIso/search?utf8=%E2%9C%93&q=maxLon but couldn't find the problem.

@shane-axiom, can you find it?

amilan17 commented 9 years ago

searching for 'CFMetadata' might get you closer to where this code is in the java. https://github.com/Unidata/threddsIso/search?utf8=%E2%9C%93&q=CFMetadata&type=Code

Anna ~~Metadata Adds Meaning.~~ Anna.Milan@noaa.gov, 303-497-5099 NOAA National Centers for Environmental Information formerly NOAA’s National Geophysical Data Center http://www.ngdc.noaa.gov/metadata/emma ~~~~~~~~

On Wed, Aug 12, 2015 at 11:44 AM, Rich Signell notifications@github.com wrote:

Anna,

Very interesting. The NcML service is of course also produced by ncISO, so if the problem is not in the XSL, then in must be in the ncISO java code itself.

Because CDIP is specifying these attributes correctly in the original data as evidenced by the global attributes here:

http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/028p1/028p1_d16.nc.html Note there is only one correct specification of the limits.

Note also there is no specification of geospatial bounds in the THREDDS metadata or else we would see the limits here at the dataset level view:

http://thredds.cdip.ucsd.edu/thredds/catalog/cdip/archive/028p1/catalog.html?dataset=CDIP_Archive/028p1/028p1_d16.nc

But that should be OKAY!

The NcISO precedence rules https://geo-ide.noaa.gov/wiki/index.php?title=NetCDF_Attribute_Convention_for_Dataset_Discovery#Determining_an_Order_of_Precedence say that UDDC should take precedence over TDS attributes, and since they are not even supplied, everything should be fine.

So I conclude there must be a coding error somewhere in the ncISO java code that this use case has uncovered.

I looked a bit here: https://github.com/Unidata/threddsIso/search?utf8=%E2%9C%93&q=maxLon but couldn't find the problem.

@shane-axiom https://github.com/shane-axiom, can you find it?

— Reply to this email directly or view it on GitHub https://github.com/ioos/secoora/issues/197#issuecomment-130387259.

srstsavage commented 9 years ago

CFMetadata geospatial extent is calculated here:

https://github.com/Unidata/threddsIso/blob/master/src/main/java/thredds/server/metadata/util/ThreddsExtentUtil.java#L205

It doesn't appear to care about the geospatial_min_lat etc attributes and instead scans through the latitude and longitude variables.

The dataset in question (028p1_d16.nc) does actually have 0 values in both the lat and lon data variables (gpsLatitude and gpsLongitude), and since these don't match the defined _FillValue of -999.99f for those variables the 0 values are respected.

TL;DR the dataset is broken and threddsIso ignores geospatial extent attributes.

ocefpaf commented 9 years ago

@shane-axiom Awesome! Thanks for pinpointing this!!

rsignell-usgs commented 9 years ago

Interesting, but ncISO should not be reading gpsLongitude and gpsLatitude!

Looking at the OPeNDAP access form: http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/028p1/028p1_d16.nc.html I see the coordinates attribute on every variable points to metaDeployLatitude and metaDeployLongitude, which do have valid values: http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/028p1/028p1_d16.nc.ascii?metaDeployLatitude,metaDeployLongitude

Also, the UDDC attributes should have taken precendence anyway, right @geoneubie? https://geo-ide.noaa.gov/wiki/index.php?title=NetCDF_Attribute_Convention_for_Dataset_Discovery#Determining_an_Order_of_Precedence

geoneubie commented 9 years ago

FeautreType is not grid for this dataset so ncISO knows it can't get lat long values in the traditional way of calculating them from the CF CoordinateAxes (because the values returned by the NetCDF libraries are not the true lat lon min max values in that case). So ncISO loops through the available variables until it finds one with a standard_name attribute of latitude or longitude. In this instance there are two variables gpsLatitude and metadataDeployLatitude with standard_name latitude and apparently gpsLatitude is found first. I see two quick fixes: 1) Change the standard_name of gpsLatitude variable or 2) Override the default xslt to use the ncml attributes. I can help with 2) if you need information on how to do that. Not sure if there is a clean longer term fix, but I'm open to ideas.

Dave

On Wed, Aug 12, 2015 at 12:21 PM, Rich Signell notifications@github.com wrote:

Interesting, but ncISO should not be reading gpsLongitude and gpsLatitude!

Looking at the OPeNDAP access form:

http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/028p1/028p1_d16.nc.html I see the coordinates attribute on every variable points to metaDeployLatitude and metaDeployLongitude, which do have valid values:

http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/028p1/028p1_d16.nc.ascii?metaDeployLatitude,metaDeployLongitude

Also, the UDDC attributes should have taken precendence anyway, right @geoneubie https://github.com/geoneubie?

https://geo-ide.noaa.gov/wiki/index.php?title=NetCDF_Attribute_Convention_for_Dataset_Discovery#Determining_an_Order_of_Precedence

— Reply to this email directly or view it on GitHub https://github.com/ioos/secoora/issues/197#issuecomment-130400374.

srstsavage commented 9 years ago

Should ncISO check for the existence of geospatial_min_lat et al. and use those values if present, regardless of whether the dataset is or isn't a grid?

geoneubie commented 9 years ago

ncISO does check, but the precedence rules favor calculated values over human maintained values (to answer Rich's earlier question). These can be overriden at an individual site by modifying the xslt at the local TDS install.

lukecampbell commented 9 years ago

FWIW I'm in favor of having the values calculated using data in the variable in lieu of the global attributes. In my experience, unless automation is in play, most folks don't update metadata once the base set has been applied.

rsignell-usgs commented 9 years ago

@lesserwhirls, @dopplershift or @cwardgar, I'm wondering about this @geoneubie statement above:

"FeatureType is not grid for this dataset so ncISO knows it can't get lat long values in the traditional way of calculating them from CF CoordinateAxes (because the values returned by the NetCDF libraries are not the true lat lon min max values in that case)."

The FeatureType for the dataset in question is timeSeries. Is there a netcdf-java method that gives the correct lon/lat bounding box for a FeatureType='timeSeries'?

cwardgar commented 9 years ago

Hi Rich,

The CF feature type timeSeries maps to the CDM feature type STATION. A STATION is expected to be fixed at a single lat/lon location, and so there are no meaningful methods for finding a bounding box. The relevant interface is StationTimeSeriesFeature.

You'll notice that there is a method getBoundingBox(), but it is inherited from a super-class and doesn't have the semantics you're looking for. In fact, you can see where the value is set here. It's just an "empty" box centered around the station's location.

FWIW, the TRAJECTORY feature type has a meaningful bounding box.

rsignell-usgs commented 9 years ago

@cwardgar , so what should the logic be to determine the lon/lat extent of this timeSeries feature type?

ncISO is just picking the first variables it finds with standard_name of longitude and latitude, which is gpsLatitude and gpsLongitude, instead of metaDeployLatitude and metaDeployLongitude. The latter are single point values defined in the coordinates attributes of the data variables, while the gpsLatitude and gpsLongitude are the actual gps readings that are a time series, and the values move around a bit with the buoy watch circle.

Arguably the actual bounding box should be the gpsLatitude and gpsLongitude should be the actual bounding box, but as you say, one expects only a single lat/lon point for a timeSeries featureType.

rsignell-usgs commented 9 years ago

@geoneubie, the CDIP folks (Darren Wright) removed the standard_name attribute from gpsLongitude and gpsLatitude variables on these problem datasets (see the OPeNDAP URL http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/028p1/028p1_d16.nc.html for example), but the ISO metadata produced by ncISO is still showing 0.0 for min lat and lon:

http://thredds.cdip.ucsd.edu/thredds/iso/cdip/archive/028p1/028p1_d16.nc?catalog=http%3A%2F%2Fthredds.cdip.ucsd.edu%2Fthredds%2Fcatalog%2Fcdip%2Farchive%2F028p1%2Fcatalog.html&dataset=CDIP_Archive%2F028p1%2F028p1_d16.nc

Is the logic being used by ncISO different than you thought or do we have some other kind of problem?

cwardgar commented 9 years ago

@rsignell-usgs, in the CDM, stations (i.e. timeSeries) are associated with a single lat/lon location, even if it's nominal (so metaDeployLatitude and metaDeployLongitude should be what the CDM chooses). We don't have a concept of a buoy that drifts away from an initial location. However, a potential way to solve that would be to subclass StationTimeSeriesFeatureImpl and then override getBoundingBox() to return the box around gpsLatitude and gpsLongitude.

Of course, that all assumes that ncIso is using NetCDF-Java's PointFeature machinery in the first place, which doesn't seem to be the case.

ocefpaf commented 9 years ago

I cannot fix this issue for the skill-score as it is a data provider issue. A workaround is already in place and my notebooks are ignore these stations. So I am closing this issue.

PS: Please feel free to continue using this thread to discuss the problem though. I am just cleaning the secoora skill score issue tracker.

rsignell-usgs commented 7 years ago

@lesserwhirls, it seems the fundamental problem here is that nciso is not using the built-in netcdf-java functionality for determining the bounding box. Do you agree?

If you do agree, then perhaps @noaaroland can take a look, with @kevin-obrien approval, of course.

lesserwhirls commented 7 years ago

Yes. However, moving the code to use the PointFeature stuff is probably not a super simple task, and it's changed so much in v5.0, that it would probably be worth waiting to make that change to ncIso once it is using v5.0. I do have a branch of ncIso building with the latest snapshot of v5.0 (probably needs updated with latest changes to master), and would happily welcome any feedback on the 5.0 api for point features.