WikiWatershed / model-my-watershed

The web application front end for Model My Watershed.
https://modelmywatershed.org
Apache License 2.0

WDC GetSeriesCatalogForBox3 provides richer response and more search parameters than GetSeriesCatalogForBox2 #1931

Open emiliom opened 7 years ago

emiliom commented 7 years ago

@kdeloach, I think you (and I) have so far not done much testing on the WDC GetSeriesCatalogForBox3 API query. My "series" queries have focused on GetSeriesCatalogForBox2 (eg, this notebook). But it turns out that GetSeriesCatalogForBox3 has two important advantages:

I ran a test, and GetSeriesCatalogForBox3 seems to work fine, though probably with the same current bugs and keyword limitations as in GetSeriesCatalogForBox2.

cc @aufdenkampe

kdeloach commented 7 years ago

Thanks for looking into this. Do we want to display series or sites? Our current implementation uses GetSitesInBox2 combined with GetServicesInBox2 (Source). It doesn't look like it would be difficult to use GetSeriesCatalogForBox3 instead, though.

emiliom commented 7 years ago

Ah, I didn't fully catch that. Got it.

Using GetSitesInBox2 combined with GetServicesInBox2 for now seems ok. @aufdenkampe and I should first discuss what should be displayed in the results; depending on what we decide, a series response may make more sense.

BTW, for future consideration: it occurred to me that currently there are only 94 "data services" in WDC (ie, the maximum total that can be returned by GetServicesInBox2), and those change rarely. Maybe a future strategy for better search responsiveness could be to get all services in the US lower 48 (based on a GetServicesInBox2 query) and cache them once a day, then just use that cache together with the actual query results from GetSitesInBox2.
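The once-a-day caching idea could look roughly like the sketch below. It is only an illustration: `DailyCache` and the `fetch` callable are hypothetical names, and the actual GetServicesInBox2 request is left to whatever function is passed in.

```python
import time
from typing import Callable, Optional


class DailyCache:
    """Keeps one fetched result and refreshes it once it is older than max_age_seconds."""

    def __init__(self, fetch: Callable[[], list],
                 max_age_seconds: float = 24 * 60 * 60,
                 clock: Callable[[], float] = time.time):
        self._fetch = fetch            # hypothetical: wraps the GetServicesInBox2 request
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the expiry logic can be tested
        self._value: Optional[list] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> list:
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._max_age
        if stale:
            # Only hit the remote service when the cached copy is missing or too old.
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

Each dataset search would then call `cache.get()` and combine the result with the fresh GetSitesInBox2 response, so the services request is made at most once per day.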

kdeloach commented 7 years ago

Because there are so few services, the GetServicesInBox2 request is very fast. For the moment, this isn't a performance bottleneck. But to avoid making unnecessary requests, and for the reasons you described, it may be a good idea to cache the results anyway. I'll create an issue for this.

kdeloach commented 7 years ago

One advantage I see in using GetSeriesCatalogForBox3 is that the results contain conceptKeyword which we can use for client-side filtering. Neither GetServicesInBox2 nor GetSitesInBox2 exposes this field.

emiliom commented 7 years ago

One advantage I see in using GetSeriesCatalogForBox3 is that the results contain conceptKeyword which we can use for client-side filtering.

Just for reference, this is a broader difference between the GetSeriesCatalogForBox* queries and your current approach; it's not a unique feature of GetSeriesCatalogForBox3 per se.

A GetSeriesCatalogForBox* request that doesn't specify a keyword will get multiple series per site. Client-side processing could be used to make the response more user friendly (I think) by grouping the different series into a single site "dataset" record. But that's getting into details that Anthony and I should probably discuss first.

aufdenkampe commented 7 years ago

Yes! I agree with Emilio that we want to focus on the information in the GetSeriesCatalogForBox* responses, but grouped by Sitename. He and I should discuss this first, and soon.

The hierarchy of this information model here is that:

We want to organize our returns by Sites, but display (and search) the info on all the Series at each Site.

emiliom commented 7 years ago

FYI, especially for Anthony: I've updated the notebook gist I mentioned at the start of this issue to include examples from both GetSeriesCatalogForBox2 and GetSeriesCatalogForBox3.

aufdenkampe commented 7 years ago

@kdeloach and @ajrobbins , I had a conversation with @emiliom on Friday where we carefully explored the best approach for searching CUAHSI Water Data Center (WDC) and compiling the output provided to the user. We used these resources to inform our discussion:

The approach below resolves the open question about whether to use GetSitesInBox2 or GetSeriesCatalogForBox3. See #1858 and #1931 for more details. This will allow us to move forward on the CUAHSI WDC #1945. @emiliom can add where necessary.

A. These are the GET requests that Azavea should use:

  1. Azavea should run a GetSeriesCatalogForBox2 GET request each time a user does a dataset search

    • xmin, xmax, ymin, ymax as required arguments
    • beginDate, endDate as optional arguments
    • don't use additional arguments
    • Note: we presently prefer this over GetSeriesCatalogForBox3 because we're worried about query response time, and Box2 seems to provide just the right amount of metadata. (Note: While GetSeriesCatalogForBox2 does not have VariableUnitsAbbrev and GetSeriesCatalogForBox3 does, we're confident this will not be a problem because we only need units info once we fetch the actual data values. Let's reevaluate after we start fetching data.)
  2. Azavea should run a GetServicesInBox2 GET request once per day (or week?) for the entire world, and save the results to be merged with the returned results from GetSeriesCatalogForBox2 GET requests. This is captured in issue #1932.

  3. Azavea should NOT run any GetSitesInBox2* GET requests, because all relevant site info is included in each GetSeriesCatalogForBox2 GET request.
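As a rough sketch of the request described in A.1, the query string could be assembled as follows. The base URL is an assumption about where the HIS Central catalog service lives and should be checked against the WDC documentation; `series_catalog_url` is a hypothetical helper. The service may also expect additional (empty) parameters to be present, which this sketch does not cover.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: base URL of the HIS Central catalog web service; verify before use.
HIS_CENTRAL_BASE = "http://hiscentral.cuahsi.org/webservices/hiscentral.asmx"


def series_catalog_url(xmin: float, xmax: float, ymin: float, ymax: float,
                       begin_date: Optional[str] = None,
                       end_date: Optional[str] = None) -> str:
    """Build a GetSeriesCatalogForBox2 GET URL from the bounding box and optional dates."""
    params = {"xmin": xmin, "xmax": xmax, "ymin": ymin, "ymax": ymax}
    # beginDate and endDate are optional and only sent when provided.
    if begin_date:
        params["beginDate"] = begin_date
    if end_date:
        params["endDate"] = end_date
    return f"{HIS_CENTRAL_BASE}/GetSeriesCatalogForBox2?{urlencode(params)}"
```

No other arguments (such as a keyword) are included, in line with the "don't use additional arguments" recommendation.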

B. Azavea should develop a means to combine and filter GET request results in the following ways.

  1. Combine/merge all records by site ('location' code) from a GetSeriesCatalogForBox2 GET request.

    a. These fields, below, will be the same for each SeriesRecord with the same 'location' code

    • ServCode
    • ServURL
    • location
    • Sitename
    • latitude
    • longitude

    b. These fields, below, will be different for each SeriesRecord with the same 'location' code, and will be grouped by VarCode

    • VarCode
    • VarName
    • beginDate
    • endDate
    • ValueCount
    • datatype
    • valuetype
    • samplemedium
    • timeunits
    • conceptKeyword
    • genCategory
    • TimeSupport
  2. Append each of these new SiteRecords (created above in B1) with the associated ServicesRecord metadata that was saved from A2, above, using the ServCode / ServiceID (there's a 1:1 map for these).
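The merge described in B.1 and B.2 might be sketched like this. It assumes the series records have already been parsed into dictionaries keyed by the field names listed above, and that the saved services are available as a dictionary keyed by ServCode; `group_series_by_site` and the `series` / `service` keys are hypothetical names.

```python
SITE_FIELDS = ("ServCode", "ServURL", "location", "Sitename", "latitude", "longitude")
SERIES_FIELDS = ("VarCode", "VarName", "beginDate", "endDate", "ValueCount",
                 "datatype", "valuetype", "samplemedium", "timeunits",
                 "conceptKeyword", "genCategory", "TimeSupport")


def group_series_by_site(series_records, services_by_code):
    """Merge series records into one site record per 'location' code."""
    sites = {}
    for rec in series_records:
        loc = rec["location"]
        site = sites.get(loc)
        if site is None:
            # B.1.a: fields shared by every series at this location.
            site = {field: rec.get(field) for field in SITE_FIELDS}
            site["series"] = {}
            # B.2: attach the cached service metadata via ServCode.
            site["service"] = services_by_code.get(rec.get("ServCode"))
            sites[loc] = site
        # B.1.b: series-specific fields, grouped by VarCode.
        series = {field: rec.get(field) for field in SERIES_FIELDS}
        site["series"].setdefault(rec["VarCode"], []).append(series)
    return list(sites.values())
```

Each VarCode maps to a list so that several series sharing a variable code at one site are all kept.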

C. Later on, Azavea should develop a client-side means for filtering SiteRecords via a free-text search of all the terms in all the fields of the combined, hierarchical Site+Service(s)+Series result. Described in #1936.
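The free-text filter in C would run in the browser in practice; the Python sketch below only illustrates the matching logic, under the assumption that each SiteRecord is a nested structure of dictionaries and lists. `iter_text`, `matches`, and `filter_sites` are hypothetical names.

```python
def iter_text(value):
    """Yield every leaf value in a nested dict/list structure as a string."""
    if isinstance(value, dict):
        for child in value.values():
            yield from iter_text(child)
    elif isinstance(value, (list, tuple)):
        for child in value:
            yield from iter_text(child)
    elif value is not None:
        yield str(value)


def matches(site_record, query):
    """True when every whitespace-separated term appears somewhere in the record."""
    haystack = " ".join(iter_text(site_record)).lower()
    return all(term in haystack for term in query.lower().split())


def filter_sites(site_records, query):
    return [site for site in site_records if matches(site, query)]
```

Matching is case-insensitive and requires all terms to be present; an empty query returns every record.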

D. We will likely want to develop constructors to build (and expose) user friendly URLs for Services and selected Sites, which resolves the questions in #1859.

  1. Service URL. The ServURL from GetServicesInBox2 does not provide a web-friendly URL (e.g. for ServCode = "NWISDV", ServURL = "http://hydroportal.cuahsi.org/nwisdv/cuahsi_1_1.asmx"). However, a friendly URL can easily be constructed from ServiceID (which has a 1:1 map with ServCode) by following this pattern: http://hiscentral.cuahsi.org/pub_network.aspx?n=1
  2. Site URLs can be similarly constructed for some Services, such as https://waterdata.usgs.gov/nwis/uv/?site_no=14113000, when ServCode = "NWISDV" and location = "NWISDV:14113000".
    • We would create a Site URL constructor for a handful of important Services, such as USGS NWIS and Data.EnviroDIY. Let's start with ServCode = "NWISDV" as an example. We will not explore Site URLs from other services for now, but the list will expand in time.
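A minimal sketch of the two constructors in D, using only the URL patterns quoted above. The function names and the `SITE_URL_BUILDERS` table are hypothetical; only NWISDV is registered because it is the only site pattern given here.

```python
def service_url(service_id):
    """Friendly HIS Central page for a service, built from its ServiceID."""
    return f"http://hiscentral.cuahsi.org/pub_network.aspx?n={service_id}"


def _nwis_site_url(site_no):
    return f"https://waterdata.usgs.gov/nwis/uv/?site_no={site_no}"


# Per-service builders; more services can be registered here over time.
SITE_URL_BUILDERS = {"NWISDV": _nwis_site_url}


def site_url(serv_code, location):
    """Return a site URL for supported services, or None when there is no builder."""
    builder = SITE_URL_BUILDERS.get(serv_code)
    # location looks like "NWISDV:14113000"; the part after the colon is the site number.
    _, _, site_no = location.partition(":")
    if builder is None or not site_no:
        return None
    return builder(site_no)
```

For example, `site_url("NWISDV", "NWISDV:14113000")` produces the USGS link shown in D.2, while services without a registered builder simply get no link.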

kdeloach commented 7 years ago

These changes have been implemented in PR #1959. Check out the screenshots to compare the differences.

Notes:

GetSeriesCatalogForBox2 produces a greater volume of results, but the amount of metadata available hasn't increased much, compared to using GetSitesInBox2. The only fields common to series records are: ServCode, ServURL, Sitename, location, latitude, longitude, beginDate, and endDate. These are the fields we will expose from our API.

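We still don't have access to these fields for each resource:

  • author
  • created date (currently, this field is populated with beginDate-- not sure if this is correct)
  • updated date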

This is a known issue, but the beginDate and endDate filters don't seem to do anything. I get the same results no matter which dates I try.

We can dynamically generate URLs for each resource, if necessary. However, we don't need to generate URLs for services, since that is already available from the ServiceDescriptionURL field from GetServicesInBox2.

emiliom commented 7 years ago

GetSeriesCatalogForBox2 produces a greater volume of results,

Yes, that's expected, as Anthony has mentioned above.

but the amount of metadata available hasn't increased much, compared to using GetSitesInBox2

There's much more metadata coming in! See @aufdenkampe's comment above. Maybe what you mean is that there isn't much more metadata for the subset of metadata defined in your common dataset record metadata? Assuming my interpretation is correct, I guess that would be true b/c that dataset record metadata did not encompass the additional information Anthony listed in B.1.b of that comment above (except for beginDate and endDate).

We still don't have access to these fields for each resource:

  • author
  • created date (currently, this field is populated with beginDate-- not sure if this is correct)
  • updated date

These do not exist in the WDC response, per se. But that should be ok.

Depending on how author is used, the service provider (derived from ServCode and the results of GetServicesInBox2) could be used for it.

This is a known issue, but the beginDate and endDate filters don't seem to do anything. I get the same results no matter which dates I try.

Ok. I guess this depends on the roll-out of the fixed WDC Catalog API, which hadn't been released as of June 7.

emiliom commented 7 years ago

With Kevin gone, I don't know if @rajadain is now automatically pinged. So I'm pinging him here.

rajadain commented 7 years ago

Thanks for pinging me @emiliom, I'll subscribe to all issues created so far so that I'm notified. I'll go through the discussion and respond here shortly.

emiliom commented 7 years ago

After reading through the comments in #1959, I think it's clearer we have a misunderstanding about what Anthony's and my intent was. It's clear that in that PR, the metadata specific to a "series" (which is sort-of a synonym for "variable") is thrown out.

Anthony and I will submit a much more specific request/recommendation for what should be shown in the WDC dataset record boxes on the UI.

rajadain commented 7 years ago

Thanks, we'll wait for that.

emiliom commented 7 years ago

Just a couple of references, for future use:

aufdenkampe commented 7 years ago

@rajadain, we just created the Sample_WDC_Site_Record_BiGCZPortal_SearchResult Google Doc to provide an example record to display.

In brief, output should look like this:

NWISDV:14113000 KLICKITAT RIVER NEAR PITT, WA. Observations on SurfaceWater, Air. Variables: Discharge, stream – Temperature, air From U.S. Geological Survey (USGS) NWISDV web service. Date range for site: 1909-07-01 to 2017-01-26.

Which would be constructed from this set of responses:

<location>
<Sitename>[https://waterdata.usgs.gov/nwis/uv/?site_no=14113000]. Observations on <samplemedium 1>[, samplemedium 2, samplemedium 3, …]. 
Variables: <conceptKeyword 1> - <conceptKeyword 2> - …
From [<GSERV:SourceOrg> - <ServCode>](<GSERV:http://hiscentral.cuahsi.org/pub_network.aspx?n=SourceId>) web service.
Date range for site: <series Min beginDate> to <series Max endDate>.

Please see the Google Doc for better formatting and additional info.
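For illustration, the template above could be rendered along these lines. This is a sketch that leaves out the hyperlinks and assumes the series dates have already been normalized to ISO `YYYY-MM-DD` strings (so that string comparison gives the correct min and max); `site_summary` and its parameters are hypothetical names.

```python
def unique_in_order(values):
    """Drop empty values and duplicates while keeping first-seen order."""
    seen, out = set(), []
    for value in values:
        if value and value not in seen:
            seen.add(value)
            out.append(value)
    return out


def site_summary(location, sitename, series, source_org, serv_code):
    """Render the plain-text summary for one site from its list of series dicts."""
    media = unique_in_order(s.get("samplemedium") for s in series)
    keywords = unique_in_order(s.get("conceptKeyword") for s in series)
    # Assumes ISO-formatted date strings, which sort chronologically as text.
    begin = min(s["beginDate"] for s in series)
    end = max(s["endDate"] for s in series)
    return "\n".join([
        location,
        f"{sitename}. Observations on {', '.join(media)}.",
        f"Variables: {' - '.join(keywords)}",
        f"From {source_org} - {serv_code} web service.",
        f"Date range for site: {begin} to {end}.",
    ])
```

Keywords are joined with " - " rather than a comma because some conceptKeyword values (such as "Discharge, stream") contain commas themselves.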

rajadain commented 7 years ago

Thanks @aufdenkampe. I just tried out GetSeriesCatalogForBox3 and GetServicesInBox2, and was able to confirm that the data format you suggest is derivable.

Since this set of information isn't a lot, were you thinking this to be in the "list" view or the "detail" view? It could very well fit in a list view.

I was unable to find examples listing multiple conceptKeywords or samplemediums, at least in the samples used in @emiliom's Jupyter notebook. Is separating them with a "–" common practice, or simply an alternative given that "," is included in the value?

And, just to confirm, the site name links should only be generated for those that have ServCode = NWISDV? Other ServCodes I encountered were GLDAS_NOAH, NLDAS_NOAH, and MOPEX.

One potential use of "variable linking" would be to filter results by the clicked variable. Other ideas may present themselves as we progress further along the implementation.

emiliom commented 7 years ago

I was unable to find examples listing multiple conceptKeywords or samplemediums, at least in the samples used in @emiliom's Jupyter notebook.

Try this: NWISUV:01474500. It has a suite of water quality sensors (pH, oxygen, turbidity), in addition to discharge, plus rainfall. It should give you a diverse set of results for testing. Plus it's in the Azavea neighborhood: USGS 01474500 Schuylkill River at Philadelphia, PA. You can browse it in "my" Monitor-My-Watershed pilot application: http://www.wikiwatershed-vs.org/Explorer?action=oiw:fixed_platform:USGS_01474500

If you'd like to examine it using my jupyter notebook, these request parameters worked for me:

bbox1 = (39.9, -75.2, 40.0, -75.1)
keyword = ''
start_date = '01/01/2016'
end_date   = '12/31/2016'

Is separating them with a "–" common practice, or simply an alternative given that "," is included in the value?

Neither here nor there. It just seems more obvious than a comma, plus some of the "variable" (conceptKeywords) strings include commas.

And, just to confirm, the site name links should only be generated for those that have ServCode = NWISDV?

Yes, but I believe you should also add NWISUV, and possibly NWISGW (both USGS services). No other services, for now.