Open emiliom opened 7 years ago
Thanks for looking into this. Do we want to display series or sites? Our current implementation uses GetSitesInBox2 combined with GetServicesInBox2 (Source). It doesn't look like it would be difficult to use GetSeriesCatalogForBox3 instead, though.
Ah, I didn't fully catch that. Got it.
Using GetSitesInBox2 combined with GetServicesInBox2 for now seems ok. @aufdenkampe and I should first discuss what should be displayed in the results; depending on what we decide, a series response may make more sense.
BTW, for future consideration: it occurred to me that currently there are only 94 "data services" in WDC (i.e., the maximum total that can be returned by GetServicesInBox2), and those change rarely. Maybe a future strategy for better search responsiveness could be to get all services in the US lower 48 (based on a GetServicesInBox2 query) and cache them once a day, then just use that cache together with the actual query results from GetSitesInBox2.
Because there are so few services, the GetServicesInBox2 request is very fast. For the moment, this isn't a performance bottleneck. But to avoid making unnecessary requests, and for the reasons you described, it may be a good idea to cache the results anyway. I'll create an issue for this.
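A minimal sketch of the once-a-day caching idea, in Python. The `fetch` callable is hypothetical: it stands in for whatever code actually performs the GetServicesInBox2 request, and the class and parameter names are illustrative only, not taken from the existing implementation.

```python
import time
from typing import Callable, Optional

ONE_DAY_SECONDS = 24 * 60 * 60


class ServicesCache:
    """Keep the last services result and refresh it at most once per TTL."""

    def __init__(self, fetch: Callable[[], list], ttl: float = ONE_DAY_SECONDS,
                 clock: Callable[[], float] = time.time):
        self._fetch = fetch    # hypothetical: performs the real GetServicesInBox2 request
        self._ttl = ttl
        self._clock = clock    # injectable so the expiry logic can be tested
        self._services: Optional[list] = None
        self._fetched_at = 0.0

    def get(self) -> list:
        """Return cached services, refetching only when the cache is empty or stale."""
        now = self._clock()
        if self._services is None or now - self._fetched_at >= self._ttl:
            self._services = self._fetch()
            self._fetched_at = now
        return self._services
```

Each search would then call `get()` and only hit WDC for services when the cached copy is older than the TTL.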
One advantage I see in using GetSeriesCatalogForBox3 is that the results contain conceptKeyword, which we can use for client-side filtering. Neither GetServicesInBox2 nor GetSitesInBox2 exposes this field.
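As a rough illustration of client-side filtering on conceptKeyword, here is a short Python sketch. It assumes the series response has already been parsed into a list of dicts keyed by field name; that representation and the function name are assumptions, not part of the current code.

```python
def filter_by_keyword(series_records, keyword):
    """Return the records whose conceptKeyword contains `keyword` (case-insensitive).

    Assumes each record is a dict parsed from a GetSeriesCatalogForBox* response.
    An empty keyword returns every record unchanged.
    """
    needle = keyword.strip().lower()
    if not needle:
        return list(series_records)
    return [
        record for record in series_records
        if needle in (record.get("conceptKeyword") or "").lower()
    ]
```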
One advantage I see in using GetSeriesCatalogForBox3 is that the results contain conceptKeyword which we can use for client-side filtering.
Just for reference, this is a broader difference between the GetSeriesCatalogForBox* queries and your current approach; it's not a unique feature of GetSeriesCatalogForBox3 per se.
A GetSeriesCatalogForBox* request that doesn't specify a keyword will get multiple series per site. Client-side processing could be used to make the response more user friendly (I think) by grouping the different series into a single site "dataset" record. But that's getting into details that Anthony and I should probably discuss first.
Yes! I agree with Emilio that we want to focus on the information in the GetSeriesCatalogForBox*, but group by Sitename. He and I should discuss this first, and soon.
The hierarchy of this information model here is that:
VarCode & VarName), along with its associated metadata (beginDate, samplemedium, Speciation, MethodDesc, etc.).
We want to organize our returns by Sites, but display (and search) the info on all the Series at each Site.
FYI, especially for Anthony: I've updated the notebook gist I mentioned at the start of this issue, to include examples from both GetSeriesCatalogForBox2 and GetSeriesCatalogForBox3.
@kdeloach and @ajrobbins , I had a conversation with @emiliom on Friday where we carefully explored the best approach for searching CUAHSI Water Data Center (WDC) and compiling the output provided to the user. We used these resources to inform our discussion:
The approach below resolves the open question about whether to use GetSitesInBox2 or GetSeriesCatalogForBox3. See #1858 and #1931 for more details. This will allow us to move forward on the CUAHSI WDC #1945. @emiliom can add where necessary.
A1. Azavea should run a GetSeriesCatalogForBox2 GET request each time a user does a dataset search.
    Not GetSeriesCatalogForBox3, because we're worried about query response time, and Box2 seems to provide just the right amount of metadata. (Note: While GetSeriesCatalogForBox2 does not have VariableUnitsAbbrev and GetSeriesCatalogForBox3 does, we're confident this will not be a problem because we only need units info once we fetch the actual data values. Let's reevaluate after we start fetching data.)
A2. Azavea should run a GetServicesInBox2 GET request once per day (or week?) for the entire world, and save the results to be merged with the returned results from GetSeriesCatalogForBox2 GET requests. This is captured in issue #1932.
A3. Azavea should NOT run any GetSitesInBox2* GET requests, because all relevant site info is included in each GetSeriesCatalogForBox2 GET request.
B1. Combine/merge all records by site ('location' code) from a GetSeriesCatalogForBox2 GET request.
    a. These fields, below, will be the same for each SeriesRecord with the same 'location' code
    b. These fields, below, will be different for each SeriesRecord with the same 'location' code, and will be grouped by VarCode
B2. Append each of these new SiteRecords (created above in B1) with the associated ServicesRecord metadata that was saved from A2, above, using the ServCode / ServiceID (there's a 1:1 map for these).
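The merge described above could be sketched roughly as follows in Python. This is only an illustration under several assumptions: series records are dicts parsed from the GetSeriesCatalogForBox2 response, `services_by_code` is a hypothetical dict of saved GetServicesInBox2 records keyed by ServCode, and the `SITE_FIELDS` tuple is a guess at which fields are constant per site.

```python
# Assumed to be identical for every series at the same 'location' code.
SITE_FIELDS = ("ServCode", "ServURL", "location", "Sitename", "latitude", "longitude")


def merge_series_by_site(series_records, services_by_code):
    """Collapse series records into one record per 'location' code.

    Series-specific fields are grouped by VarCode, and the saved service
    metadata is attached by ServCode (None if the service is unknown).
    """
    sites = {}  # insertion-ordered, so sites keep first-seen order
    for record in series_records:
        location = record["location"]
        site = sites.get(location)
        if site is None:
            site = {field: record.get(field) for field in SITE_FIELDS}
            site["variables"] = {}
            site["service"] = services_by_code.get(record.get("ServCode"))
            sites[location] = site
        series_fields = {k: v for k, v in record.items() if k not in SITE_FIELDS}
        site["variables"].setdefault(record["VarCode"], []).append(series_fields)
    return list(sites.values())
```

Keeping the per-series fields under `variables` means nothing from the series level is discarded, so the UI can still display or search it later.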
These changes have been implemented in PR #1959. Check out the screenshots to compare the differences.
Notes:
GetSeriesCatalogForBox2 produces a greater volume of results, but the amount of metadata available hasn't increased much, compared to using GetSitesInBox2. The only fields common to series records are: ServCode, ServURL, Sitename, location, latitude, longitude, beginDate, and endDate. These are the fields we will expose from our API.
We still don't have access to these fields for each resource:
- author
- created date (currently, this field is populated with beginDate -- not sure if this is correct)
- updated date
This is a known issue, but the beginDate and endDate filters don't seem to do anything. I get the same results no matter which dates I try.
We can dynamically generate URLs for each resource, if necessary. However, we don't need to generate URLs for services, since that is already available from the ServiceDescriptionURL field from GetServicesInBox2.
GetSeriesCatalogForBox2 produces a greater volume of results,
Yes, that's expected, as Anthony has mentioned above.
but the amount of metadata available hasn't increased much, compared to using GetSitesInBox2
There's much more metadata coming in! See @aufdenkampe's comment above. Maybe what you mean is that there isn't much more metadata for the subset of metadata defined in your common dataset record metadata? Assuming this interpretation I'm making is correct, I guess that would be true b/c that dataset record metadata did not encompass the additional information Anthony listed in B.1.b of that comment above (except for beginDate and endDate).
We still don't have access to these fields for each resource:
- author
- created date (currently, this field is populated with beginDate-- not sure if this is correct)
- updated date
These do not exist in the WDC response, per se. But that should be ok.
Depending on how author is used, the service provider (derived from ServCode and the results of GetServicesInBox2) could be used for it.
This is a known issue, but the beginDate and endDate filters don't seem to do anything. I get the same results no matter which dates I try.
Ok. I guess this depends on the roll-out of the fixed WDC Catalog API, which hadn't been released as of June 7.
With Kevin gone, I don't know if @rajadain is now automatically pinged. So I'm pinging him here.
Thanks for pinging me @emiliom, I'll subscribe to all issues created so far so that I'm notified. I'll go through the discussion and respond here shortly.
After reading through the comments in #1959, I think it's clearer we have a misunderstanding about what Anthony's and my intent was. It's clear that in that PR, the metadata specific to a "series" (which is sort-of a synonym for "variable") is thrown out.
Anthony and I will submit a much more specific request/recommendation for what should be shown in the WDC dataset record boxes on the UI.
Thanks, we'll wait for that.
Just a couple of references, for future use:
GetSeriesCatalogForBox2 and GetSeriesCatalogForBox3
n value is the service or source ID, SourceId. The information on this page should be the same information available from a GetServices* request (eg, GetServicesInBox2); Anthony mentioned this in his long comment.
@rajadain, we just created the Sample_WDC_Site_Record_BiGCZPortal_SearchResult Google Doc to provide an example record to display.
In brief, output should look like this:
NWISDV:14113000 KLICKITAT RIVER NEAR PITT, WA. Observations on SurfaceWater, Air. Variables: Discharge, stream – Temperature, air From U.S. Geological Survey (USGS) NWISDV web service. Date range for site: 1909-07-01 to 2017-01-26.
Which would be constructed from this set of responses:
<location>
<Sitename>[https://waterdata.usgs.gov/nwis/uv/?site_no=14113000]. Observations on <samplemedium 1>[, samplemedium 2, samplemedium 3, …].
Variables: <conceptKeyword 1> - <conceptKeyword 2> - …
From [<GSERV:SourceOrg> - <ServCode>](<GSERV:http://hiscentral.cuahsi.org/pub_network.aspx?n=SourceId>) web service.
Date range for site: <series Min beginDate> to <series Max endDate>.
Please see the Google Doc for better formatting and additional info.
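For illustration, the template above could be filled in with something like this Python sketch. It assumes a merged site dict with a `series` list, date strings in ISO `YYYY-MM-DD` form (so plain string comparison orders them correctly), and a `source_org` value looked up separately from the services results; all of these names are hypothetical, and links are left out.

```python
def unique(values):
    """Drop duplicates and empty values while keeping first-seen order."""
    return list(dict.fromkeys(v for v in values if v))


def format_site_summary(site, source_org):
    """Build the plain-text summary for one site from its series records."""
    series = site["series"]
    media = ", ".join(unique(s.get("samplemedium") for s in series))
    # " - " rather than "," because some conceptKeyword values contain commas.
    variables = " - ".join(unique(s.get("conceptKeyword") for s in series))
    begin = min(s["beginDate"] for s in series)  # assumes ISO date strings
    end = max(s["endDate"] for s in series)
    return "\n".join([
        site["location"],
        f"{site['Sitename']}. Observations on {media}.",
        f"Variables: {variables}",
        f"From {source_org} - {site['ServCode']} web service.",
        f"Date range for site: {begin} to {end}.",
    ])
```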
Thanks @aufdenkampe. I just tried out GetSeriesCatalogForBox3 and GetServicesInBox2, and was able to confirm that the data format you suggest is derivable.
Since this set of information isn't a lot, were you thinking this to be in the "list" view or the "detail" view? It could very well fit in a list view.
I was unable to find examples listing multiple conceptKeywords or samplemediums, at least in the samples used in @emiliom's Jupyter notebook. Is separating them with a "–" common practice, or simply an alternative given that "," is included in the value?
And, just to confirm, the site name links should only be generated for those that have ServCode = NWISDV? Other ServCodes I encountered were GLDAS_NOAH, NLDAS_NOAH, and MOPEX.
One potential use of "variable linking" would be to filter results by the clicked variable. Other ideas may present themselves as we progress further along the implementation.
I was unable to find examples listing multiple conceptKeywords or samplemediums, at least in the samples used in @emiliom's Jupyter notebook.
Try this: NWISUV:01474500. It has a suite of water quality sensors (pH, oxygen, turbidity), in addition to discharge, plus rainfall. It should give you a diverse set of results for testing. Plus it's in the Azavea neighborhood: USGS 01474500 Schuylkill River at Philadelphia, PA. You can browse it in "my" Monitor-My-Watershed pilot application:
http://www.wikiwatershed-vs.org/Explorer?action=oiw:fixed_platform:USGS_01474500
If you'd like to examine it using my jupyter notebook, these request parameters worked for me:
bbox1 = (39.9, -75.2, 40.0, -75.1)
keyword = ''
start_date = '01/01/2016'
end_date = '12/31/2016'
Is separating them with a "–" common practice, or simply an alternative given that "," is included in the value?
Neither here nor there. It just seems more obvious than a comma, plus some of the "variable" (conceptKeywords) strings include commas.
And, just to confirm, the site name links should only be generated for those that have ServCode = NWISDV?
Yes, but I believe you should also add NWISUV, and possibly NWISGW (both USGS services). No other services, for now.
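A small Python sketch of that rule, assuming the 'location' code always has the form `ServCode:site_number` (as in NWISDV:14113000) and reusing the waterdata.usgs.gov URL pattern from the example record for all three services. Whether the same URL path is right for each USGS service is an assumption that would need checking.

```python
# USGS services that get a site link; NWISGW is included only tentatively.
USGS_SERVICE_CODES = {"NWISDV", "NWISUV", "NWISGW"}

# URL pattern copied from the example record; assumed to apply to all three services.
USGS_SITE_URL = "https://waterdata.usgs.gov/nwis/uv/?site_no={site_no}"


def site_link(location):
    """Return a USGS site URL for a 'ServCode:site_number' location, else None."""
    serv_code, separator, site_no = location.partition(":")
    if not separator or not site_no or serv_code not in USGS_SERVICE_CODES:
        return None
    return USGS_SITE_URL.format(site_no=site_no)
```

Sites from other services (GLDAS_NOAH, NLDAS_NOAH, MOPEX) simply get no link.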
@kdeloach, I think you (and I) have so far not done much testing on the WDC GetSeriesCatalogForBox3 API query. My "series" queries have focused on GetSeriesCatalogForBox2 (eg, this notebook). But it turns out that GetSeriesCatalogForBox3 has two important advantages:
sampleMedium, dataType and valueType. Out of those, probably only sampleMedium is actually valuable for our likely use cases
I ran a test, and GetSeriesCatalogForBox3 seems to work fine, though probably with the same current bugs and keyword limitations as in GetSeriesCatalogForBox2.
cc @aufdenkampe