BiG-CZ / BiG-CZ-Portal

Work towards developing the BiG CZ Data map-based web user interface.
https://portal.bigcz.org

Integrating Water Quality Portal (WQP) as new catalog #8

Open emiliom opened 6 years ago

emiliom commented 6 years ago

@jkreft-usgs I'm moving the conversation over to this issue that's more general about wqp access, rather than pywqp proper. Basically in this issue I'm capturing notes and exchanges regarding wqp access & ingest.

And here is your comment from https://github.com/NWQMC/pywqp/issues/6#issuecomment-373115918 (thanks for your offer to help!):

yes, I would say to actually stay away from parsing the wqx xml if you can help it. It is deeply nested and kind of gnarly. If you start a separate project, I would be happy to contribute

I assume that means that the web service gives access to all elements of the wqx xml, but in simpler forms (csv/json)? Sorry, I haven't read the wqp web service doc page yet ...

Pasting my own last comment from that other issue:

I'll start going over the web service docs to get a better feel for this. Fun to hear that someone is using pyoos to hit your service operationally. I'm guessing that's PacIOOS. I probably won't try to use pyoos in the wikiwatershed portal (it comes with overhead that may be too much in that context), but there may be wqp code in there that I can reuse.

cc @aufdenkampe so he's aware of this exchange and notes

aufdenkampe commented 6 years ago

Yes, as I mentioned in https://github.com/NWQMC/pywqp/issues/6#issuecomment-374326731, we want something that's very performant, so I like the idea of avoiding XML. Do you just use JSON or CSV responses or is there a more direct integration with Pandas data frames?

jkreft-usgs commented 6 years ago

I'll drop the same comment over here instead:

The fastest way to get data into a Pandas dataframe would be to stream the tsv into pandas, which is the approach we take to populate a redis cache quickly with a reasonable memory footprint. There is a tsv generator here that we call on a requests object with streaming=True here. That's obviously not using Pandas, but that's the basic approach I would take if I were doing this today.

Something to note is that it is good to pay attention to the total counts that come back in the HTTP headers, since that is a decent basic checksum to ensure that at least all the rows you were expecting to come across the wire did in fact come across the wire.
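A minimal sketch of the streaming-plus-count-check approach described above. It is not the USGS code that the original links pointed to. The header name `Total-Result-Count` and all function names are assumptions for illustration; inspect the headers of a real response before relying on them. Note that in `requests` the keyword is `stream=True`.

```python
import csv
from typing import Dict, Iterable, Iterator, Mapping


def iter_tsv_rows(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row from an iterable of TSV text lines."""
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        yield row


def count_matches_header(n_rows: int, headers: Mapping[str, str],
                         header_name: str = "Total-Result-Count") -> bool:
    """Return True if n_rows equals the total advertised in the headers.

    header_name is an assumption; a missing header counts as a mismatch.
    """
    expected = headers.get(header_name)
    return expected is not None and int(expected) == n_rows


def fetch_and_check(url: str, params: Mapping[str, str]) -> bool:
    """Stream an uncompressed TSV response and verify its row count."""
    import requests  # third-party; imported lazily so the parser works without it

    with requests.get(url, params=params, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        # iter_lines() yields bytes, so decode each line before parsing.
        lines = (raw.decode("utf-8") for raw in resp.iter_lines())
        n_rows = sum(1 for _ in iter_tsv_rows(lines))
        return count_matches_header(n_rows, resp.headers)
```

Counting rows as they stream keeps memory use flat; the same generator could feed a pandas data frame in chunks instead.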

jkreft-usgs commented 6 years ago

The other thing to note- please do not spin up a multi-threaded download, it is not at all hard to DOS the WQP if you are trying to get a ton of data all at once.

emiliom commented 6 years ago

The other thing to note- please do not spin up a multi-threaded download, it is not at all hard to DOS the WQP if you are trying to get a ton of data all at once.

We'll keep that in mind. Thanks also for the additional info about tsv and streaming access.
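One way to honor that request is to issue requests strictly one at a time with a pause in between. The sketch below is only an illustration: the function name, the injected `fetch` callable, and the one-second default pause are assumptions, not values recommended by the WQP team.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def fetch_sequentially(urls: Iterable[str], fetch: Callable[[str], T],
                       pause_s: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> List[T]:
    """Call fetch(url) for each URL in order, pausing between requests.

    No threads or async tasks are used, so at most one request is in
    flight at any time. pause_s is an arbitrary illustrative default.
    """
    results: List[T] = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(pause_s)  # wait before every request except the first
        results.append(fetch(url))
    return results
```

Passing `sleep` as a parameter makes the pacing easy to test without actually waiting.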

jkreft-usgs commented 6 years ago

You might look into some tooling that (bizarrely enough) an old roommate of mine who is now working at Anaconda is doing around streaming tabular data into Pandas: http://matthewrocklin.com/blog/work/2017/10/16/streaming-dataframes-1

emiliom commented 6 years ago

I've put together a simple Jupyter notebook demonstrating GET requests for the WQP Stations endpoint, and browsing / examining the results. It's here

For convenience, comparison and coherence, I reused the 1 deg x 1 deg AOI box scheme from #10. @aufdenkampe, you may be tickled to see that there's Stroud data in this PA/NJ/DE AOI, in the WQP!

emiliom commented 6 years ago

I forgot to ping @lsetiawan

emiliom commented 6 years ago

One more thing, mainly geared to @lsetiawan for now, but ultimately for discussion later with the rest of you (@aufdenkampe and @jkreft-usgs): The WQP APIs are like the CUAHSI HIS catalog APIs (and unlike the CINERGI and HydroShare APIs) in that -- as far as I can tell -- they don't have the capability to search on free text. They have a bunch of vocabularies ("domains") that can be searched on, but using them in the search would run us into the same kind of problems as with the CUAHSI API. So, the initial implementation that @lsetiawan and I are working on will ignore the text entered into the search box on the WikiWatershed App. The parameters issued to the Stations API endpoint will be only the ones shown in cell 8 of the Jupyter notebook (and mimeType and sorted will be fixed, so only bBox will change).
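A hedged sketch of what such a request could look like, with mimeType and sorted fixed and only bBox varying. The function name is hypothetical, the base URL is the `/data/Station/search` form that appears later in this thread, and the bBox coordinate order (west, south, east, north) is an assumption to verify against the WQP web services guide.

```python
from urllib.parse import urlencode

STATION_URL = "https://www.waterqualitydata.us/data/Station/search"


def station_search_url(south, west, north, east, mime_type="csv", sorted_="no"):
    """Build a Station search URL in which only bBox changes per request."""
    params = {
        # Assumed order: west, south, east, north (decimal degrees).
        "bBox": f"{west},{south},{east},{north}",
        "mimeType": mime_type,
        "sorted": sorted_,
    }
    # Keep commas literal so the bBox value stays readable in the URL.
    return f"{STATION_URL}?{urlencode(params, safe=',')}"


# The 1 deg x 1 deg AOI box, given here as (south, west, north, east).
print(station_search_url(39.6, -76.0, 40.6, -75.0))
```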

lsetiawan commented 6 years ago

I was able to use the ipynb by @emiliom to integrate the WQP Catalog metadata for each station into the portal on my local dev instance. See below for screenshots.

Initial crack at integrating WQP to BiGCZ Portal.

![screenshot from 2018-04-06 18-26-05](https://user-images.githubusercontent.com/17802172/38449886-fb2eaeda-39c9-11e8-87ce-778ec17ecda5.png)

![screenshot from 2018-04-06 18-25-37](https://user-images.githubusercontent.com/17802172/38449887-fb4568d2-39c9-11e8-89b1-bbb58c18aa50.png)

jkreft-usgs commented 6 years ago

You are correct that WQP does not have a free text search.

A couple recommendations/ideas/requests-

emiliom commented 6 years ago

Thanks @jkreft-usgs. We'll make sure to use minresults=1 (@lsetiawan, please add it in your tests). BTW, that parameter is not listed in the web service guide. Can you say more about what the "extraneous USGS sites" are? That is, what exactly are we excluding?

Regarding the geospatial query requests, I can tell you that all requests by the application will be geospatial queries! However, it'll be quite a while (2 months?) before we're ready to go live, so until then all queries will simply be for tests and development. We'll keep you informed every step of the way, as we make significant progress and have more questions.

Finally: @lsetiawan thanks for sharing your progress via this issue!! That's really awesome.

I'm on vacation for the next week, so I won't be commenting for a while.

jkreft-usgs commented 6 years ago

ok, so there is no chance for the app to be hydrologically aware and use something like HUC instead? The sites that are eliminated with minresults are sites that have no data associated with them.

lsetiawan commented 6 years ago

ok, so there is no chance for the app to be hydrologically aware and use something like HUC instead?

I think that this is possible since the app is able to use HUC boundaries to do its modeling. So if somehow the frontend can spit out those HUC IDs to the backend, we should be able to query WQP using the huc query parameter.

The sites that are eliminated with minresults are sites that have no data associated with them.

Thanks @jkreft-usgs! Wow, it really filtered a lot of sites! :smile:

aufdenkampe commented 6 years ago

I think it's possible that we could identify all the HUC12s that overlap with a WikiWatershed web app user's Area of Interest, fetch those, then do spatial cropping/filtering on our side similar to what we've done for our WDC spatial searches.

However, we do not presently have that capability within the WikiWatershed API: https://app.wikiwatershed.org/api/docs/, although we've been talking about adding such functionality for quite some time.

lsetiawan commented 6 years ago

I think it's possible that we could identify all the HUC12s that overlap with a WikiWatershed web app user's Area of Interest, fetch those, then do spatial cropping/filtering on our side similar to what we've done for our WDC spatial searches.

@aufdenkampe Hmm... that's interesting to do. I think it would be really cool.

However, we do not presently have that capability within the WikiWatershed API: https://app.wikiwatershed.org/api/docs/, although we've been talking about adding such functionality for quite some time.

I see how Azavea is getting their Huc information. I now have a simple way to get the huc id! Now I think I will try to do the spatial cropping/filtering you're talking about, and see if I can implement this. Stay tuned! :smile:

lsetiawan commented 6 years ago

So this is the query I have now using huc codes rather than bbox as @jkreft-usgs suggested: https://www.waterqualitydata.us/data/Station/search?mimeType=csv&sorted=no&minresults=1&zip=yes&huc=020402020302;020402020401;020402020305;020402020403;020402020405;020402031008

That's from a rectangle-ish AOI that I drew.

screenshot from 2018-04-11 09-49-19
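For reference, a small sketch of how the HUC query above could be assembled from a list of HUC12 codes. The function name is hypothetical; the parameter names and values are the ones in the URL above.

```python
from urllib.parse import urlencode

STATION_URL = "https://www.waterqualitydata.us/data/Station/search"


def huc_station_url(huc_ids):
    """Build a Station search URL for a list of HUC codes."""
    params = {
        "mimeType": "csv",
        "sorted": "no",
        "minresults": "1",
        "zip": "yes",
        # Multiple HUC codes are joined with semicolons, as in the query above.
        "huc": ";".join(huc_ids),
    }
    # Keep semicolons literal rather than percent-encoding them.
    return f"{STATION_URL}?{urlencode(params, safe=';')}"


hucs = ["020402020302", "020402020401", "020402020305",
        "020402020403", "020402020405", "020402031008"]
print(huc_station_url(hucs))
```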

lsetiawan commented 6 years ago

One question I have for @jkreft-usgs is about the root search URL: should it be https://www.waterqualitydata.us/data/Station/search or https://www.waterqualitydata.us/Station/search?

jkreft-usgs commented 6 years ago

@lsetiawan Both work, but /data is more future focused- we are getting ourselves out of URL mapping difficulties as we keep adding more endpoints

emiliom commented 6 years ago

Thanks @lsetiawan and @jkreft-usgs for the work on HUC-based searching. Cool!

@lsetiawan, keep in mind that we'll still need bbox-based searching, b/c the WikiWatershed app provides user options for AOI polygons: HUCs, free-draw polygon vertices, and other polygon types. So, HUC searching is only a special subset, though it might be one of the most common ones. Handling both types may require a complex logic to issue slightly different search requests depending on whether or not the AOI was selected based on HUCs.

Personally, I'd much rather focus on refining the WQP search results, and only then go back to HUC search customization, building on what you've already done. I'll get back to the former when I'm back, but most likely not until the week after next.

lsetiawan commented 6 years ago

So, HUC searching is only a special subset, though it might be one of the most common ones. Handling both types may require a complex logic to issue slightly different search requests depending on whether or not the AOI was selected based on HUCs.

It's doing both kinds of search... If you use a HUC, it'll search only on that HUC. And if you use free-draw or another polygon type, it will search on all the HUCs that intersect that AOI. But at the end it gets filtered and you only get locations within the AOI. I think what @jkreft-usgs said previously is that passing HUC IDs is better than doing an actual bbox geospatial search right now.
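The final filtering step described above (keep only locations within the AOI) can be sketched with a plain ray-casting point-in-polygon test. This is an illustration, not the portal's actual code: the function names and the `lon`/`lat` keys are assumptions, coordinates are treated as planar, and a production implementation would more likely use a GIS library.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # (lon, lat)


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test; polygon is a sequence of (lon, lat) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def filter_to_aoi(stations: List[Dict], aoi: Sequence[Point]) -> List[Dict]:
    """Keep stations (dicts with 'lon' and 'lat') that fall inside the AOI."""
    return [s for s in stations if point_in_polygon((s["lon"], s["lat"]), aoi)]
```

With this in place, the HUC query can return a superset of stations and the AOI polygon decides what is shown.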

emiliom commented 6 years ago

@jkreft-usgs I've run some tests for the Result and Activity API, to start familiarizing myself with them. I have a couple of questions:

Thanks!

jkreft-usgs commented 6 years ago

Good questions. You can see the different data elements in the documentation. https://www.waterqualitydata.us/portal_userguide/ The default result output does indeed include many activity elements, because until recently, there were only two endpoints- result and station, and for result to make any sense, it needed sampling activity data. However, there is a result output that is just result information, which you access with dataProfile=narrowResult. https://www.waterqualitydata.us/data/Result/search?statecode=US%3A55&countycode=US%3A55%3A025&siteType=Stream&mimeType=csv&zip=yes&sorted=no&dataProfile=narrowResult

A service that we are working on for this year will be a summary service, which will hopefully help with this exact use case. Right now, if you want to get an overview of data at a site, you really have to do quite a lot of crunching first.

emiliom commented 6 years ago

Thanks @jkreft-usgs.

I tried dataProfile=narrowResult. While it did eliminate several Activity attributes in the responses, it actually added many new attributes. The response now has 78 attributes instead of 63! Based on a very quick inspection, it looks like it added a number of biological attributes. Also, the narrowResult option is not described or even mentioned in the WQP API guide; the only dataProfile parameter described there is biological.

So, it looks like using this option does more damage than not using it :disappointed:

A service that we are working on for this year will be a summary service, which will hopefully help with this exact use case.

This would be fantastic, and is exactly what we're looking for!

emiliom commented 6 years ago

Follow up thoughts ... Maybe dataProfile=narrowResult is just not working as intended right now? Its current behavior is a mix of what you described (ie, reduced "activity" information) plus dataProfile=biological. If that's the case, fixing this problem in the near term would be great!

Still, it looks like some of the "activity" information that's dropped would be very helpful for a summary/discovery service. ActivityMediaName (eg, Water) and ActivityMediaSubdivisionName (eg, Surface Water) come to mind.

emiliom commented 6 years ago

Some follow-up results.

I ran a Result API request using a bBox (bounding box) criterion instead of the individual station (siteid) criterion we'd been using so far. The bounding box is the same 1 deg x 1 deg box we've been using for Station API requests: BBOX: (39.6, -76.0, 40.6, -75.0). AOI area: 9457 km2. Other parameters used are: mimeType=tsv, sorted=no, zip=yes

The request returned 1,570,698 records. Getting the results took 8 minutes. Converting to a pandas data frame (which includes unzipping) took another minute, maybe less.

For reference, the Station API bbox request for the same bBox returned 4,995 records and took 20 seconds.

The next thing to try, to speed up the response, is @jkreft-usgs 's recommendations:

The fastest way to get data into a Pandas dataframe would be to stream the tsv into pandas, which is the approach we take to populate a redis cache quickly with a reasonable memory footprint. There is a tsv generator here that we call on a requests object with streaming=True here. That's obviously not using Pandas, but that's the basic approach I would take if I were doing this today.

You might look into some tooling that (bizarrely enough) an old roommate of mine who is now working at Anaconda is doing around streaming tabular data into Pandas: http://matthewrocklin.com/blog/work/2017/10/16/streaming-dataframes-1

Still, while that will possibly benefit users by not having to wait a long time before some results are shown, I would imagine it won't dramatically cut down the total response time.

It's also likely that HUC-based (as opposed to bbox-based) searches will be much faster, based on what Jim has told us. But given that HUC searching is only one of several spatial search options in the Wikiwatershed App, this is not a great solution. Still, we should do some benchmarks.

@jkreft-usgs, have you had a chance to look into what dataProfile=narrowResult is doing (what I reported on a couple of days ago)? Hopefully you'll find that it really is misbehaving, and can fix it in the near term! And hopefully that will lead to a noticeable speed up in the Result response.

Regardless, if we want the richness of metadata available in the Result API (and not available in the Station API), we'll likely have to impose a small AOI threshold, likely on the order of 1,500 km2 or less. The alternative is to redesign how the request is issued and processed on our App end, with significant impacts on the filtering capability we can offer. (plus a level of effort involved that may be beyond our near/mid-term grasp)
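A rough sketch of how such an AOI threshold could be checked for a bounding box before issuing a Result request. The function names are hypothetical, the Earth is treated as a sphere of mean radius 6371 km, and the 1,500 km2 default simply mirrors the figure floated above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation


def bbox_area_km2(south, west, north, east):
    """Area of a latitude/longitude box on a sphere, in square kilometres."""
    lat_term = math.sin(math.radians(north)) - math.sin(math.radians(south))
    lon_term = math.radians(east - west)
    return EARTH_RADIUS_KM ** 2 * lat_term * lon_term


def allows_result_query(south, west, north, east, max_area_km2=1500.0):
    """Return True if the box is small enough for a Result API request."""
    return bbox_area_km2(south, west, north, east) <= max_area_km2


# The 1 deg x 1 deg test box; its area should land near the 9457 km2 quoted above.
print(round(bbox_area_km2(39.6, -76.0, 40.6, -75.0)))
print(allows_result_query(39.6, -76.0, 40.6, -75.0))
```

Free-draw polygons would need a true polygon area, but the bounding box gives a cheap upper bound.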

emiliom commented 6 years ago

A couple of notes to self (and Don) for reference and use later on.

Linking to WQP granular resources and information

Recent, nice publication about WQP

Read, E. K., Carr, L., De Cicco, L., Dugan, H. A., Hanson, P. C., Hart, J. A., Kreft, J., Read, J., Winslow, L. A. (2017). Water quality data for national-scale aquatic research: The Water Quality Portal. Water Resources Research, 53(2), 1735–1745. https://doi.org/10.1002/2016WR019993

aufdenkampe commented 6 years ago

@emiliom, thanks for sharing those "notes to self". They're helpful for me to start exploring WQP and its metadata.

emiliom commented 6 years ago

@aufdenkampe: glad you found that useful. @lsetiawan is already working on implementing that new information into the detailed results view.

An update on Result API responses. I ran the same queries I mentioned yesterday, except using an AOI a quarter of the size (0.5° x 0.5°). The new AOI is BBOX: (39.85, -75.75, 40.35, -75.25), AOI area: 2364 km2 and shares the same center coordinate as the previous one.

The request returned 357,176 records. Getting the results took a bit over 5 minutes (plus the time taken to convert to a pandas data frame, which includes unzipping). Compared to the previous 1° x 1° request, this AOI that's a quarter of the size returned a fifth of the records but took half the time, not 1/4 or 1/5 of the time! Darn.

The Station API bbox request for the same bBox returned 1,408 records and took 11 seconds.
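To make the comparison between the two Result runs concrete, here is the arithmetic as a short script. The second run's duration is taken as exactly 5 minutes, which is an assumption: the reported time was "a bit over 5 minutes", so its throughput is if anything slightly lower than computed here.

```python
# Figures reported in this thread for the two Result API bbox requests.
big = {"area_km2": 9457, "records": 1_570_698, "seconds": 8 * 60}
small = {"area_km2": 2364, "records": 357_176, "seconds": 5 * 60}  # lower bound

area_ratio = small["area_km2"] / big["area_km2"]
record_ratio = small["records"] / big["records"]
time_ratio = small["seconds"] / big["seconds"]

# Records delivered per second of wall-clock time for each request.
big_rate = big["records"] / big["seconds"]
small_rate = small["records"] / small["seconds"]

print(f"area ratio:    {area_ratio:.2f}")
print(f"record ratio:  {record_ratio:.3f}")
print(f"time ratio:    {time_ratio:.3f}")
print(f"records/s:     {big_rate:.0f} (1 deg) vs {small_rate:.0f} (0.5 deg)")
```

The smaller request delivers far fewer records per second, which suggests a sizeable fixed cost per request rather than time scaling with the number of records.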

jkreft-usgs commented 6 years ago

@emiliom It looks like my response was lost in too many tabs! The narrowResult data profile is working as expected, it is just different from what you might be expecting. You can see the different content of the data profiles here: https://www.waterqualitydata.us/portal_userguide/

Essentially the narrowResult data profile is named that because it is almost exclusively content from the "Result" part of the WQX data model, whereas the "default" was a mix of Result and activity. Now that we serve Activity information separately, it makes more sense to just serve that information separately, at least for some use cases...

Also, it looks like you might find the domain value services useful, which you can see here:

https://www.waterqualitydata.us/webservices_documentation/#WQPWebServicesGuide-Domain

emiliom commented 6 years ago

Thanks for the follow-up @jkreft-usgs (I know first hand about the situation of too many tabs and github issue responses not being finalized!)

Regarding narrowResult, I went through the documentation and my own API tests, and unless I'm missing something, the set of attributes returned are not simply a subset of the attributes returned by the unqualified Result API. It's really a subset (with most Activity attributes removed) of the Biological profile. You can confirm this easily by seeing that a bunch of "TaxonomicDetailsCitation*" attributes are returned by both the narrowResult and Biological profiles, but not by the unqualified response. Because of this, narrowResult actually returns more attributes than the unqualified, default Result request (78 attributes instead of 63).
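The comparison described above boils down to set differences on the response headers. A minimal sketch follows; the function name is hypothetical, and the column lists are tiny illustrative stand-ins, not the real 63- and 78-attribute profiles.

```python
def compare_columns(default_cols, profile_cols):
    """Report which columns a data profile drops, adds, and shares."""
    default_set, profile_set = set(default_cols), set(profile_cols)
    return {
        "dropped": sorted(default_set - profile_set),
        "added": sorted(profile_set - default_set),
        "shared": sorted(default_set & profile_set),
    }


# Illustrative stand-ins only; real header rows come from the API responses.
default_cols = ["ActivityMediaName", "ActivityMediaSubdivisionName",
                "ExampleResultColumn"]
narrow_cols = ["ExampleResultColumn", "TaxonomicDetailsCitationExample"]

diff = compare_columns(default_cols, narrow_cols)
print(diff["dropped"])  # Activity columns missing from the narrower profile
print(diff["added"])    # columns that only the narrower profile returns
```

In practice the two lists would be the header rows of a default Result response and a dataProfile=narrowResult response for the same query.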

jkreft-usgs commented 6 years ago

yes indeed. It might be easier to understand this in a call, but here goes. WQX has a number of top-level domains in its data model

We are working toward having endpoints for all of these domains (along with additional subdomains) in WQP. However, WQP used to try to do everything with only 2 endpoints, Result and Station, and everything needed to be crammed into those two endpoints. Station had some stuff from Organization and Monitoring Location, and Result had a mix of key elements from Activity and Result: enough to describe most physical and chemical samples, but not biological samples. We first added the "biological" data profile to WQP, which added a few dozen additional columns to the Result endpoint; it is basically the kitchen sink. Then we added the "Activity" endpoint, which meant that we didn't need to serve the activity data over and over again, and could drop a whole pile of columns from Result. Hence "narrowResult", which has almost exclusively columns from the Result domain. However, to make real sense of that narrowResult data, you do need to also get activity data.

Clear as mud?

The next step is to deprecate the existing data profiles in favor of ones that will best support the user community, and also add more effective summary services.
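Since narrowResult rows need activity data to make sense, a client would have to join the two responses itself. Below is a minimal sketch of such a join in plain Python. The function name is hypothetical, and using `ActivityIdentifier` as the shared key is an assumption to check against the actual column names in both responses; the sample rows are made up.

```python
from typing import Dict, Iterable, Iterator


def attach_activity(results: Iterable[Dict[str, str]],
                    activities: Iterable[Dict[str, str]],
                    key: str = "ActivityIdentifier") -> Iterator[Dict[str, str]]:
    """Yield each result row merged with its matching activity row.

    key is assumed to be present in both tables. Results with no matching
    activity are passed through unchanged.
    """
    by_id = {a[key]: a for a in activities}
    for r in results:
        merged = dict(by_id.get(r[key], {}))
        merged.update(r)  # result values win on column-name clashes
        yield merged


# Made-up rows for illustration only.
activities = [{"ActivityIdentifier": "A1", "ActivityMediaName": "Water"}]
results = [{"ActivityIdentifier": "A1", "ExampleResultColumn": "7.2"},
           {"ActivityIdentifier": "A2", "ExampleResultColumn": "3.1"}]
joined = list(attach_activity(results, activities))
```

The same join could be done with a pandas merge once both responses are loaded as data frames.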

lsetiawan commented 6 years ago

I have updated the details page of the USGS WQP Catalog. Users are able to go to the URLs mentioned in https://github.com/BiG-CZ/BiG-CZ-Portal/issues/8#issuecomment-385904756 and also download the sample results at the click of a button.

screenshot from 2018-05-02 15-23-08

emiliom commented 6 years ago

@lsetiawan thanks for the screenshot and the enhancements you've implemented!

@jkreft-usgs thanks for the explanations and background. I think we're on the same page now. The historical sequence may be clear as mud, but the current situation is clear. Still, it does mean that "narrowResult" doesn't offer much of a performance/payload advantage relative to the current default, unqualified Result request (except in having the biological results already in, if one wanted those gory details).

I'll chew on this early next week, to get a better sense of near-term options for the Wikiwatershed App.