BiG-CZ / BiG-CZ-Portal

Work towards developing the BiG CZ Data map-based web user interface.
https://portal.bigcz.org

CUAHSI WDC catalog API search enhancements #10

Open emiliom opened 6 years ago

emiliom commented 6 years ago

Goal: Find WDC sites & data series in larger, more useful AOIs.

Revisit choice of catalog search API requests, to explore newer ones that are faster, more flexible and more effective.

Background / research

emiliom commented 6 years ago

We will explore the new web services at CUAHSI: http://hiscentral.cuahsi.org/webservices/hiscentral.asmx

emiliom commented 6 years ago

Notes about how the MMW CUAHSI WDC currently operates:

Background discussions from 2017, during development:

emiliom commented 6 years ago

It'd be really nice if there was a catalog API operation that excluded grid services. OR if one of the existing operations had a parameter that allowed for the exclusion of grid services.
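Until such an operation or parameter exists, one possible workaround is to drop grid-service records on the client once the response has arrived. This does not shorten the response time; it only reduces what the application has to handle afterwards. A minimal sketch, assuming each series record is a dict with a service-code field; the field name `serv_code` and the service codes below are hypothetical placeholders, not values taken from the WDC catalog:

```python
# Hypothetical codes identifying gridded services (placeholders, not real WDC codes).
GRID_SERVICE_CODES = {"EXAMPLE_GRID_A", "EXAMPLE_GRID_B"}


def exclude_grid_series(series_records, grid_codes=GRID_SERVICE_CODES):
    """Return only the records whose service code is not in grid_codes."""
    return [r for r in series_records if r.get("serv_code") not in grid_codes]


# Stand-in records; a real response would be parsed from the catalog API.
records = [
    {"serv_code": "EXAMPLE_GRID_A", "site": "grid-cell-1"},
    {"serv_code": "EXAMPLE_POINT", "site": "gauge-1"},
]
non_grid = exclude_grid_series(records)
```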

emiliom commented 6 years ago

Here are initial results from an assessment I ran today using a Jupyter notebook that I'll post later, along with more details.

Each result is for a search over a 1° x 1° box ("square" in lat-lon coordinates) centered at the center point listed. Search requests were issued with suds-jurko. The last 3 columns show response times (including suds processing time) for 3 APIs:

| Location | Lat-lon center | AOI (km²) | Series count | Non-grid series count | GSCFB2 | GSCFB3 | GSMCD |
|---|---|---|---|---|---|---|---|
| Texas, south of Austin | 30.0, -97.5 | 10,707 | 5,288 | 4,488 | 20.5 s | 53.0 s | 36.9 s |
| Just N of the Schuylkill river near Philly | 40.1, -75.5 | 9,457 | 23,001 | 22,205 | 86.0 s | 181.0 s | 178.0 s |
| 1° N of the above PA/DRB point | 41.1, -75.5 | 9,317 | 16,744 | 15,944 | 60.0 s | 110.0 s | 128.0 s |
| Central Iowa | 42.0, -93.0 | 9,188 | 1,618 | 818 | 6.77 s | 12.4 s | 11.2 s |
| Halfway between Olympia, WA and Portland, OR | 46.5, -123.0 | 8,511 | 9,226 | 8,426 | 44.7 s | 73.0 s | 69.0 s |
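
For reference, the search box and its approximate area can be derived from a center point as sketched below. This is an illustration rather than the notebook's code: it assumes a spherical Earth with a 6371 km mean radius, and the function names are made up for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius, spherical approximation


def box_around(lat, lon, size_deg=1.0):
    """Return (xmin, ymin, xmax, ymax) for a size_deg x size_deg box centered at (lat, lon)."""
    half = size_deg / 2.0
    return (lon - half, lat - half, lon + half, lat + half)


def box_area_km2(xmin, ymin, xmax, ymax):
    """Area of a lat-lon box on a sphere: R^2 * dlon * (sin(ymax) - sin(ymin))."""
    dlon = math.radians(xmax - xmin)
    dsin = math.sin(math.radians(ymax)) - math.sin(math.radians(ymin))
    return EARTH_RADIUS_KM ** 2 * dlon * dsin


bbox = box_around(30.0, -97.5)  # Texas, south of Austin
area = box_area_km2(*bbox)
```

Under these assumptions the computed area for the Texas and Central Iowa center points should land within about 1% of the AOI values in the table; the box shrinks with latitude because meridians converge toward the poles.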

emiliom commented 6 years ago

Just realized that the HIS APIs (or at least GetSeriesMetadataCountOrData) also accept GET and POST requests, not just SOAP. I don't know if that makes any difference in performance, though.
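
A GET request to an .asmx service typically takes the form `<service URL>/<operation>?<query string>`. The sketch below only builds such a URL and sends nothing. The bounding-box parameter names are assumptions, and the operation may require further parameters; the full list should be taken from the service description page.

```python
from urllib.parse import urlencode

BASE_URL = "http://hiscentral.cuahsi.org/webservices/hiscentral.asmx"


def build_get_url(operation, params):
    """Build an HTTP GET URL for one operation of the .asmx service."""
    return f"{BASE_URL}/{operation}?{urlencode(params)}"


# Parameter names are assumptions for illustration; check the service description.
url = build_get_url(
    "GetSeriesMetadataCountOrData",
    {"xmin": -98.0, "xmax": -97.0, "ymin": 29.5, "ymax": 30.5},
)
```

The resulting URL could then be fetched with any HTTP client and timed against the SOAP call to see whether the transport makes a difference.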

emiliom commented 6 years ago

The Jupyter notebook I used for this assessment, CUAHSI_HISCentral_AOI_service_tests.ipynb, can be accessed here. See the descriptions at the top.

This notebook was run once for each AOI listed in the table above. The specific results shown in the notebook snapshot (for the "1° N of the above PA/DRB point" AOI) differ from the ones listed in the table, because the data are dynamic and factors such as CUAHSI server loads and network latency are not constant. The results in the notebook were run today, Monday April 2 at 3:40pm PT, while results in the table above were run on Saturday March 24 (weekend server loads are probably lighter).
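
Because single timings vary with server load and network latency, one way to make comparisons less noisy is to repeat each request a few times and report the median. A small sketch of that idea, not the notebook's code; `fake_request` is a stand-in for a real catalog call:

```python
import statistics
import time


def time_call(func, repeats=3):
    """Call func() `repeats` times; return (median_seconds, list_of_durations)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), durations


def fake_request():
    """Stand-in for a catalog search request (hypothetical)."""
    time.sleep(0.01)


median_s, runs = time_call(fake_request, repeats=3)
```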

emiliom commented 6 years ago

Extra notes I jotted down while composing the MMW issue I just created. Too much detail to include in that issue, but worth capturing here for easy reference.

  1. https://github.com/WikiWatershed/model-my-watershed/pull/2409 tests with 8000 km²; reported problems with suds
    • "I'll take a deeper look soon, but the main reason for going with 1500 for the limit was WDC searches timing out. It may be that we need to increase the timeout for those somewhere, or constrain the results otherwise (like limiting to the last five years for example) to make it finish in time, and pair that with this to be viable. I'll know more once I've taken a look."
      • My comment: The timeout was not on the WDC end. The "timeout" issue was the WDC response time exceeding a limit imposed by our application
    • "I tested this, but WDC results choke on even a 2000 sq km area of interest."
      • My comment: The choking is internal to the application handling of WDC responses
      • My comment: Comment discussed some specific code within the application that was throwing this problem, and potential solutions
  2. https://github.com/WikiWatershed/model-my-watershed/pull/2418 "Increases size of area of interest to 8000. Limits BiG-CZ searches to the last five years by default whenever area of interest is bigger than 1500 km²."
    • "The commit 76e0eca alleviates a CPU and RAM utilization issue where a large number of CUAHSI results would fail to serialize to JSON in order to cache. Now we don't cache the CUAHSI results. However, suds itself chokes at a certain point (around 800 or so results). Thus the time limit."
      • My comment: I did not encounter any "suds" problems, even with up to 23K records, on a laptop that's not top-of-the-line hardware
      • My comment: Was he using the old and deprecated (abandoned) "suds"? That package is known to have problems. Use its active fork, suds-jurko
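
The constraint described in item 2 can be sketched as follows. This is not the MMW implementation; it only illustrates the rule as quoted (8000 km² maximum, last five years by default above 1500 km²), and all names are hypothetical.

```python
from datetime import date

MAX_AOI_KM2 = 8000            # largest AOI accepted, per the PR description
RECENT_ONLY_ABOVE_KM2 = 1500  # above this, restrict the search to recent data
RECENT_YEARS = 5


def search_window(aoi_km2, today=None):
    """Return (begin_date, end_date) for a search; begin_date is None if unrestricted."""
    today = today or date.today()
    if aoi_km2 > MAX_AOI_KM2:
        raise ValueError(f"AOI of {aoi_km2} km2 exceeds the {MAX_AOI_KM2} km2 limit")
    if aoi_km2 <= RECENT_ONLY_ABOVE_KM2:
        return (None, today)
    try:
        begin = today.replace(year=today.year - RECENT_YEARS)
    except ValueError:
        # Feb 29 with a non-leap target year: fall back to Feb 28
        begin = today.replace(year=today.year - RECENT_YEARS, day=28)
    return (begin, today)
```

For example, `search_window(5000, date(2018, 4, 2))` returns a window starting on 2013-04-02, while a 1000 km² AOI is left unrestricted.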

aufdenkampe commented 6 years ago

@emiliom, thanks for all your effort at testing, documenting, and finding likely paths to solve the WDC site search performance issue.