bcgov / bcdata

An R package for searching & retrieving data from the B.C. Data Catalogue
https://bcgov.github.io/bcdata
Apache License 2.0

Make head/tail work with larger queries #212

Closed boshek closed 4 years ago

boshek commented 4 years ago

This PR fixes a bug that caused paginated requests to fail with head. For example, this now works:

dh <- bcdc_query_geodata('2af1388e-d5f7-46dc-a6e2-f85415ddbd1c') %>%
    head(3) %>%
    collect()

The fix makes the bcdc_number_wfs_records function look for a count parameter in the query and use it if it exists.
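
A rough sketch of that short-circuit, for illustration only. This is not the bcdata source: number_of_records_sketch, fetch_total_hits, and server_total are hypothetical names, and the 2306 total is a pretend server response.

```r
# Hypothetical stand-in for the record-count logic; not the bcdata source.
# `fetch_total_hits` represents whatever call asks the server for the
# total number of matching records.
number_of_records_sketch <- function(query_list, fetch_total_hits) {
  # If head()/tail() already put a `count` on the query, that is the
  # number of records collect() will return, so use it directly.
  if (!is.null(query_list$count)) {
    return(as.integer(query_list$count))
  }
  as.integer(fetch_total_hits(query_list))
}

# Pretend server response (assumed value, no network call is made)
server_total <- function(query_list) 2306L

full_query <- list(typeNames = "SOME_LAYER")
head_query <- c(full_query, count = 3)

number_of_records_sketch(full_query, server_total)  # falls back to the server total
number_of_records_sketch(head_query, server_total)  # uses the count parameter
```

With the count in place, the pagination logic presumably sees 3 records rather than the full total, which would keep a head request from being treated as a large paginated one.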

However, I am unable to make tail work. This does not currently work:

dh <- bcdc_query_geodata('2af1388e-d5f7-46dc-a6e2-f85415ddbd1c') %>%
    tail(3) %>%
    collect()

As far as I can see, the only difference is that a tail "query" includes a startIndex query parameter. I haven't yet been able to figure this out.
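
To make the head/tail difference concrete, here is a local emulation of count/sortBy/startIndex paging on a plain data frame. It is only a sketch: page_records is a hypothetical helper, nothing here talks to the WFS server, and it assumes startIndex is zero-based.

```r
# Local emulation of WFS-style paging on a data frame; an illustration only.
page_records <- function(df, sort_by, count, start_index = 0) {
  # Paging only makes sense with a stable sort order
  ordered <- df[order(df[[sort_by]]), , drop = FALSE]
  first <- start_index + 1  # startIndex is assumed to be zero-based
  last <- min(start_index + count, nrow(ordered))
  if (first > last) {
    return(ordered[0, , drop = FALSE])
  }
  ordered[first:last, , drop = FALSE]
}

records <- data.frame(OBJECTID = 10:1, value = letters[1:10])

# head: first 3 records, no startIndex needed
head_page <- page_records(records, "OBJECTID", count = 3)

# tail: same count, but skip everything before the last 3 records
tail_page <- page_records(records, "OBJECTID", count = 3,
                          start_index = nrow(records) - 3)

head_page$OBJECTID  # 1 2 3
tail_page$OBJECTID  # 8 9 10
```

The only thing that changes between the two calls is start_index, which matches the observation that a tail query differs from a head query by its startIndex parameter.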

tail does work for a smaller query like this:

bcdc_query_geodata('hydrometric-stations-active-and-discontinued') %>%
  tail(3) %>%
  collect()

boshek commented 4 years ago

There might be some sort of threshold on startIndex, as this works:

river <- bcdc_query_geodata('freshwater-atlas-linear-boundaries') %>% 
  tail(3)

river$query_list$startIndex <- 2000

collect(river)

boshek commented 4 years ago

One option is to modify tail like this:

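tail.bcdc_promise <- function(x, n = 6L, ...) {
  # Total number of records the query would return
  number_of_records <- bcdc_number_wfs_records(x$query_list, x$cli)
  # Column used to give the paginated request a stable sort order
  sorting_col <- pagination_sort_col(x$cols_df)
  # Ask for the last n records by skipping everything before them
  x$query_list <- c(
    x$query_list,
    count = n,
    sortBy = sorting_col,
    startIndex = number_of_records - n
  )

  # Requests with a large startIndex appear to fail, so stop early
  if (x$query_list$startIndex > 2000) {
    stop("tail not available for large records", call. = FALSE)
  }

  x
}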
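
To see how that guard would behave without hitting the server, here is an offline sketch. Everything in it is a stand-in: fake_number_of_records, fake_sort_col, tail_sketch, and make_promise are hypothetical, the 2000 limit is the tentative value from above, and clamping startIndex at zero is an extra assumption not in the proposal.

```r
# Stand-ins for bcdata internals so the sketch runs offline (hypothetical).
fake_number_of_records <- function(query_list, cli) cli$n_records
fake_sort_col <- function(cols_df) cols_df$col_name[1]

tail_sketch <- function(x, n = 6L, max_start_index = 2000) {
  number_of_records <- fake_number_of_records(x$query_list, x$cli)
  # Assumption: clamp at zero so n larger than the record count is harmless
  start_index <- max(number_of_records - n, 0)
  # Check the tentative threshold before touching the query
  if (start_index > max_start_index) {
    stop("tail not available for large records", call. = FALSE)
  }
  x$query_list <- c(
    x$query_list,
    count = n,
    sortBy = fake_sort_col(x$cols_df),
    startIndex = start_index
  )
  x
}

# Minimal fake "promise" carrying only the fields the sketch reads
make_promise <- function(n_records) {
  list(
    query_list = list(typeNames = "SOME_LAYER"),
    cli = list(n_records = n_records),
    cols_df = data.frame(col_name = c("OBJECTID", "NAME"),
                         stringsAsFactors = FALSE)
  )
}

small <- tail_sketch(make_promise(1500), n = 3)
small$query_list$startIndex  # 1497

large <- tryCatch(
  tail_sketch(make_promise(500000), n = 3),
  error = function(e) conditionMessage(e)
)
large  # "tail not available for large records"
```

Note that with a hard limit of 2000, any record with more than 2000 + n features would be rejected, so the exact cutoff would need confirming against the server before relying on it.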

ateucher commented 4 years ago

Interesting. Does the change to bcdc_number_wfs_records have an impact on messages/print methods? I.e., if it short-cuts to the count parameter, when it says "This data set has n records, showing only the first 6", does n change? I can't remember if it's used there...

boshek commented 4 years ago

I don't think so. Only head/tail modify that message, which is the correct behaviour. So it still results in something like this:

R> bcdc_query_geodata('hydrometric-stations-active-and-discontinued') %>% 
   head(3)
Querying 'hydrometric-stations-active-and-discontinued' record
* Using collect() on this object will return 3 features and 17 fields
* At most six rows of the record are printed here
--------------------------------------------------------------------------------
Simple feature collection with 3 features and 17 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 1021765 ymin: 1304767 xmax: 1054676 ymax: 1384676
projected CRS:  NAD83 / BC Albers
# A tibble: 3 x 18
  id    HYDROMETRIC_STA~ STATION_NUMBER FEATURE_CODE STATION_NAME FLOW_TYPE WATERSHED_GROUP~ WATERSHED_ID STREAM_ORDER ARCHIVE_URL REALTIME_URL STATION_OPERATI~
  <chr>            <int> <chr>          <chr>        <chr>        <chr>     <chr>            <chr>        <chr>        <chr>       <chr>        <chr>           
1 WHSE~          2082661 07EA001        CF29300000   FINLAY RIVE~ NATURAL   TOOD             208          NA           https://wa~ NA           DISCONTINUED    
2 WHSE~          2082662 07EA002        CF29300000   KWADACHA RI~ NATURAL   FOXR             53           NA           https://wa~ NA           DISCONTINUED    
3 WHSE~          2082663 07EA004        CF29300000   INGENIKA RI~ NATURAL   INGR             69           NA           https://wa~ https://wat~ ACTIVE-REALTIME 
# ... with 6 more variables: CAPTURE_SCALE <chr>, START_DATE <date>, END_DATE <date>, OBJECTID <int>, SE_ANNO_CAD_DATA <chr>, geometry <POINT [m]>
R> bcdc_query_geodata('hydrometric-stations-active-and-discontinued')
Querying 'hydrometric-stations-active-and-discontinued' record
* Using collect() on this object will return 2306 features and 17 fields
* At most six rows of the record are printed here
--------------------------------------------------------------------------------
Simple feature collection with 6 features and 17 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 955923.4 ymin: 1055014 xmax: 1019837 ymax: 1159110
projected CRS:  NAD83 / BC Albers
# A tibble: 6 x 18
  id    HYDROMETRIC_STA~ STATION_NUMBER FEATURE_CODE STATION_NAME FLOW_TYPE WATERSHED_GROUP~ WATERSHED_ID STREAM_ORDER ARCHIVE_URL REALTIME_URL STATION_OPERATI~
  <chr>            <int> <chr>          <chr>        <chr>        <chr>     <chr>            <chr>        <chr>        <chr>       <chr>        <chr>           
1 WHSE~          2082784 08EC008        CF29300000   MORRISON RI~ NATURAL   BABL             5            NA           https://wa~ NA           DISCONTINUED    
2 WHSE~          2082785 08EC009        CF29300000   FULTON RIVE~ NATURAL   BABL             5            NA           https://wa~ NA           DISCONTINUED    
3 WHSE~          2082786 08EC010        CF29300000   BABINE LAKE~ NATURAL   BABL             5            NA           https://wa~ NA           DISCONTINUED    
4 WHSE~          2082787 08EC011        CF29300000   BABINE LAKE~ NATURAL   BABL             5            NA           https://wa~ NA           DISCONTINUED    
5 WHSE~          2082788 08EC012        CF29300000   BABINE LAKE~ NATURAL   BABL             5            NA           https://wa~ NA           DISCONTINUED    
6 WHSE~          2082789 08EC013        CF29300000   BABINE RIVE~ NATURAL   BABR             6            NA           https://wa~ https://wat~ ACTIVE-REALTIME 
# ... with 6 more variables: CAPTURE_SCALE <chr>, START_DATE <date>, END_DATE <date>, OBJECTID <int>, SE_ANNO_CAD_DATA <chr>, geometry <POINT [m]>

ateucher commented 4 years ago

Ok, great. There is a failing test unrelated to this PR - should we fix it here? (i.e., do you mind doing it? 😜 ). Looks like a column we were selecting before no longer exists in the data... (https://github.com/bcgov/bcdata/pull/212/checks?check_run_id=766044034#step:11:143)

ateucher commented 4 years ago

I think I didn't quite get what you did here before, but I think you nailed it. The tail issue is strange, so I guess your solution works in the interim?

ateucher commented 4 years ago

Regarding the tail issue, I ran the code you posted and it worked for me. So it may be somewhat flaky, but is it possible to leave it as is?

> dh <- bcdc_query_geodata('2af1388e-d5f7-46dc-a6e2-f85415ddbd1c') %>%
     tail(3) %>%
     collect()
Authorizing with your stored API key
> dh
Simple feature collection with 3 features and 17 fields
geometry type:  LINESTRING
dimension:      XY
bbox:           xmin: 1491824 ymin: 518350.2 xmax: 1521546 ymax: 553147.1
CRS:            3005
# A tibble: 3 x 18
  id    LINEAR_FEATURE_… WATERSHED_GROUP… EDGE_TYPE WATERBODY_KEY BLUE_LINE_KEY WATERSHED_KEY
* <chr>            <int>            <int>     <int>         <int>         <int>         <int>
1 WHSE…        832660866               78      1700     329216616     356564053     356564053
2 WHSE…        832660830               78      1700     329217730     356445345     356445345
3 WHSE…        832659811               78      1700     328941657     356566942     356566942
# … with 11 more variables: FWA_WATERSHED_CODE <chr>, LOCAL_WATERSHED_CODE <chr>,
#   WATERSHED_GROUP_CODE <chr>, DOWNSTREAM_ROUTE_MEASURE <chr>, LENGTH_METRE <dbl>,
#   FEATURE_SOURCE <chr>, FEATURE_CODE <chr>, OBJECTID <int>, SE_ANNO_CAD_DATA <chr>,
#   FEATURE_LENGTH_M <dbl>, geometry <LINESTRING [m]>