bcgov / bcdata

An R package for searching & retrieving data from the B.C. Data Catalogue
https://bcgov.github.io/bcdata
Apache License 2.0
81 stars 12 forks source link

Filter using string partial matching #227

Open bevingtona opened 3 years ago

bevingtona commented 3 years ago

Problem: I'd like to filter a query but I only know part of the string I am looking for...

bcdata::bcdc_query_geodata("bc-parks-ecological-reserves-and-protected-areas") %>%
  filter(stringr::str_detect(PROTECTED_LANDS_NAME, "SUGARBOWL.*PARK*"))

#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): object 'PROTECTED_LANDS_NAME' not found

Potential solution: I could download/collect only the column that I want to filter by, detect the sting that I am looking for and then use that string to filter my query.

my_var <- bcdc_query_geodata("bc-parks-ecological-reserves-and-protected-areas") %>%
  collect() %>% 
  st_drop_geometry() %>% 
  select(PROTECTED_LANDS_NAME) %>% 
  filter(stringr::str_detect(PROTECTED_LANDS_NAME, "SUGARBOWL.*PARK")) %>% 
  pull()

bcdc_query_geodata("bc-parks-ecological-reserves-and-protected-areas") %>%
  filter(PROTECTED_LANDS_NAME == my_var)

Problem with this solution: I need to download the geometries of the data set to get the column that I am interested in, which is essentially just downloading the entire dataset, so there is no processing speed advantage.

My question: Is there a way to drop the geometry before collecting the data? I suspect that this would be very fast and efficient for filtering large datasets using string detection.

boshek commented 3 years ago

@webgismd do you know if the WFS can return only the attributes and leave the geometry behind?

ateucher commented 3 years ago

I also wonder if you can partially match a string CQL?

bevingtona commented 3 years ago

I also wonder if you can partially match a string CQL?

@ateucher like this? solr_query=title: sugar but no idean about how to integrate this :) https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/search_index/queryTerms.html

boshek commented 3 years ago

So this works: https://openmaps.gov.bc.ca/geo/pub/wfs/?SERVICE=WFS&VERSION=2.0.0&REQUEST=GetFeature&outputFormat=application%2Fjson&typeNames=WHSE_TANTALIS.TA_PARK_ECORES_PA_SVW&SRSNAME=EPSG%3A3005&CQL_FILTER=PROTECTED_LANDS_NAME%20like%20%27%25BEAUMONT%25%27

and dbplyr can translate %like% :

dbplyr::translate_sql(PROTECTED_LANDS_NAME %like% "%BEAUMONT%")
#> <SQL> `PROTECTED_LANDS_NAME` like '%BEAUMONT%'

I think there is a path here.

ateucher commented 3 years ago

This also works: https://openmaps.gov.bc.ca/geo/pub/wfs/?SERVICE=WFS&VERSION=2.0.0&REQUEST=GetFeature&outputFormat=application/json&typeNames=WHSE_TANTALIS.TA_PARK_ECORES_PA_SVW&SRSNAME=EPSG:3005&CQL_FILTER=(strMatches(PROTECTED_LANDS_NAME, 'SUGARBOWL.*') = true)

Using strMatches from here: https://docs.geoserver.org/stable/en/user/filter/function_reference.html#string-functions.

But this doesn't work:

bcdata::bcdc_query_geodata("bc-parks-ecological-reserves-and-protected-areas") %>%
  filter(CQL("strMatches(PROTECTED_LANDS_NAME, 'SUGARBOWL.*') = 'true'"))

I think there's an issue sending as a POST vs GET request...

ateucher commented 3 years ago

Using strMatches we could make a grepl shim if we can figure out how to issue the query via POST. It might need to be added to the url rather than as a query parameter

ateucher commented 3 years ago

I changed the title of this issue since it's not actually possible to get just the attributes of a wfs layer...