bcgov / bcdata

An R package for searching & retrieving data from the B.C. Data Catalogue
https://bcgov.github.io/bcdata
Apache License 2.0
81 stars 12 forks source link

JOSS Review: Usability comments #239

Closed MilesMcBain closed 3 years ago

MilesMcBain commented 3 years ago

Hi,

Congratulations on putting together a very polished and usable interface to your province's open data!

I have a some usability feedback arising from my in-progress JOSS review, and there will probably be more, so let's keep this thread open.

Unless otherwise stated I don't believe any of these issues are blockers for JOSS acceptance.

UF1: Is it possible to either display the file size of resources associated with the record, the proportion of file size downloaded with the progress bar, or prompt for confirmation when a download is about to be initiated above some threshold?

For example:

bcdc_get_data("e2dadc60-292f-4d98-b42b-56ca9e4fe694") lets me know that "This record requires pagination to complete the request." and displays a progress bar while download completes. But it's hard to know if I have made a reasonable request or if it will return in a reasonable amount of time since I do not know how large the data are, or what portion of the file size the progress represents.

UF2: It would be convenient to be able to fetch the accompanying metadata for some data with a bcdc_ function.

Consider the Fire location data: https://catalogue.data.gov.bc.ca/dataset/fire-incident-locations-historical, it has a helpful data description table under "Object Description". An example of how this could be useful: facilitate comparisons between the column data types chosen by R's parsers vs the native data types.

Edit: linking back to main review thread: https://github.com/openjournals/joss-reviews/issues/2927

ateucher commented 3 years ago

Hi @MilesMcBain - great to have you reviewing the paper, and thanks so much for making the time and effort to post comments here as well.

For your first point regarding file sizes/expected download time, that is a really good observation and adding this kind of information would help a lot. I've opened a separate issue for us to track that (#240).

UF2: We do have bcdc_get_record() and bcdc_describe_feature():

library(bcdata)
#> 
#> Attaching package: 'bcdata'
#> The following object is masked from 'package:stats':
#> 
#>     filter

bcdc_get_record("fire-incident-locations-historical")
#> Authorizing with your stored API key
#> Warning: It is advised to use the permanent id ('e2dadc60-292f-4d98-b42b-56ca9e4fe694') rather than the name of the record ('fire-incident-locations-historical') to guard against future name changes.
#> B.C. Data Catalogue Record: Fire Incident Locations - Historical
#> Name: fire-incident-locations-historical (ID:
#>  e2dadc60-292f-4d98-b42b-56ca9e4fe694)
#> Permalink:
#>  https://catalogue.data.gov.bc.ca/dataset/e2dadc60-292f-4d98-b42b-56ca9e4fe694
#> Sector: Natural Resources
#> Licence: Open Government Licence - British Columbia
#> Type: Geographic
#> Last Updated: 2021-01-01
#> Description: Wildfire historic incident point locations for all fire
#>  seasons before the current season.  Supplied through various sources.
#>  Not to be used for legal purposes.  This data includes all incidents
#>  tracked by BC Wildfire Service, ie. actual fires, suspected fires,
#>  nusiance fires, smoke chases, etc.  On April 1 of each year this layer
#>  is updated with the previous fire season's data
#> Resources: (3)
#> # A tibble: 3 x 8
#>   name     url        id      format ext   package_id  location bcdata_available
#>   <chr>    <chr>      <chr>   <chr>  <chr> <chr>       <chr>    <lgl>           
#> 1 WMS get… https://o… eb2fff… wms    ""    e2dadc60-2… bcgeogr… TRUE            
#> 2 KML Net… http://op… 012b8c… kml    "kml" e2dadc60-2… bcgeogr… FALSE           
#> 3 BC Geog… https://c… 6db589… other  ""    e2dadc60-2… bcgeogr… FALSE           
#> You can access the 'Resources' data frame using bcdc_tidy_resources()

bcdc_describe_feature("fire-incident-locations-historical")
#> Authorizing with your stored API key
#> # A tibble: 18 x 4
#>    col_name               sticky remote_col_type       local_col_type
#>    <chr>                  <lgl>  <chr>                 <chr>         
#>  1 id                     FALSE  xsd:string            character     
#>  2 FIRE_NUMBER            FALSE  xsd:string            character     
#>  3 FIRE_YEAR              FALSE  xsd:decimal           numeric       
#>  4 IGNITION_DATE          TRUE   xsd:date              date          
#>  5 FIRE_CAUSE             TRUE   xsd:string            character     
#>  6 FIRE_LABEL             TRUE   xsd:string            character     
#>  7 FIRE_CENTRE            FALSE  xsd:decimal           numeric       
#>  8 ZONE                   FALSE  xsd:decimal           numeric       
#>  9 FIRE_ID                FALSE  xsd:decimal           numeric       
#> 10 FIRE_TYPE              TRUE   xsd:string            character     
#> 11 GEOGRAPHIC_DESCRIPTION TRUE   xsd:string            character     
#> 12 LATITUDE               TRUE   xsd:decimal           numeric       
#> 13 LONGITUDE              TRUE   xsd:decimal           numeric       
#> 14 CURRENT_SIZE           TRUE   xsd:decimal           numeric       
#> 15 FEATURE_CODE           TRUE   xsd:string            character     
#> 16 SHAPE                  TRUE   gml:PointPropertyType sfc geometry  
#> 17 OBJECTID               FALSE  xsd:decimal           numeric       
#> 18 SE_ANNO_CAD_DATA       TRUE   xsd:hexBinary         numeric

However, I see now that doesn't contain as much information as the "Object Description". That information is available via the catalogue API, so shouldn't be difficult to add and would be helpful, as you say.

MilesMcBain commented 3 years ago

UF3: Should you state explicitly in the documentation that the format of a resource needs to be wms to use bcdc_query_geodata?

The help for bcdc_query_geodata says:

Queries features from the B.C. Web Service. The data must be available as a Web Service. ...

And the closest statement I have found is in the Querying Spatial Data with bcdata vignette:

This data is the boundary of each school district. The key information in this metadata is that the layer is a **WFS request (Spatial Data)**---which means it is available through a Web Service. From this we know we can make use of bcdc_query_geodata.

I am 95% sure this corresponds to the wms format based on the content of the Web Service link. If true, this could be made more explicit by changing that para to something like:

This data is the boundary of each school district. The key information in this metadata is that the available resource has format **wms**---which means it is available through a Web Maps Service. From this we know we can make use of bcdc_query_geodata.

ateucher commented 3 years ago

Thanks for this as well - we've struggled to keep those terms consistent across the documentation (and what level of abstraction to expose the user to - i.e., do/should they care what the difference is between WFS and WMS?)

MilesMcBain commented 3 years ago

UF4: Would it be worth implementing distinct.bcdc_promise?

This could resolve the issue mentioned in the spatial query vignette:

For example, there is currently no way to ask the catalogue for all possible unique values of SCHOOL_DISTRICT_NAME. Is that case the data will need to be brought into R and unique values will need to be determined there.

The user could do:

bcdc_query_geodata("78ec5279-4534-49a1-97e8-9d315936f08b") %>%
  distinct(SCHOOL_DISTRICT_NAME) %>%
  collect()

The current option involving pulling the entire SCHOOL_DISTRICT_NAME down to the local session takes a while on my connection, so I think this could be a useful one for tuning queries.

MilesMcBain commented 3 years ago

Regarding UF4 I see now after looking at the correct CQL spec, that DISTINCT is not supported, nor does it appear to have a similar function under a different name.

MilesMcBain commented 3 years ago

UF5: The documentation for object conversion to a bounding box when using the spatial predicates could be improved.

The message the user sees:

The object is too large to perform exact spatial operations using bcdata. To simplify the polygon, a bounding box was drawn around the polygon and all features within the box will be returned. Options include further processing with on the returned object or simplify the object.

does not say what size threshold was used, and how much it was exceeded by. If I would prefer to take the simplification option, assuming it is a genuine option, it would be nice to know what target I am aiming for and how close I am to it.

The first place I would look to demystify this would be the help file for the spatial predicates (cql_geom_predicates), but I think the help is underdone with respect to this functionality. It should probably mention the internal option bcdata.max_geom_pred_size and how the size is calculated. Alternately it could refer users to the efficiently-query-spatial... vignette for this information.

On that point, I noticed in the source that the size is calculated on the entire spatial object, rather than just the geometry as I would have expected. See: https://github.com/bcgov/bcdata/blob/master/R/cql-geom-predicates.R#L114.

My recommendation would be to change this so that the size is calculated just on the data that will be translated to text via your sf::st_as_text call. This will make the code a bit more cumbersome, but s3 methods for size calculation might be used to mitigate that. With this change the sizing of objects is much more predictable, making it a simpler task to bring data below the threshold.

Users could then be told it's a case of utils::object.size(st_geometry(obj)) <= getOption("bcdata.max_geom_pred_size") or a bounding box will be used.

If you wanted to be really kind you could also expose a function that helps the user determine if a spatial object will be converted to a bounding box in their predicates.

If you don't take my recommendation, then I think the size calculation is surprising enough that it needs to be well documented.

A summary of UF5 suggestions:

ateucher commented 3 years ago

Thanks again @MilesMcBain - it is indeed unfortunate that there is no DISTINCT or analogue in the WFS... not sure there's a good way around that.

Those are excellent suggestions around the conversion to bounding box - I again opened a separate issue where we'll tackle that.

ateucher commented 3 years ago

Summary of changes:

ateucher commented 3 years ago

All user feedback issues now addressed