bcgov / bcdata

An R package for searching & retrieving data from the B.C. Data Catalogue
https://bcgov.github.io/bcdata
Apache License 2.0

[JOSS Review]: Substantial scholarly effort--add proof it's not a "thin client" #248

Closed · twedl closed this issue 3 years ago

twedl commented 3 years ago

I think the package clears the bar for scholarly contribution for https://github.com/openjournals/joss-reviews/issues/2927, but the paper doesn't quite prove it yet. The criteria I'm thinking of is Substantial scholarly effort:

“Minor utility” packages, including “thin” API clients, and single-function packages are not acceptable.

So what makes this a "thick" client? To me, it's the filter-then-collect workflow, which is mentioned in the paper, but you don't emphasize it enough or give details about how much it helps. The two things I want to know / want you to be specific about are:

  1. Does filter-then-collect download the whole dataset, then filter it before giving it to you? You write:

"The data is only downloaded, and loaded into R as an ‘sf’ object, once the query is complete and the user requests the final result".

If I was reading this for the first time, I might not know that it's filtered before downloading. A rephrase might be "Once the query is complete, the final result is downloaded and loaded into R as an 'sf' object." That's also not perfect, but will help answer my question along with the next point.

  2. How much does the pre-filter help? Is it worth the extra effort over just using a 'thin' client and filtering yourself? Here's an example that you could use:
all_ha <- bcdc_query_geodata(ha_record) %>%
  collect()

my_ha <- bcdc_query_geodata(ha_record) %>%
  filter(HLTH_AUTHORITY_NAME == "Interior") %>%
  collect()

paste0(
  "The full data is ", format(object.size(all_ha), units = "auto"),
  ", the filtered data is ", format(object.size(my_ha), units = "auto"),
  ", which is a ", scales::percent(1 - as.numeric(object.size(my_ha)) / as.numeric(object.size(all_ha))),
  " reduction!"
)
#> "The full data is 9.9 Mb, the filtered data is 239.7 Kb, which is a 98% reduction!"

Technically this still doesn't make the point, because you could swap the collect and filter calls and still end up with the same object size reduction even though you did download the entire file. So, maybe use this along with a line that explicitly states that the filtering is done by the API before downloading, or an intermediate example where you call filter and not collect. The filter without collect gives that great message about how it's working.
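A sketch of what that intermediate example could look like (this is a suggestion only; `ha_record` is the same placeholder record ID used in the example above, and the exact printed message depends on the bcdata version):

```r
library(bcdata)
library(dplyr)

# Build the query but don't collect(): nothing is downloaded yet.
# `ha_record` stands in for the health-authority record from the
# earlier example and must be a real B.C. Data Catalogue record ID.
ha_query <- bcdc_query_geodata(ha_record) %>%
  filter(HLTH_AUTHORITY_NAME == "Interior")

# Printing the query object shows the filter that bcdata will send to
# the Web Feature Service, demonstrating that the subsetting happens
# server-side, before any data is transferred to the user's machine.
ha_query
```

Showing that printed output in the paper would make it explicit that the filter is pushed to the API rather than applied after download.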

NB: I switched to the "Interior" region because the coastline is usually what makes the shapefiles so large. The current example in the paper should give around a 70% improvement, which is also great and makes a better looking map, so feel free to keep that.

stephhazlitt commented 3 years ago

This is a great suggestion, thanks @tweed1e.

ateucher commented 3 years ago

Closed by #263