CARDS-Resources / CARDdealr

https://cards-resources.github.io/CARDdealr/
Other
0 stars 1 forks source link

Small changes to superfund #4

Closed seandavi closed 11 months ago

seandavi commented 11 months ago

Removed top-level state filter (wasn't quite working) with the idea that the "expensive" part of the function is the download. So, leave filtering as a post-hoc procedure.

seandavi commented 11 months ago

I simplified below. In short, the lazy eval inside the filter chunk does something odd. Note the difference between the ifelse block vs if...else.

> url = paste0('https://data.epa.gov/efservice/ENVIROFACTS_SITE/JSON')
> state = c('KY', 'KS')
> sf = jsonlite::fromJSON(url)
> dim(sf)
[1] 54723    23
> dplyr::filter(sf, if (!is.null(state)) fk_ref_state_code %in% state else TRUE) |> dim()
[1] 1828   23
> dplyr::filter(sf, ifelse(!is.null(state), fk_ref_state_code %in% state, TRUE)) |> dim()
[1]  0 23
CIOData commented 11 months ago

Okay. Could we put that change in then, or do you think just doing the post-hoc filtering is best? FYI, the same issue applies to the lung cancer screening function. I updated it with that filter earlier today, but accidentally used the wrong column name in doing so.

seandavi commented 11 months ago

May just be me, but I'm a fan of functions that return all the data provided by the API or download--the do one thing and do it well principle. Filtering required to make the data usable and clean is part of that task, but I kinda view filtering by geography as a separate task. In the case of these data, filtering by state may not be appropriate. For example proximity to a superfund site does not respect state borders, so the more appropriate filter (if one were to choose) could be using distance from state population. We can jump on a call to discuss, though, if that would be helpful.

On Tue, Oct 31, 2023 at 11:35 PM CIO Data @.***> wrote:

Okay. Could we put that change in then, or do you think just doing the post-hoc filtering is best? FYI, the same issue applies to the lung cancer screening function. I updated it with that filter earlier today, but accidentally used the wrong column name in doing so.

— Reply to this email directly, view it on GitHub https://github.com/CARDS-Resources/CARDdealr/pull/4#issuecomment-1788355410, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAWSE5CHDEYDMCSUJVXQHTYCG7QVAVCNFSM6AAAAAA6YKKXYKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOBYGM2TKNBRGA . You are receiving this because you authored the thread.Message ID: @.***>

CIOData commented 11 months ago

Sean, I agree in the desire to have all the data. My goal in building a geography filter is to have it default to pulling everything. I also that filtering by state boundaries isn't always optimal, and was initial onboard with having state filters as an accessory wrapper (and still would be), it's just that when I started looking at how some of these pulls will be executed, incorporating it into the base functionality seemed like it would be a time-saver and more pragmatic in some instances. At which point, I figured incorporating it as the base functionality in all of them would then be more consistent, as I hate it when the options for parameters within a package are inconsistent and I have to remind myself of which ones work with which functions.

I'd be happy to jump on a call just to make sure we're on the same page on some things. As I've been building functions, I've started from the principle of directly mimicking how we're pulling things within Cancer InFocus, and would want to make a group decision on if this is best in some instances or whether a more general approach would be preferred. One option I've thought of is making more general pulls and then creating a series of wrapper functions that start with cif_...​ to have tailored Cancer InFocus versions of them.