DOI-USGS / dataRetrieval

This R package is designed to obtain USGS or EPA water quality sample data, streamflow data, and metadata directly from web services.
https://doi-usgs.github.io/dataRetrieval/

Error when combining datasets: incompatible dataTypes #599

Closed lstanish-usgs closed 2 years ago

lstanish-usgs commented 2 years ago

Describe the bug
When running a script that loops through a list of sites and combines the output, the loop unexpectedly ends with the following error: [screenshot of a bind_rows() incompatible-type error]

The site ID with the incompatible data type was "USGS-01374580". For the previous 287 site IDs the script looped through, the field 'SampleCollectionMethod.MethodIdentifier' was read in as character; for this site, however, the field is populated with numeric values, which caused R to read it in as 'double'. This type mismatch caused the combining to fail.
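The failure can be reproduced in isolation (a minimal sketch, assuming dplyr; the column name is taken from the report above, the values are made up): bind_rows() refuses to stack a character column on top of a double column.

```r
library(dplyr)

# Two toy result sets: one site returned text method identifiers,
# another returned purely numeric ones
a <- data.frame(SampleCollectionMethod.MethodIdentifier = "USGS")
b <- data.frame(SampleCollectionMethod.MethodIdentifier = 30.0)

# bind_rows() errors because <character> and <double> can't be combined;
# capture the error message instead of stopping
result <- tryCatch(bind_rows(a, b), error = function(e) conditionMessage(e))
```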

Abbreviated code:

```r
library(dplyr)

locIdsChar <- as.vector(datLocsCompiled$locIDs) # vector of site IDs to download
wqp_data <- NULL
log_data <- NULL
for (y in seq_along(locIdsChar)) {
  site <- locIdsChar[y]
  print(paste0("Starting ", site))
  tmpdat <- dataRes(site) # internal function that calls readWQPdata() for the designated site
  # Make sure there is actually data for the site (should be a data.frame).
  # If not, don't attempt to merge; log the site as unable to join.
  if (is.data.frame(tmpdat)) {
    wqp_data <- bind_rows(wqp_data, tmpdat)
  } else {
    log_data <- rbind(log_data, paste("unable to join data for site", site))
  }
}
```

Expected behavior
The datasets should merge without errors.

Session Info:

```
version  R version 4.1.0 (2021-05-18)
os       Windows 10 x64 (build 19042)
system   x86_64, mingw32
ui       RStudio
language (EN)
collate  English_United States.1252
ctype    English_United States.1252
tz       America/Denver
date     2022-01-17
rstudio  1.4.1717 Juliet Rose (desktop)
pandoc   NA
```
ldecicco-USGS commented 2 years ago

When something like this gets reported to me, here's my thought process:

  1. What are the repercussions of forcing the column to some specific type? (for this bug... nothing too obvious)
  2. What are the repercussions of letting R choose the best type based on the data? (for this bug... the merging issues described above)

Sometimes allowing the column type to change depending on the data is OK. The most basic example is ResultMeasureValue. In many cases, this column is fully numeric, and people want to work with it as a numeric. So when the values are all valid numbers, we let the column be numeric, but if there's any text in the column, it comes back as a character vector. A user who wants to run a loop like the one above will need to convert that column to character or numeric (depending on how they want to deal with the non-numeric information).
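That user-side conversion might look like the following sketch, assuming dplyr and two made-up WQP result chunks (one fully numeric, one with censoring text in ResultMeasureValue):

```r
library(dplyr)

# Hypothetical WQP result chunks: one fully numeric, one containing text
chunk1 <- data.frame(ResultMeasureValue = c(1.2, 3.4))
chunk2 <- data.frame(ResultMeasureValue = c("0.5", "*Non-detect"))

# Option 1: force character everywhere, so bind_rows() always succeeds
combined <- bind_rows(
  mutate(chunk1, ResultMeasureValue = as.character(ResultMeasureValue)),
  mutate(chunk2, ResultMeasureValue = as.character(ResultMeasureValue))
)

# Option 2: force numeric afterwards, accepting NA for non-numeric entries
numeric_vals <- suppressWarnings(as.numeric(combined$ResultMeasureValue))
```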

ANYWAY....I think in this case, setting a "MethodIdentifier" to a character makes sense. I believe we could even extend it to all "Identifier" columns:

ResultAnalyticalMethod.MethodIdentifier, 
ResultAnalyticalMethod.MethodIdentifierContext,
ResultStatusIdentifier,
SampleCollectionMethod.MethodIdentifier,
SampleCollectionMethod.MethodIdentifierContext,
MonitoringLocationIdentifier,
ProjectIdentifier,
ActivityIdentifier
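
One way to pin those columns, assuming the raw WQP response is parsed with readr (the spec name here is hypothetical), is a col_types specification that hard-codes the Identifier columns as character and leaves the rest to the guesser:

```r
library(readr)

# Hypothetical column spec: fix every Identifier column as character,
# let readr guess the remaining column types
wqp_types <- cols(
  .default = col_guess(),
  ResultAnalyticalMethod.MethodIdentifier = col_character(),
  ResultAnalyticalMethod.MethodIdentifierContext = col_character(),
  ResultStatusIdentifier = col_character(),
  SampleCollectionMethod.MethodIdentifier = col_character(),
  SampleCollectionMethod.MethodIdentifierContext = col_character(),
  MonitoringLocationIdentifier = col_character(),
  ProjectIdentifier = col_character(),
  ActivityIdentifier = col_character()
)

# usage: read_tsv(raw_wqp_response, col_types = wqp_types)
```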

Because the WQP output often has columns that are completely full of NA's, it's been important to hard-code some of those column types (otherwise they'd be set to logical, which is usually the wrong choice).

Why don't we just hard-code all the columns, you might ask? Mostly I worry about things changing (i.e., new or updated profiles or output formats) and dataRetrieval not keeping up. We've discussed that there might be a way to use a service that WQP could (?) put out that would set the column types dynamically. That would be nice if it changed with changing WQP output profiles, for example. That's something we can explore later.

lstanish-usgs commented 2 years ago

I like the idea of automatically setting any field whose name includes the text string 'Identifier' to character, which can even be done without hard-coding the field names (e.g., grep("(?i)identifier", names(dat)), or something), since those fields are very likely always going to contain data of type 'character'. Ideally, we could obtain a reference from EPA containing data types for every field; I'll do some research. I agree that hard-coding all of the columns makes the code less adaptable and will require greater upkeep, something that would be good to avoid if possible!
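As a user-side workaround along the same lines (a sketch, assuming dplyr; the helper name and sample frames are made up), every column whose name contains "identifier" can be coerced to character before combining:

```r
library(dplyr)

to_character_ids <- function(dat) {
  # tidyselect's matches() is case-insensitive by default, so this catches
  # every column whose name contains "identifier"
  mutate(dat, across(matches("identifier"), as.character))
}

# Toy frames mimicking two sites with mismatched identifier types
a <- data.frame(SampleCollectionMethod.MethodIdentifier = "USGS", x = 1)
b <- data.frame(SampleCollectionMethod.MethodIdentifier = 30.0, x = 2)

combined <- bind_rows(to_character_ids(a), to_character_ids(b))
```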

nrlottig commented 2 years ago

I know that Corinna Gries at UW/EDI would argue that the column types defined in the database should be the types of the data returned to the user. For example, if a data value is allowed to be a character in the database, the field should be returned as a character by any R code that reads the data. Letting R decide can be iffy as well because, depending on how it looks at the data, it may only inspect a few tens of rows to decide what type a column is. I think the safest options are to return a table of all-character columns and tell the user so, or to ensure that only numeric values exist in data columns that users expect to be numeric.

ldecicco-USGS commented 2 years ago

Agreed that if we know the types, we should use them. Until recently, a dynamic way to query what the column types should be didn't exist. It was confounded by the rapid proliferation of dataProfiles (adding columns that weren't defined by previous profiles). We chose at the time to set the columns whose correct types we knew, and let R choose when the type wasn't known.

We did bump up the number of rows used to guess the column type to the maximum for the NWIS calls. I'm only now noticing that that update didn't extend to the WQP calls; I'll include it in the next release. In general, the majority of column types are already hard-coded, so I don't expect that to change too much.
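The row-guessing issue is easy to see with readr directly (a sketch with made-up data): by default only the first rows are used to guess a column's type, so a column that looks numeric early on but contains text far down the file gets mis-typed unless guess_max covers the whole file.

```r
library(readr)

# 1500 numeric-looking rows followed by one text value
csv <- paste(c("val", rep("1", 1500), "not-a-number"), collapse = "\n")

# With guess_max large enough to see every row, the text value is noticed
# and the whole column comes back as character
full_guess <- suppressMessages(read_csv(I(csv), guess_max = 2000))
```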

We are in talks with the WQP developers, and hopefully in the near term we will be able to use a web service to query what the column types should be. It will add some internet-call complexity, but make the output more robust and trustworthy.