DOI-USGS / dataRetrieval

This R package is designed to obtain USGS or EPA water quality sample data, streamflow data, and metadata directly from web services.
https://doi-usgs.github.io/dataRetrieval/
Other
256 stars 85 forks source link

findNLDI returns error when API return empty json #623

Closed mps9506 closed 1 year ago

mps9506 commented 2 years ago

Hi,

This isn't fully a bug.

Currently findNLDI returns a somewhat cryptic message and errors when making a query that returns nothing (valid query but nothing is found).

For example I know the following will return nothing: https://labs.waterdata.usgs.gov/api/nldi/linked-data/wqp/TCEQMAIN-10016/navigation/UM/nwissite?f=json&distance=2

{"type":"FeatureCollection"}

Here is what findNLDI returns when it encounters the same thing:

findNLDI(wqp = "TCEQMAIN-10016",
    nav = "UM",
    find = "nwissite",
    distance_km = 2,
    no_sf = FALSE)

Error: Cannot open "{"type":"FeatureCollection"}"; Check connection parameters.
In addition: Warning messages:
1: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  :
  GDAL Error 1: Invalid FeatureCollection object. Missing 'features' member.
2: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  :
  GDAL Error 1: Layer schema generation failed.
3: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  :
  GDAL Error 4: Failed to read GeoJSON data

I think for a user expecting a response, the current error and warning messages are hard to troubleshoot. The previous iteration of navigate_nldi in nhdPlusTools returned an empty tibble in a similar scenario: https://github.com/USGS-R/nhdplusTools/issues/97. Having a message indicating that no features were returned and returning a NULL or empty object/list might be beneficial for users.

As always, thanks for this awesome package!!! Happy to make a pull request if there is a way you prefer this handled.

ksonda commented 2 years ago

Possibly related, or at least could be resolved in the same place in the code https://github.com/internetofwater/nldi-services/issues/315

ldecicco-USGS commented 2 years ago

@dblodgett-usgs or @mikejohnson51 can one of you take a look?

mikejohnson51 commented 2 years ago

@ldecicco-USGS, @mps9506

Here is a working option that I can send as a PR:

library(dataRetrieval)
findNLDI(wqp = "TCEQMAIN-10016",
              nav = "UM",
              find = "nwissite",
              distance_km = 2,
              no_sf = FALSE)

#> Simple feature collection with 1 feature and 6 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -101.3427 ymin: 35.7422 xmax: -101.3427 ymax: 35.7422
#> Geodetic CRS:  WGS 84
#> # A tibble: 1 × 7
#>   sourceName        identifier comid name      X     Y            geometry
#>   <chr>             <chr>      <chr> <chr> <dbl> <dbl>         <POINT [°]>
#> 1 Water Quality Po… TCEQMAIN-… 1997… DIXO… -101.  35.7 (-101.3427 35.7422)

Created on 2022-07-08 by the reprex package (v2.0.1)

The "design" choice I used in this NLDI client was that if data didn't exist it would not be returned. So in the above case, since no upstream gages were found in 2km, there is no nwis data. This is different than the nhdplusTools design in which an empty tibble is returned of all requested data. For my use cases, the existing option is ideal, but I can see many cases where it could raise confusion. I am not wed to one or the other and changing the output to an empty tibble would be easy, as would eliciting a message that no data was found.

What do you think?

mps9506 commented 2 years ago

@mikejohnson51 Thanks! I think this approach plus a message makes sense.

mikejohnson51 commented 2 years ago

Awesome - now we're looking at something like this:

library(dataRetrieval)
xx = findNLDI(wqp = "TCEQMAIN-10016",
              nav = "UM",
              find = "nwissite",
              distance_km = 2,
              no_sf = FALSE)
#> No data found for: nwissite?distance=2

xx
#> Simple feature collection with 1 feature and 6 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -101.3427 ymin: 35.7422 xmax: -101.3427 ymax: 35.7422
#> Geodetic CRS:  WGS 84
#> # A tibble: 1 × 7
#>   sourceName        identifier comid name      X     Y            geometry
#>   <chr>             <chr>      <chr> <chr> <dbl> <dbl>         <POINT [°]>
#> 1 Water Quality Po… TCEQMAIN-… 1997… DIXO… -101.  35.7 (-101.3427 35.7422)

Created on 2022-07-08 by the reprex package (v2.0.1)

EthanGrahn commented 2 years ago

The response of {"type":"FeatureCollection"} is an issue with how the API is assembling the response. I would expect it to return something like this for an empty result.

{
    "type": "FeatureCollection",
    "features": []
}

I would expect findNLDI to handle it normally from there. I will create an issue in the NLDI repository to correct this.

ldecicco-USGS commented 2 years ago

@mps9506 and @EthanGrahn , do you all want to try installing like this:

remotes::install_github("USGS-R/dataRetrieval")

and let us know if you think we can close this issue?

EthanGrahn commented 2 years ago

I've updated the NLDI to output my previous example for empty results. This seems to introduce a different set of errors. It may be an issue with my tests because I don't typically work with R. Can anybody else confirm this? From what I can tell, findNLDI checks for an empty response (i.e. an empty string) to determine whether no features were found. It should, instead, be accounting for empty GeoJSON.

mps9506 commented 2 years ago

I updated to the development version and received the following error:

> findNLDI(wqp = "TCEQMAIN-10016",
+          nav = "UM",
+          find = "nwissite",
+          distance_km = 2,
+          no_sf = FALSE)
Error in if (sf::st_geometry_type(tmp)[1] == "POINT") { : 
  missing value where TRUE/FALSE needed

As @EthanGrahn notes, I believe this is because read_sf() now returns an empty sf object:

> df <- readLines("https://labs.waterdata.usgs.gov/api/nldi/linked-data/wqp/TCEQMAIN-10016/navigation/UM/nwissite?f=json&distance=2")
> sf::read_sf(df)
Simple feature collection with 0 features and 0 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
# A tibble: 0 × 1
# … with 1 variable: geometry <GEOMETRY [°]>

> sf::read_sf(df) |> sf::st_coordinates()
     [,1] [,2]

sessionInfo:

> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dataRetrieval_2.7.11.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3       compiler_4.2.1     pillar_1.7.0       class_7.3-20       remotes_2.4.2     
 [6] tools_4.2.1        jsonlite_1.8.0     lifecycle_1.0.1    tibble_3.1.7       pkgconfig_2.0.3   
[11] rlang_1.0.3        DBI_1.1.3          cli_3.3.0          rstudioapi_0.13    curl_4.3.2        
[16] e1071_1.7-11       dplyr_1.0.9        httr_1.4.3         xml2_1.3.3         generics_0.1.3    
[21] vctrs_0.4.1        classInt_0.4-7     grid_4.2.1         tidyselect_1.1.2   glue_1.6.2        
[26] sf_1.0-7           R6_2.5.1           fansi_1.0.3        purrr_0.3.4        magrittr_2.0.3    
[31] ellipsis_0.3.2     units_0.8-0        assertthat_0.2.1   KernSmooth_2.23-20 utf8_1.2.2        
EthanGrahn commented 2 years ago

This raises another question on my end. Would it be better for the end user to receive this empty collection, or be returned a 404 instead?

As a comparison, searching for an invalid source or ID will return a 404 with an error message. https://labs.waterdata.usgs.gov/api/nldi/linked-data/wqp/fake-id?f=json

mikejohnson51 commented 2 years ago

@mps9506 I'll need to dig into why you get that error, on my end I get the desired output and dont see any glaring differences in the package versions:

library(dataRetrieval)

findNLDI(wqp = "TCEQMAIN-10016",
                     nav = "UM",
                     find = "nwissite",
                     distance_km = 2,
                     no_sf = FALSE)
#> No data found for: nwissite?f=json&distance=2
#> Simple feature collection with 1 feature and 6 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -101.3427 ymin: 35.7422 xmax: -101.3427 ymax: 35.7422
#> Geodetic CRS:  WGS 84
#> # A tibble: 1 × 7
#>   sourceName        identifier comid name      X     Y            geometry
#>   <chr>             <chr>      <chr> <chr> <dbl> <dbl>         <POINT [°]>
#> 1 Water Quality Po… TCEQMAIN-… 1997… DIXO… -101.  35.7 (-101.3427 35.7422)

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dataRetrieval_2.7.11.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8.3       pillar_1.7.0       compiler_4.1.0     highr_0.9         
#>  [5] class_7.3-20       R.methodsS3_1.8.2  R.utils_2.12.0     tools_4.1.0       
#>  [9] digest_0.6.29      jsonlite_1.8.0     evaluate_0.15      lifecycle_1.0.1   
#> [13] tibble_3.1.7       R.cache_0.15.0     pkgconfig_2.0.3    rlang_1.0.3       
#> [17] reprex_2.0.1       DBI_1.1.3          cli_3.3.0          rstudioapi_0.13   
#> [21] curl_4.3.2         yaml_2.3.5         xfun_0.31          fastmap_1.1.0     
#> [25] e1071_1.7-11       dplyr_1.0.9        withr_2.5.0        styler_1.7.0      
#> [29] stringr_1.4.0      httr_1.4.3         knitr_1.39         xml2_1.3.3        
#> [33] generics_0.1.3     fs_1.5.2           vctrs_0.4.1        tidyselect_1.1.2  
#> [37] grid_4.1.0         classInt_0.4-7     glue_1.6.2         sf_1.0-7          
#> [41] R6_2.5.1           fansi_1.0.3        rmarkdown_2.14     purrr_0.3.4       
#> [45] magrittr_2.0.3     units_0.8-0        ellipsis_0.3.2     htmltools_0.5.2   
#> [49] assertthat_0.2.1   KernSmooth_2.23-20 utf8_1.2.2         proxy_0.4-27      
#> [53] stringi_1.7.6      crayon_1.5.1       R.oo_1.25.0

Created on 2022-07-11 by the reprex package (v2.0.1)

ldecicco-USGS commented 2 years ago

I was just testing this and I think I can reproduce both results. Initially I didn't have sf installed on my computer (I recently updated to R 4.2). Without sf, I get @mikejohnson51 's output:

> findNLDI(wqp = "TCEQMAIN-10016",
+          nav = "UM",
+          find = "nwissite",
+          distance_km = 2,
+          no_sf = FALSE)
            sourceName     identifier    comid                     name
1 Water Quality Portal TCEQMAIN-10016 19970652 DIXON CREEK NE OF BORGER
          X       Y
1 -101.3427 35.7422

When I installed sf, ran the same thing:

> xx = findNLDI(wqp = "TCEQMAIN-10016",
+               nav = "UM",
+               find = "nwissite",
+               distance_km = 2,
+               no_sf = FALSE)
 Error in if (sf::st_geometry_type(tmp)[1] == "POINT") { : 
missing value where TRUE/FALSE needed
mikejohnson51 commented 2 years ago

I have this on todays todo list. I have sf installed though so I think I'm missing something else 😭

Here is a reprex using when sf is installed locally:

library(dataRetrieval)

findNLDI(wqp = "TCEQMAIN-10016",
                     nav = "UM",
                     find = "nwissite",
                     distance_km = 2,
                     no_sf = TRUE)
#>             sourceName     identifier    comid                     name
#> 1 Water Quality Portal TCEQMAIN-10016 19970652 DIXON CREEK NE OF BORGER
#>           X       Y
#> 1 -101.3427 35.7422

findNLDI(wqp = "TCEQMAIN-10016",
         nav = "UM",
         find = "nwissite",
         distance_km = 2,
         no_sf = FALSE)
#> No data found for: nwissite?f=json&distance=2
#> Simple feature collection with 1 feature and 6 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -101.3427 ymin: 35.7422 xmax: -101.3427 ymax: 35.7422
#> Geodetic CRS:  WGS 84
#> # A tibble: 1 × 7
#>   sourceName        identifier comid name      X     Y            geometry
#>   <chr>             <chr>      <chr> <chr> <dbl> <dbl>         <POINT [°]>
#> 1 Water Quality Po… TCEQMAIN-… 1997… DIXO… -101.  35.7 (-101.3427 35.7422)

Created on 2022-07-13 by the reprex package (v2.0.1)

Removing sf, running TRUE/FALSE:

library(dataRetrieval)
remove.packages('sf')
#> Removing package from '/Library/Frameworks/R.framework/Versions/4.2/Resources/library'
#> (as 'lib' is unspecified)
#> Error in find.package(pkgs, lib): there is no package called 'sf'
findNLDI(wqp = "TCEQMAIN-10016",
         nav = "UM",
              find = "nwissite",
              distance_km = 2,
              no_sf = TRUE)
#>             sourceName     identifier    comid                     name
#> 1 Water Quality Portal TCEQMAIN-10016 19970652 DIXON CREEK NE OF BORGER
#>           X       Y
#> 1 -101.3427 35.7422

findNLDI(wqp = "TCEQMAIN-10016",
         nav = "UM",
         find = "nwissite",
         distance_km = 2,
         no_sf = FALSE)
#>             sourceName     identifier    comid                     name
#> 1 Water Quality Portal TCEQMAIN-10016 19970652 DIXON CREEK NE OF BORGER
#>           X       Y
#> 1 -101.3427 35.7422

Created on 2022-07-13 by the reprex package (v2.0.1)

Source of message:

https://github.com/USGS-R/dataRetrieval/blob/7182605f9f45574b768cd21bca376f5678688da6/R/findNLDI.R#L118

ldecicco-USGS commented 2 years ago

@mps9506 and @EthanGrahn can you try installing again and see if it looks better?

remotes::install_github("USGS-R/dataRetrieval")
mps9506 commented 2 years ago

I think this works well. I hesitate to ask since I'm not doing any of the work here. But should findNLDI always return a list or is a sf/tbl ok?

Valid result:

findNLDI(wqp = "TCEQMAIN-10032",
         nav = "UM",
         find = "nwissite",
         distance_km = 2,
         no_sf = TRUE) |> 
  str()

# List of 2
#  $ origin     :'data.frame':  1 obs. of  6 variables:
#   ..$ sourceName: chr "Water Quality Portal"
#   ..$ identifier: chr "TCEQMAIN-10032"
#   ..$ comid     : chr "19972126"
#   ..$ name      : chr "CANADIAN RIVER AT US 60-83"
#   ..$ X         : num -100
#   ..$ Y         : num 35.9
#  $ UM_nwissite:'data.frame':  1 obs. of  8 variables:
#   ..$ sourceName: chr "NWIS Surface Water Sites"
#   ..$ identifier: chr "USGS-07228000"
#   ..$ comid     : chr "19972126"
#   ..$ measure   : num 79.5
#   ..$ reachcode : chr "11090106000027"
#   ..$ name      : chr "Canadian Rv nr Canadian, TX"
#   ..$ X         : num -100
#   ..$ Y         : num 35.9

compared to:

findNLDI(wqp = "TCEQMAIN-10016",
         nav = "UM",
         find = "nwissite",
         distance_km = 2,
         no_sf = FALSE) |> 
  str()

# sf [1 × 7] (S3: sf/tbl_df/tbl/data.frame)
#  $ sourceName: chr "Water Quality Portal"
#  $ identifier: chr "TCEQMAIN-10016"
#  $ comid     : chr "19970652"
#  $ name      : chr "DIXON CREEK NE OF BORGER"
#  $ X         : num -101
#  $ Y         : num 35.7
#  $ geometry  :sfc_POINT of length 1; first list element:  'XY' num [1:2] -101.3 35.7
#  - attr(*, "sf_column")= chr "geometry"
#  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA
#   ..- attr(*, "names")= chr [1:6] "sourceName" "identifier" "comid" "name" ...
# Warning message:
# No data returned for: https://labs.waterdata.usgs.gov/api/nldi/linked-data/wqp/TCEQMAIN-10016/navigation/UM/nwissite?f=json&distance=2 
dblodgett-usgs commented 2 years ago

I think it probably should -- I've only ever worked with no_sf TRUE or FALSE in isolation though so never really thought about the difference. Good catch. What do you think @ldecicco-USGS ?

mikejohnson51 commented 2 years ago

This issue here is that in findNLDI single element returns are given as an object while multi element returns are given as a list. The default find option is flowline. Flowlines (and any non-point objects) can only be parsed when sf is active. So when sf is off, the retuned object is just the origin (single element). When its on you get the origin and upper mainstem (multi element).

dblodgett-usgs commented 2 years ago

Right -- but why not return a list with one element when you just have that origin to return?

mikejohnson51 commented 2 years ago

🤷 personal preference haha. I don't like the extra structure when its not needed. I'm happy to change it though if there consensus it makes more sense.

ldecicco-USGS commented 1 year ago

Sorry I didn't really voice an opinion on this one. I think @dblodgett-usgs 's suggestion of the list of one makes sense because then scripts won't break if you have one or many returns. @mikejohnson51 does that sound OK? (and more importantly, can you make that change?)

ldecicco-USGS commented 1 year ago

@dblodgett-usgs @mikejohnson51 I'm hoping to do an update to dataRetrieval in the next couple of weeks. Should we switch this to a list of 1 element so that the user doesn't can assume a consistent type of return no matter what the size?

dblodgett-usgs commented 1 year ago

I think that is the right approach, yeah.

mikejohnson51 commented 1 year ago

PR incoming 😃

library(dataRetrieval)

findNLDI(wqp = "TCEQMAIN-10016",
         nav = "UM",
         find = "nwissite",
         distance_km = 20,
         no_sf = FALSE, 
         warn = FALSE)
#> $origin
#> Simple feature collection with 1 feature and 6 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -101.3427 ymin: 35.7422 xmax: -101.3427 ymax: 35.7422
#> Geodetic CRS:  WGS 84
#> # A tibble: 1 × 7
#>   sourceName           ident…¹ comid name      X     Y            geometry
#>   <chr>                <chr>   <chr> <chr> <dbl> <dbl>         <POINT [°]>
#> 1 Water Quality Portal TCEQMA… 1997… DIXO… -101.  35.7 (-101.3427 35.7422)
#> # … with abbreviated variable name ¹​identifier
#> 
#> $UM_nwissite
#> Simple feature collection with 1 feature and 8 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -101.351 ymin: 35.66476 xmax: -101.351 ymax: 35.66476
#> Geodetic CRS:  WGS 84
#> # A tibble: 1 × 9
#>   sourceName               identifier    comid measure reach…¹ name      X     Y
#>   <chr>                    <chr>         <chr>   <dbl> <chr>   <chr> <dbl> <dbl>
#> 1 NWIS Surface Water Sites USGS-07227920 1997…    56.0 110901… Dixo… -101.  35.7
#> # … with 1 more variable: geometry <POINT [°]>, and abbreviated variable name
#> #   ¹​reachcode

findNLDI(wqp = "TCEQMAIN-10016",
         nav = "UM",
         find = "nwissite",
         distance_km = 2,
         no_sf = TRUE,
         warn = FALSE)
#> $origin
#>             sourceName     identifier    comid                     name
#> 1 Water Quality Portal TCEQMAIN-10016 19970652 DIXON CREEK NE OF BORGER
#>           X       Y
#> 1 -101.3427 35.7422

Created on 2022-11-01 by the reprex package (v2.0.1)