VLucet / rgovcan

Easy access to the Canadian Open Government Portal
https://vlucet.github.io/rgovcan/

Error retrieving uuid f612e2b4-5c67-46dc-9a84-1154c649ab4e #19

Closed SteveViss closed 1 year ago

SteveViss commented 1 year ago

Hi,

Is anyone able to reproduce the following issue when retrieving the uuid f612e2b4-5c67-46dc-9a84-1154c649ab4e?

library(rgovcan)
uid <- "f612e2b4-5c67-46dc-9a84-1154c649ab4e"
output <- "data/"
if(!file.exists(output)) dir.create(output)
govcan_dl_resources(resources = uid, path = output)

It gave me the following error:

ℹ Searching for dataset with id:  f612e2b4-5c67-46dc-9a84-1154c649ab4e
ℹ Record found: "Atlas of Seabirds at Sea in Eastern Canada 2006-2016"
ℹ Atlas of Seabirds at Sea in Eastern Canada 2006-2016 (format: esri rest - size: 0 bytes) ⚠ skipped (not supported).
ℹ Atlas of Seabirds at Sea in Eastern Canada 2006-2016 (format: esri rest - size: 0 bytes) ⚠ skipped (not supported).
ℹ Atlas of Seabirds at Sea in Eastern Canada 2006-2016 (format: wms - size: 0 bytes) ⚠ skipped (not supported).
ℹ Atlas of Seabirds at Sea in Eastern Canada 2006-2016 (format: wms - size: 0 bytes) ⚠ skipped (not supported).
Error in if (x <= 0) 0L else min(as.integer(log(x, base = base)), length(units_map) -  : 
  the condition has length > 1

It seems that all identified resources have a size of 0 bytes, which broke the underlying operations.

Traceback

14: format.object_size(structure(tmp, class = "object_size"), units = "auto")
13: format(structure(tmp, class = "object_size"), units = "auto")
12: get_remote_file_size(url)
11: ifelse(fmt != "other", get_remote_file_size(url), "unknown")
10: msgDownload(url, fmt, resources$name)
9: govcan_dl_resources.ckan_resource(X[[i]], ...)
8: FUN(X[[i]], ...)
7: lapply(resources, govcan_dl_resources, ...)
6: govcan_dl_resources.ckan_resource_stack(govcan_get_resources(resources), 
       ...)
5: govcan_dl_resources(govcan_get_resources(resources), ...)
4: govcan_dl_resources.character(resources = uid, path = output)
3: govcan_dl_resources(resources = uid, path = output)
2: withCallingHandlers(expr, warning = function(w) if (inherits(w, 
       classes)) tryInvokeRestart("muffleWarning"))
1: suppressWarnings(govcan_dl_resources(resources = uid, path = output))

Environment

sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rgovcan_1.0.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9       fansi_1.0.3      utf8_1.2.2       dbplyr_2.2.1    
 [5] crayon_1.5.2     dplyr_1.0.10     assertthat_0.2.1 crul_1.3        
 [9] R6_2.5.1         jsonlite_1.8.3   lifecycle_1.0.3  DBI_1.1.3       
[13] magrittr_2.0.3   ckanr_0.6.0      pillar_1.8.1     rlang_1.0.6     
[17] cli_3.4.1        curl_4.3.3       vctrs_0.5.0      generics_0.1.3  
[21] urltools_1.7.3   glue_1.6.2       triebeard_0.3.0  compiler_4.2.2  
[25] pkgconfig_2.0.3  tidyselect_1.2.0 tibble_3.1.8     httpcode_0.3.0  

SteveViss commented 1 year ago

I will investigate further; let me know if you can reproduce it.

KevCaz commented 1 year ago

Same here.

VLucet commented 1 year ago

Yes I can reproduce! I'll take a look on Friday :)

VLucet commented 1 year ago

Another day, another example of R's double pain of lack of type safety and terrible error messages biting us in the rear.

This seems to happen when the URL returns a 301 Moved Permanently response code:

 [1] "HTTP/1.1 301 Moved Permanently\r\n"
 [2] "Server: Microsoft-Azure-Application-Gateway/v2\r\n"                                                          
 [3] "Date: Fri, 18 Nov 2022 15:07:52 GMT\r\n"                                                                     
 [4] "Content-Type: text/html\r\n"                                                                                 
 [5] "Content-Length: 195\r\n"                                                                                     
 [6] "Connection: keep-alive\r\n"                                                                                  
 [7] "Location: https://data.ec.gc.ca/data/species/assess/atlas-of-seabirds-at-sea-in-eastern-canada-2006-2016\r\n"
 [8] "\r\n"                                                                                                        
 [9] "HTTP/1.1 200 OK\r\n"                                                                                         
[10] "Date: Fri, 18 Nov 2022 15:07:53 GMT\r\n"                                                                     
[11] "Content-Type: text/html\r\n"                                                                                 
[12] "Content-Length: 3058\r\n"                                                                                    
[13] "Connection: keep-alive\r\n"                                                                                  
[14] "Cache-Control: public, must-revalidate, max-age=30\r\n"                                                      
[15] "Last-Modified: Wed, 15 Dec 2021 17:10:02 GMT\r\n"                                                            
[16] "Accept-Ranges: bytes\r\n"                                                                                    
[17] "ETag: \"63241143\"\r\n"                                                                                      
[18] "Strict-Transport-Security: max-age=31536000; includeSubDomains\r\n"                                          
[19] "Referrer-Policy: same-origin\r\n"                                                                            
[20] "X-Content-Type-Options: nosniff\r\n"                                                                         
[21] "X-XSS-Protection: 1; mode=block\r\n"                                                                         
[22] "X-DNS-Prefetch-Control: off\r\n"                                                                             
[23] "\r\n"                                                                                                        
attr(,"status")
[1] 200

The following line in the call to get_remote_file_size:

tmp <- as.numeric(
    gsub("\\D", "", hdr[grepl("^Content-Length:", hdr)])
  )

ends up matching two elements instead of the expected one, because each response in the redirect chain carries its own Content-Length header:

Browse[1]> tmp
[1]  195 3058

The next call to format then fails with an unhelpful error message, which in turn breaks the parent call to msgDownload.
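
For context, since R 4.2.0 an `if` condition of length greater than one is an error rather than a warning, which is why a two-element `tmp` is fatal on the R 4.2.2 setup reported above. A minimal illustration using the two values from the debugger output (the `msg` variable and the simplified `if` expression are only for this example, not taken from the package):

```r
# The two Content-Length values picked up across the redirect chain
tmp <- c(195, 3058)

# `tmp <= 0` is a logical vector of length 2, so `if` rejects it (R >= 4.2.0)
msg <- tryCatch(
  if (tmp <= 0) 0L else 1L,
  error = function(e) conditionMessage(e)
)
print(msg)
```

On older R versions the same expression would only raise a warning and silently use the first element.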

@KevCaz do you agree that the "true" file size should be the second one in such an example? If so, we can make sure to only pass the last element when tmp has more than one element.
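
A rough sketch of that idea, assuming the headers arrive as a character vector like the one printed above. `last_content_length` is a hypothetical helper written for illustration and is not necessarily what the PR does:

```r
# Hypothetical helper: return the Content-Length of the final response
# in a header dump that may span several responses (e.g. 301 then 200).
last_content_length <- function(hdr) {
  # One matching line per response in the redirect chain
  len_lines <- hdr[grepl("^Content-Length:", hdr)]
  if (length(len_lines) == 0L) {
    return(NA_real_)  # no size advertised at all
  }
  sizes <- as.numeric(gsub("\\D", "", len_lines))
  # The last response is the one that is actually served
  sizes[length(sizes)]
}

# Trimmed-down version of the headers shown earlier in this thread
hdr <- c(
  "HTTP/1.1 301 Moved Permanently\r\n",
  "Content-Length: 195\r\n",
  "\r\n",
  "HTTP/1.1 200 OK\r\n",
  "Content-Length: 3058\r\n",
  "\r\n"
)

size <- last_content_length(hdr)
# A single value is safe to hand to format()
size_label <- format(structure(size, class = "object_size"), units = "auto")
print(size_label)
```

The `NA` branch is an extra guard for responses without any Content-Length header; a caller would need to check for it before formatting.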

VLucet commented 1 year ago

@SteveViss Can you confirm it all works well for you when you install from PR #20? I made it work on my machine with that fix.

SteveViss commented 1 year ago

@VLucet it works! Thanks for taking the time on this!