cboettig / contentid

:package: R package for working with Content Identifiers
http://cboettig.github.io/contentid

Error in basename(path) : a character vector argument expected #87

Closed joelnitta closed 1 year ago

joelnitta commented 1 year ago

Sorry, this is hard to provide as a reprex because I believe it depends on the behavior of the server I'm trying to download data from. I encounter this error when using contentid::resolve(). I think it happens because for some reason downloading from the URL doesn't work (times out, etc.). Then the path passed to valid_store_path() is NULL and we get Error in basename(path) : a character vector argument expected. (I think the original download error is being hidden by tryCatch() here.)

So there should be a clearer error message if the failure is indeed due to an inability to download the file.
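The distinction is easy to see in base R: basename() handles NA_character_ fine but errors on NULL, which matches the error above:

```r
# basename() handles a missing value, but not NULL:
basename(NA_character_)
#> [1] NA

basename(NULL)
#> Error in basename(path) : a character vector argument expected
```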

I encountered it with this registry (but like I said, sometimes it works, sometimes it doesn't; I think it has something to do with the server. It's FTP, I don't know if that affects anything):

Call this "local_contentid.tsv":

```
identifier  source  date    size    status  md5 sha1    sha256  sha384  sha512
hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a  http://sftp.kew.org/pub/data-repositories/WCVP/wcvp_dwca.zip    2022-12-07T22:17:58Z    NA  200 NA  NA  hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a  NA  NA
```

Command that generates the error:

```
> contentid::resolve("hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a", registries = "local_contentid.tsv")
Error in basename(path) : a character vector argument expected
```
contentid info from renv:

```json
"contentid": {
  "Package": "contentid",
  "Version": "0.0.16",
  "Source": "GitHub",
  "RemoteType": "github",
  "RemoteHost": "api.github.com",
  "RemoteUsername": "cboettig",
  "RemoteRepo": "contentid",
  "RemoteRef": "master",
  "RemoteSha": "c87ba68be3dc926395d26a550b29fe95eb4915e3",
  "Hash": "bee3595183bb57c7f8352684cfaf2503",
  "Requirements": ["curl", "fs", "httr", "openssl"]
}
```

```
> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
 [1] forcats_0.5.1            stringr_1.4.1            dplyr_1.0.9
 [4] purrr_0.3.4              readr_2.1.2              tidyr_1.2.0
 [7] tibble_3.1.8             ggplot2_3.3.6            tidyverse_1.3.2
[10] future.callr_0.8.0       future_1.28.0            taxastand_1.0.0
[13] CoordinateCleaner_2.0-20 rgbif_3.7.3              assertr_2.8
[16] tarchetypes_0.7.0        targets_0.13.1           contentid_0.0.16
[19] languageserver_0.3.14

loaded via a namespace (and not attached):
 [1] fs_1.5.2            sf_1.0-8            lubridate_1.8.0
 [4] oai_0.3.2           bit64_4.0.5         httr_1.4.4
 [7] tools_4.2.1         backports_1.4.1     utf8_1.2.2
[10] rgdal_1.5-32        R6_2.5.1            KernSmooth_2.23-20
[13] rgeos_0.5-9         DBI_1.1.3           lazyeval_0.2.2
[16] colorspace_2.0-3    raster_3.5-29       withr_2.5.0
[19] sp_1.5-0            tidyselect_1.1.2    processx_3.7.0
[22] bit_4.0.4           curl_4.3.2          compiler_4.2.1
[25] rvest_1.0.2         cli_3.3.0           archive_1.1.5
[28] xml2_1.3.3           scales_1.2.1        classInt_0.4-7
[31] callr_3.7.2         proxy_0.4-27        askpass_1.1
[34] digest_0.6.29       pkgconfig_2.0.3     parallelly_1.32.1
[37] dbplyr_2.1.1        readxl_1.4.0        rlang_1.0.5
[40] generics_0.1.3      jsonlite_1.8.0      vroom_1.5.7
[43] googlesheets4_1.0.0 magrittr_2.0.3      geosphere_1.5-14
[46] Rcpp_1.0.9          munsell_0.5.0       fansi_1.0.3
[49] lifecycle_1.0.1     terra_1.6-7         stringi_1.7.8
[52] whisker_0.4         yaml_2.3.5          plyr_1.8.7
[55] grid_4.2.1          parallel_4.2.1      listenv_0.8.0
[58] crayon_1.5.1        lattice_0.20-45     haven_2.5.0
[61] conditionz_0.1.0    hms_1.1.2           knitr_1.40
[64] ps_1.7.1            pillar_1.8.1        igraph_1.3.4
[67] uuid_1.1-0          base64url_1.4       codetools_0.2-18
[70] reprex_2.0.1        glue_1.6.2          modelr_0.1.8
[73] data.table_1.14.2   renv_0.15.5         vctrs_0.4.1
[76] tzdb_0.3.0          cellranger_1.1.0    gtable_0.3.1
[79] openssl_2.0.2       assertthat_0.2.1    xfun_0.32
[82] broom_0.8.0         e1071_1.7-11        rnaturalearth_0.1.0
[85] class_7.3-20        googledrive_2.0.0   gargle_1.2.0
[88] units_0.8-0         globals_0.16.1      ellipsis_0.3.2
```
cboettig commented 1 year ago

thanks for the detailed report!

I think I've been able to isolate this with your help and have patched an issue in https://github.com/cboettig/contentid/pull/88. As you observed, the tryCatch conditions fall back to a NULL path, which isn't parsable by basename(), instead of NA_character_.

Still not sure precisely what triggers the conditions of that failure, though. Most bad download errors should pass over and ultimately return NA. For example, take this example but alter local_contentid.tsv to point at a bad URL (e.g. one that returns a 408 timeout error):


```r
url <- "http://sftp.kew.org/pub/data-repositories/WCVP/wcvp_dwca.zip"
id <- contentid::register(url, registries = "local_contentid.tsv")
reg <- readr::read_tsv("local_contentid.tsv")
reg$source[which(reg$source == url)] <- "http://httpbin.org/status/408"
readr::write_tsv(reg, "local_contentid.tsv")
contentid::resolve(id)
```

resolves to NA already. I think what happens is that download_resource is getting something it thinks is neither a URL (contentid:::is_url, which includes ftp://) nor a local file path, and so was returning NULL when it should have returned NA_character_. I think this could happen if the source was a local file path that was later deleted?
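A minimal sketch of the pattern described above, in base R only (safe_download is a hypothetical stand-in for the contentid internals, not the actual implementation): every failure branch returns NA_character_ rather than NULL, so downstream calls like basename() degrade gracefully instead of erroring.

```r
# Hypothetical sketch: a download wrapper whose failure branches all return
# NA_character_, never NULL.
safe_download <- function(source) {
  if (!is.character(source) || length(source) != 1) {
    return(NA_character_)  # unusable source (e.g. NULL): NA, not NULL
  }
  tryCatch(
    {
      dest <- tempfile()
      utils::download.file(source, dest, quiet = TRUE)
      dest
    },
    error   = function(e) NA_character_,  # failed download also yields NA
    warning = function(w) NA_character_
  )
}

# basename() now always receives a character vector:
basename(safe_download(NULL))
#> [1] NA
```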

cboettig commented 1 year ago

Can you let me know if #88 seems to fix this?

joelnitta commented 1 year ago

I tried with #88, but still get the same error (I simulated a download problem by turning my wifi off shortly after starting resolve()):

```
> contentid::resolve("hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a", registries = "local_contentid.tsv")
Error in basename(path) : a character vector argument expected
```

Although what you say about

I think this could happen if the source was a local file path that was later deleted?

rings a bell... I actually see the same problem using another local registry that includes entries for the same hash as local files, which I have since deleted (I was trying those as a work-around).

... which leads to a related question: is it OK to include paths to non-existent local files in the registry so they can be used as a backup in case the download doesn't work?

cboettig commented 1 year ago

thanks! hmm..... Can you share the full local_contentid.tsv?

Can you show me the full error trace (e.g. via options(error = recover)) and the value of path at the error? (valid_store_path() is called at two different points in resolve(), so I'm trying to figure out which one.)

Also, do you get NA or an error with my toy reprex above?

joelnitta commented 1 year ago

Sorry for the delay.

Full local_contentid.tsv:

```
identifier  source  date    size    status  md5 sha1    sha256  sha384  sha512
hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a  _targets/user/data/wcvp_dwca.zip    2022-12-07T05:57:19Z    79537053    200 NA  NA  hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a  NA  NA
hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a  /_targets/user/data/wcvp_dwca.zip   2022-12-07T06:00:07Z    79537053    200 NA  NA  hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a  NA  NA
hash://sha256/10e51d931fc145d5c30f664c9e5dad62ae68f2cb79367e0ceb827a20e6884f64  _targets/user/data/wcvp.zip 2022-12-07T06:07:10Z    84920347    200 NA  NA  hash://sha256/10e51d931fc145d5c30f664c9e5dad62ae68f2cb79367e0ceb827a20e6884f64  NA  NA
hash://sha256/10e51d931fc145d5c30f664c9e5dad62ae68f2cb79367e0ceb827a20e6884f64  /_targets/user/data/wcvp.zip    2022-12-07T06:05:34Z    84920347    200 NA  NA  hash://sha256/10e51d931fc145d5c30f664c9e5dad62ae68f2cb79367e0ceb827a20e6884f64  NA  NA
hash://sha256/19f5844960f7985f012a23490cbc5173c96ceab85b04bc99e45c092eb37ff9ba  _targets/user/data/taxize_apg_families.csv  2022-12-07T06:53:33Z    130957  200 NA  NA  hash://sha256/19f5844960f7985f012a23490cbc5173c96ceab85b04bc99e45c092eb37ff9ba  NA  NA
hash://sha256/19f5844960f7985f012a23490cbc5173c96ceab85b04bc99e45c092eb37ff9ba  /_targets/user/data/taxize_apg_families.csv 2022-12-07T06:53:28Z    130957  200 NA  NA  hash://sha256/19f5844960f7985f012a23490cbc5173c96ceab85b04bc99e45c092eb37ff9ba  NA  NA
hash://sha256/19f5844960f7985f012a23490cbc5173c96ceab85b04bc99e45c092eb37ff9ba  data_local/taxize_apg_families.csv  2022-12-08T01:38:45Z    130957  200 NA  NA  hash://sha256/19f5844960f7985f012a23490cbc5173c96ceab85b04bc99e45c092eb37ff9ba  NA  NA
hash://sha256/df202da01ea58a46c6f7a7674bd61f438fbef6c43f3ca1f05855852fab6eb8a5  /_targets/user/data/wgsrpd/level3/level3.shp    2022-12-07T07:51:59Z    7282396 200 NA  NA  hash://sha256/df202da01ea58a46c6f7a7674bd61f438fbef6c43f3ca1f05855852fab6eb8a5  NA  NA
hash://sha256/df202da01ea58a46c6f7a7674bd61f438fbef6c43f3ca1f05855852fab6eb8a5  _targets/user/data/wgsrpd/level3/level3.shp 2022-12-07T07:52:05Z    7282396 200 NA  NA  hash://sha256/df202da01ea58a46c6f7a7674bd61f438fbef6c43f3ca1f05855852fab6eb8a5  NA  NA
hash://sha256/10e51d931fc145d5c30f664c9e5dad62ae68f2cb79367e0ceb827a20e6884f64  http://sftp.kew.org/pub/data-repositories/WCVP/wcvp.zip 2022-12-07T22:16:07Z    NA  200 NA  NA  hash://sha256/10e51d931fc145d5c30f664c9e5dad62ae68f2cb79367e0ceb827a20e6884f64  NA  NA
hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a  http://sftp.kew.org/pub/data-repositories/WCVP/wcvp_dwca.zip    2022-12-07T22:17:58Z    NA  200 NA  NA  hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a  NA  NA
hash://sha256/d3772de1d07d2f3f093e4b0efdb14c854ce0ba212ee36cbadeb53a82c45e49de  https://github.com/tdwg/wgsrpd/archive/refs/heads/master.zip    2022-12-08T01:59:27Z    NA  200 NA  NA  hash://sha256/d3772de1d07d2f3f093e4b0efdb14c854ce0ba212ee36cbadeb53a82c45e49de  NA  NA
```

Full error trace:

```
> file <- wcvp_hash %>%
+   contentid::resolve(registries = local_registry)
Error in basename(path) : a character vector argument expected

Enter a frame number, or 0 to exit

1: wcvp_hash %>% contentid::resolve(registries = local_registry)
2: contentid::resolve(., registries = local_registry)
3: valid_store_path(id, path)
4: basename(path)
```

When I try with your reprex, I get:

```
[1] NA
Warning message:
In contentid::resolve(id) :
  No sources found for hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a
```
cboettig commented 1 year ago

ok cool. yeah, so it almost surely has something to do with the local sources not being there any more. But it should still return NA if it can't find any other sources for the same hash.

Side question really, but how do you generate your local sources anyway? I usually try to use store() or resolve(..., store = TRUE), which puts content in a local 'content store' -- i.e., stored by filepath = hash. That can obviously eat up disk space quickly, though we have a utility to purge older files. In your case it could also save you space -- looks like you have identical hash objects at multiple locations on your disk.
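A rough sketch of that content-store workflow (the file here is created just for illustration; in practice you'd pass your real data file or a URL):

```r
library(contentid)

# make a small example file to store
f <- tempfile(fileext = ".txt")
writeLines("example content", f)

# store() copies the file into the local content store, named by its hash,
# and returns the content identifier:
id <- store(f)

# resolve() maps the identifier back to a usable path; with store = TRUE,
# remote sources are downloaded and cached in the same content store:
path <- resolve(id, store = TRUE)
```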

joelnitta commented 1 year ago

but how do you generate your local sources anyway?

```r
register(file_path, "local_contentid.tsv")
```

The reason for identical hash objects at multiple locations is that I am using a targets plan where I want the user to have the option of using a _targets cache (folder) within the project (default usage) or a shared cache located somewhere else (in this case, actually a Dropbox folder, mounted to the root of a docker container where I'm running the code). Kinda complicated, but I was hoping it would work: my understanding was that even if there are multiple identical hashes at different locations in the registry, contentid would just try each until it finds one that works. Though I now realize that might not work so well if some are local and some are remote, since we would want to prioritize using local first, and I don't know if you have it set up to do that.

(I was just going to do the URL, but since it kept failing, I thought the local files could act as a fall-back: the user could manually download them and put them in _targets/user/data and contentid would find them there).

joelnitta commented 1 year ago

Another question is why the download is failing so often in the first place. The files aren't particularly large (< 100 MB). wget and curl in the terminal both seem to work fine 🤔

joelnitta commented 1 year ago

OK, I think I've figured out a bit more of what is going on... it looks like the remote file (http://sftp.kew.org/pub/data-repositories/WCVP/wcvp.zip) did change recently (Dec 8 - the same day I filed my issue! 🤯) and now has a different hash. So I think the reason the download did not work is that contentid was functioning correctly, and not returning the downloaded file path because the hash didn't match. But then, with no local copy to use in _targets/user/data, it gave the strange Error in basename(path) error because it tried every entry in the local registry, and the last thing it tried was a non-existent local file.
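The hash-mismatch behavior can be demonstrated locally too: content_id() recomputes a file's hash, and any change to the bytes changes the identifier, which is why resolve() rejects a download whose content no longer matches the registered hash:

```r
library(contentid)

f <- tempfile()
writeLines("original bytes", f)
id_before <- content_id(f)

# simulate the server replacing the file with new content:
writeLines("updated bytes", f)
id_after <- content_id(f)

# the identifiers no longer match, so the old id can't resolve to this file
identical(id_before, id_after)
#> [1] FALSE
```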

cboettig commented 1 year ago

ok, very cool. yeah, there's definitely a bug somewhere in path winding up NULL instead of NA and thus throwing an error, though other than the patch in #88 I haven't figured out precisely how that happens.

i'm kinda on the fence about the idea of allowing register() to register local paths in place. In general, if you just swap store() for register(), you'll get an id back, but it will also copy the object to the content store. I suspect this will give you more robust behavior.

Like you say, contentid doesn't really care where the object is located, and is happy to try multiple sources, starting with local sources (since that is almost surely the fastest option) and then trying any URLs or official data repositories like Zenodo, Software Heritage, or DataONE. But of course the bytes have to exist somewhere, or it will come back NA. And it can't tell between a truly local path and something that just looks local but is really a remotely mounted ftp, nfs, or similar. Also, eventually having a copy of the object in a permanent scientific archive is the most robust fall-back, though often also the slowest to access.

So, take-aways: