thanks for the detailed report!
I think I've been able to isolate this with your help and have patched an issue in https://github.com/cboettig/contentid/pull/88. As you observed, the tryCatch conditions fall back on a NULL path, which isn't parsable by basename(), instead of NA_character_.
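The distinction matters because basename() handles NA_character_ but errors on NULL. A minimal sketch of the pattern the patch moves toward (safe_download() here is hypothetical, not the package's actual code):

```r
basename(NA_character_)  # returns NA, which downstream checks can handle
try(basename(NULL))      # Error in basename(path) : a character vector argument expected

# Hypothetical fallback: the error handler returns NA_character_ rather than
# NULL, so later calls like basename(path) don't error out.
safe_download <- function(url, dest = tempfile()) {
  tryCatch({
    curl::curl_download(url, dest)
    dest
  },
  error = function(e) NA_character_
  )
}
```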
Still not sure precisely what triggers the conditions of that failure, though. Most bad download errors should pass over and ultimately return NA. For example, take this reprex, but alter local_contentid.tsv to point at a bad URL (e.g. one that returns a 408 timeout error):
```r
url <- "http://sftp.kew.org/pub/data-repositories/WCVP/wcvp_dwca.zip"
id <- contentid::register(url, registries = "local_contentid.tsv")
reg <- readr::read_tsv("local_contentid.tsv")
reg$source[which(reg$source == url)] <- "http://httpbin.org/status/408"
readr::write_tsv(reg, "local_contentid.tsv")
contentid::resolve(id)
```
This already resolves to NA. I think what happens is that download_resource() is getting something it thinks is neither a URL (per contentid:::is_url, which includes ftp://) nor a local file path, and so was returning NULL when it should have returned NA_character_. I think this could happen if the source was a local file path that was later deleted?
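In other words, the dispatch presumably looks something like this hypothetical sketch (not the package's actual internals; safe_download() is the toy helper from above):

```r
# Hypothetical sketch of the source-dispatch logic described above:
dispatch_source <- function(source) {
  if (contentid:::is_url(source)) {   # matches http://, https://, ftp:// sources
    safe_download(source)             # toy download helper from above
  } else if (file.exists(source)) {
    source                            # a local path that still exists
  } else {
    NA_character_                     # e.g. a registered local path that was
  }                                   # deleted; returning NULL here would
}                                     # break basename() downstream
```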
Can you let me know if #88 seems to fix this?
I tried with #88, but still get the same error (I simulated a download problem by turning my wifi off shortly after starting resolve()):
```r
> contentid::resolve("hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a", registries = "local_contentid.tsv")
Error in basename(path) : a character vector argument expected
```
Although what you say here rings a bell:

> I think this could happen if the source was a local file path that was later deleted?

I actually see the same problem using another local registry that includes entries for the same hash as local files, which I have since deleted (I was trying those as a work-around).
... which leads to a related question: is it OK to include paths to non-existent local files in the registry so they can be used as a backup in case the download doesn't work?
thanks! hmm... Can you share the full local_contentid.tsv?
Can you show me the full error trace (e.g. via options(error = recover)) and the value of path at the error? (valid_store_path() is called at two different points in resolve(), so I'm trying to figure out which one.)
Also, do you get NA or an error with my toy reprex above?
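For reference, one way to capture that:

```r
# Re-run under the post-mortem debugger, then pick the valid_store_path() frame:
options(error = recover)
contentid::resolve(id, registries = "local_contentid.tsv")
# Enter the frame number for valid_store_path(id, path), then inspect:
# Browse[1]> path
options(error = NULL)  # restore default error handling afterwards
```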
Sorry for the delay.
Full local_contentid.tsv:
```
identifier source date size status md5 sha1 sha256 sha384 sha512
hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a _targets/user/data/wcvp_dwca.zip 2022-12-07T05:57:19Z 79537053 200 NA NA hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a NA NA
hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a /_targets/user/data/wcvp_dwca.zip 2022-12-07T06:00:07Z 79537053 200 NA NA hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a NA NA
hash://sha256/10e51d931fc145d5c30f664c9e5dad62ae68f2cb79367e0ceb827a20e6884f64 _targets/user/data/wcvp.zip 2022-12-07T06:07:10Z 84920347 200 NA NA hash://sha256/10e51d931fc145d5c30f664c9e5dad62ae68f2cb79367e0ceb827a20e6884f64 NA NA
hash://sha256/10e51d931fc145d5c30f664c9e5dad62ae68f2cb79367e0ceb827a20e6884f64 /_targets/user/data/wcvp.zip 2022-12-07T06:05:34Z 84920347 200 NA NA hash://sha256/10e51d931fc145d5c30f664c9e5dad62ae68f2cb79367e0ceb827a20e6884f64 NA NA
hash://sha256/19f5844960f7985f012a23490cbc5173c96ceab85b04bc99e45c092eb37ff9ba _targets/user/data/taxize_apg_families.csv 2022-12-07T06:53:33Z 130957 200 NA NA hash://sha256/19f5844960f7985f012a23490cbc5173c96ceab85b04bc99e45c092eb37ff9ba NA NA
hash://sha256/19f5844960f7985f012a23490cbc5173c96ceab85b04bc99e45c092eb37ff9ba /_targets/user/data/taxize_apg_families.csv 2022-12-07T06:53:28Z 130957 200 NA NA hash://sha256/19f5844960f7985f012a23490cbc5173c96ceab85b04bc99e45c092eb37ff9ba NA NA
hash://sha256/19f5844960f7985f012a23490cbc5173c96ceab85b04bc99e45c092eb37ff9ba data_local/taxize_apg_families.csv 2022-12-08T01:38:45Z 130957 200 NA NA hash://sha256/19f5844960f7985f012a23490cbc5173c96ceab85b04bc99e45c092eb37ff9ba NA NA
hash://sha256/df202da01ea58a46c6f7a7674bd61f438fbef6c43f3ca1f05855852fab6eb8a5 /_targets/user/data/wgsrpd/level3/level3.shp 2022-12-07T07:51:59Z 7282396 200 NA NA hash://sha256/df202da01ea58a46c6f7a7674bd61f438fbef6c43f3ca1f05855852fab6eb8a5 NA NA
hash://sha256/df202da01ea58a46c6f7a7674bd61f438fbef6c43f3ca1f05855852fab6eb8a5 _targets/user/data/wgsrpd/level3/level3.shp 2022-12-07T07:52:05Z 7282396 200 NA NA hash://sha256/df202da01ea58a46c6f7a7674bd61f438fbef6c43f3ca1f05855852fab6eb8a5 NA NA
hash://sha256/10e51d931fc145d5c30f664c9e5dad62ae68f2cb79367e0ceb827a20e6884f64 http://sftp.kew.org/pub/data-repositories/WCVP/wcvp.zip 2022-12-07T22:16:07Z NA 200 NA NA hash://sha256/10e51d931fc145d5c30f664c9e5dad62ae68f2cb79367e0ceb827a20e6884f64 NA NA
hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a http://sftp.kew.org/pub/data-repositories/WCVP/wcvp_dwca.zip 2022-12-07T22:17:58Z NA 200 NA NA hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a NA NA
hash://sha256/d3772de1d07d2f3f093e4b0efdb14c854ce0ba212ee36cbadeb53a82c45e49de https://github.com/tdwg/wgsrpd/archive/refs/heads/master.zip 2022-12-08T01:59:27Z NA 200 NA NA hash://sha256/d3772de1d07d2f3f093e4b0efdb14c854ce0ba212ee36cbadeb53a82c45e49de NA NA
```
Full error trace:
```r
> file <- wcvp_hash %>%
+   contentid::resolve(registries = local_registry)
Error in basename(path) : a character vector argument expected

Enter a frame number, or 0 to exit

1: wcvp_hash %>% contentid::resolve(registries = local_registry)
2: contentid::resolve(., registries = local_registry)
3: valid_store_path(id, path)
4: basename(path)
```
When I try with your reprex, I get:
```
[1] NA
Warning message:
In contentid::resolve(id) :
  No sources found for hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a
```
ok cool. yeah, so this almost surely has something to do with the local sources not being there any more. But it should still return NA if it can't find any other sources for the same hash.
Side question really, but how do you generate your local sources anyway? I usually try to use store() or resolve(..., store = TRUE), which puts content in a local 'content store' -- i.e., stored by filepath = hash. That can obviously eat up disk space quickly, though we have a utility to purge older files. In your case it could also save you space -- it looks like you have identical hash objects at multiple locations on your disk.
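A sketch of that workflow, using the URL from your registry:

```r
library(contentid)

# store() fetches the content into the local content store (filed by hash)
# and returns its content identifier:
id <- store("http://sftp.kew.org/pub/data-repositories/WCVP/wcvp_dwca.zip")

# Later, resolve() returns a local path; store = TRUE caches the bytes in the
# content store whenever they had to be fetched from a remote source:
path <- resolve(id, store = TRUE)
```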
> but how do you generate your local sources anyway?

```r
register(file_path, "local_contentid.tsv")
```
The reason for identical hash objects at multiple locations is that I am using a targets plan where I want the user to have the option of using a _targets cache (folder) within the project (the default usage) or a shared cache located somewhere else (in this case, a Dropbox folder mounted at the root of a Docker container where I'm running the code). Kinda complicated, but I was hoping it would work: my understanding was that even with multiple identical hashes at different locations in the registry, contentid would just try each one until it finds one that works. Though I now realize that might not work so well if some are local and some are remote, since we would want to prioritize using local sources first, and I don't know if it is set up to do that.

(I was just going to use the URL, but since it kept failing, I thought the local files could act as a fall-back: the user could manually download them and put them in _targets/user/data, and contentid would find them there.)
Another question is why the download is failing so often in the first place. The files aren't particularly large (< 100 MB), and wget and curl in the terminal both seem to work fine 🤔
OK, I think I've figured out a bit more of what is going on... it looks like the remote file (http://sftp.kew.org/pub/data-repositories/WCVP/wcvp.zip) did change recently (Dec 8 - the same day I filed my issue! 🤯) and now has a different hash. So I think the reason the download did not work is that contentid was functioning correctly: it did not return the downloaded file path because the hash didn't match. But then, with no local copy to use in _targets/user/data, it gave the strange Error in basename(path) error because it tried every entry in the local registry, and the last thing it tried was a non-existent local file.
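(For reference, one way to confirm the mismatch is to hash a fresh download and compare it against the registered id; a sketch:)

```r
# Download a fresh copy and compute its content id:
tmp <- tempfile(fileext = ".zip")
curl::curl_download("http://sftp.kew.org/pub/data-repositories/WCVP/wcvp.zip", tmp)
contentid::content_id(tmp)
# If this no longer matches the registered
# hash://sha256/10e51d931fc145d5c30f664c9e5dad62ae68f2cb79367e0ceb827a20e6884f64,
# the file on the server has changed since it was registered.
```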
ok, very cool. yeah, there's definitely a bug somewhere in path winding up NULL instead of NA and thus throwing an error, though other than the patch in #88 I haven't figured out precisely how that happens.
I'm kinda on the fence about the idea of allowing register() to register local paths in place. In general, if you just swap store() for register(), you'll get an id back, but it will also copy the object to the content store. I suspect this will give you more robust behavior.
Like you say, contentid doesn't really care where the object is located, and is happy to try multiple sources, starting with local sources (since that is almost surely the fastest option) and then trying any URLs or official data repositories like Zenodo, Software Heritage, or DataONE. But of course the bytes have to exist somewhere, or it will come back NA. It also can't tell the difference between a truly local path and something that just looks local but is really a remotely mounted FTP, NFS, or similar filesystem. And obviously, eventually having a copy of the object in a permanent scientific archive is the most robust fall-back, though often also the slowest to access.
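For example, a single resolve() call can consult several registries at once; a sketch, assuming hash-archive.org (one of the default remote registries) is reachable:

```r
# resolve() checks each registered source for the hash in turn, returning the
# first one whose bytes actually match:
contentid::resolve(
  "hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a",
  registries = c("local_contentid.tsv", "https://hash-archive.org")
)
```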
So, take-aways:
- First, I'll still try to debug this case to avoid the awkward basename() error, though I think the best possible behavior would merely return NA and perhaps say 'no sources found matching hash'.
- Second, I'm curious whether you find store() to work better for you. If so, we could also alter the behavior of register() on a local path so that it also copies the file into the local content store? Keeping a registry of local paths that may later cease to exist seems unlikely to be all that useful.
Note that the location of the content store is still configurable and could technically use a Dropbox-type location, but in general I think it's better to treat the content store as a local cache and let each machine maintain its own, while always having at least one URL-based source in settings where you are working across multiple machines. (I guess that would feel more viable if the URL-based download didn't fail, but it sounds like that turned out to be a legitimate data change and not a random download failure, right?)
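If memory serves, the store location is read from the CONTENTID_HOME environment variable (see content_dir()); a sketch of pointing it at a shared mount (the path here is just an example):

```r
# Point the content store at a shared location for this session:
Sys.setenv(CONTENTID_HOME = "/mnt/dropbox/contentid-store")
contentid::content_dir()  # the directory contentid will now use for its store
```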
Sorry, this is hard to provide a reprex for, because I believe it depends on the behavior of the server from which I'm trying to download data. I encounter this error when using contentid::resolve(). I think it is because, for some reason, downloading from the URL doesn't work (times out, etc.). Then the path used for valid_store_path() is NULL and we get Error in basename(path) : a character vector argument expected. (I think the original download error is getting hidden by tryCatch() here.) So there should be some clearer error message generated if the failure is indeed due to an inability to download the file.
I encountered it with this registry (but like I said, sometimes it works and sometimes it doesn't; I think it has something to do with the server. It's FTP, and I don't know if that affects anything):
Call this "local_contentid.tsv":
Command that generates the error:
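```r
contentid::resolve(
  "hash://sha256/56112aa6f108db89b7c324f96b015d863d4b20c0dc201be256327f857052a08a",
  registries = "local_contentid.tsv"
)
```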