Closed: noamross closed this issue 3 years ago
Thanks for reporting, try now?
A while ago hash-archive.org was getting hammered, which was causing the server to time out on new `register()` requests, so I switched the default over to my server. It looks like attempting to make all versions of our full rocker stack filled up my disk recently, leading to some unexpected server errors.
Either way, this points to a level of fragility around the hash-archive-style registries, perhaps not wholly surprising.
There's a second issue you point out: `register()` should still be registering in your default tsv registry, but it looks like the NA from the hash-archive is somehow getting returned instead of the local hash, so I'll look into that!
@noamross For some reason, `httr` tells me that a HEAD request to your link returns a 403 error:
httr::status_code(httr::HEAD("https://data.humdata.org/dataset/81932f5a-4aa5-4c72-b085-3bde5fef349c/resource/51687557-2934-4d90-98a6-f604a856fdaa/download/population_cod_2018-10-01.zip"))
#> [1] 403
Created on 2021-04-19 by the reprex package (v1.0.0)
(You can confirm this on the command line by running `curl -I -L` against the same URL.)
This is weird, because `curl_download()` etc. work fine on the URL, so I'm not sure what is up, but it suggests something about your S3 bucket configuration, perhaps? Anyway, that's why the tsv registry you show above just lists the URL with status 404 instead of successfully computing the identifier locally.
I'm not sure what the right solution is here. Currently `content_id()` checks the status code of the header of the URL before attempting to stream the file to download or hash-and-discard (streaming lets you register large files without consuming disk space). We do this because obviously URLs can throw any number of errors, and we'd rather record the error and move on than get stuck midway through streaming a potentially large file.
On the other hand, it's obviously annoying to fail here when we could have registered successfully had we just jumped in and tried to stream the file.
Also, we record this as a 404 when clearly it is a 403. That's obviously wrong, but in extensive testing last year I discovered that some data repositories sometimes give back very unexpected/non-standard status codes (negative integers, text, curl errors), so rather than parse these, the status just defaults to 404. That should be fixed to give the right error code when available and then handle the non-standard errors separately.
Any thoughts you have on the best option here would be great.
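One way the status handling described above could be separated is sketched here. This is not contentid's actual code: `normalize_status()` is a hypothetical helper, and the 100 to 599 range check is an assumption about what counts as a standard HTTP code. It keeps a real code such as 403 as-is and reports negative integers, text, and curl errors as a separate problem instead of folding them into 404.

```r
# Hypothetical helper (not part of contentid): classify whatever a status
# check returned, without collapsing everything unexpected into 404.
normalize_status <- function(x) {
  # A curl failure caught with tryCatch() arrives as an R condition object.
  if (inherits(x, "condition")) {
    return(list(status = NA_integer_, problem = conditionMessage(x)))
  }
  # Text such as "teapot" coerces to NA; the warning is expected here.
  code <- suppressWarnings(as.integer(x))
  # Assumption: anything outside 100..599 is treated as non-standard.
  ok <- length(code) == 1 && !is.na(code) && code >= 100L && code <= 599L
  if (ok) {
    list(status = code, problem = NA_character_)
  } else {
    list(
      status = NA_integer_,
      problem = paste("non-standard status:", paste(format(x), collapse = " "))
    )
  }
}

normalize_status(403L)      # keeps the real code
normalize_status(-1L)       # reported as non-standard
normalize_status("teapot")  # reported as non-standard
```

Keeping `status` and `problem` as separate fields means a registry row can still record the true code when there is one, while odd responses remain visible for debugging.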
This isn't my data and I don't know what the back-end hosting is, but in my experience lots of web servers and services do not support `HEAD` (I think this includes S3). However, you can get the header by using `GET` and not downloading the body with `curl::curl_fetch_memory(url, handle = curl::new_handle(nobody = TRUE))`
Ah, not quite, you need:
curl::curl_fetch_memory(url, handle = curl::new_handle(nobody = TRUE, customrequest = "GET"))
@noamross thanks! avoiding the HEAD request this way makes sense, really appreciate it.
Haven't had trouble with S3 registration before, e.g. with my MINIO server, or with GitHub-hosted assets which route to S3 as well, so not ruling out a config issue here, but still a good point about some services lacking a HEAD method.
Will still rely on curl returning a good status code; the trick is to avoid returning the content hash of an error message when the server throws a 403 (or some other error). Will do some more testing with this.
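A rough sketch of that guard follows, under stated assumptions: `register_if_ok()`, `probe`, and `hasher` are hypothetical names, not contentid functions, and the probe and hasher are passed in so the logic can be tried without a network. `probe_get()` wraps the `curl` call suggested earlier in this thread.

```r
# Probe using the call suggested above: a GET request with the body skipped.
probe_get <- function(url) {
  h <- curl::new_handle(nobody = TRUE, customrequest = "GET")
  curl::curl_fetch_memory(url, handle = h)$status_code
}

# Hypothetical sketch: only hash the content when the probe reports success.
register_if_ok <- function(url, probe, hasher) {
  # A curl error (DNS failure, timeout, ...) is recorded as a missing status.
  status <- tryCatch(probe(url), error = function(e) NA_integer_)
  if (is.na(status) || status >= 400L) {
    # Record the failure; never hash the body of an error page.
    return(list(source = url, identifier = NA_character_, status = status))
  }
  list(source = url, identifier = hasher(url), status = status)
}

# Offline usage with stand-in functions; a real call would pass probe_get
# and a streaming hash function instead.
register_if_ok("https://example.org/data.zip",
               probe = function(u) 403L,
               hasher = function(u) "hash://sha256/not-computed")
```

Because the hasher is only called after a successful probe, a 403 response yields an `NA` identifier with the status preserved, rather than the hash of an error message.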
@noamross @cboettig great to see these real world issues with content registration! I think that reporting and teasing out these issues will work towards a more robust registration process. One thought that came to mind is to create some sort of "pool" of content registries with a client-side fail-over, so that content can still be registered if one (of many) content registries happens to be unresponsive or otherwise unavailable.
Another part of this would be that registries could act as registry mirrors or somehow synchronize or import content registration entries.
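To make the fail-over idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `register_with_failover()` is a hypothetical function, and each registry is represented by a plain R function that either returns an identifier, returns `NA`, or throws an error.

```r
# Hypothetical sketch of client-side fail-over across a pool of registries.
# `registrars` is a named list of functions, tried in order.
register_with_failover <- function(url, registrars) {
  failures <- character()
  for (name in names(registrars)) {
    result <- tryCatch(registrars[[name]](url), error = function(e) e)
    if (inherits(result, "error")) {
      # Registry unreachable or erroring: note why and try the next one.
      failures[name] <- conditionMessage(result)
      next
    }
    if (length(result) == 1 && !is.na(result)) {
      return(list(identifier = result, registry = name, failures = failures))
    }
    # Registry answered but gave no identifier (e.g. an NA).
    failures[name] <- "no identifier returned"
  }
  list(identifier = NA_character_, registry = NA_character_, failures = failures)
}

# Stand-in registries: one times out, one returns NA, one succeeds.
pool <- list(
  down  = function(u) stop("timed out"),
  empty = function(u) NA_character_,
  up    = function(u) "hash://sha256/abc"
)
register_with_failover("https://example.org/data.csv", pool)
```

Returning the collected `failures` alongside the result keeps the unresponsive registries visible, which would also help decide which entries need to be synchronized later.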
@jhpoelen definitely agreed! hash-archive backend storage is written in LevelDB, https://github.com/btrask/hash-archive/blob/master/src/db.c#L25, unfortunately I've never figured out a convenient way to read data out of the LevelDB dump. hash-archive used to provide SQL dumps, but no longer does so: https://github.com/btrask/hash-archive/issues/1
If we had a utility that could read out and sync across the LevelDB files we could at least sync up self-hosted versions. Ideally this would be a feature added to hash-archive source code though, but the C code there is well beyond my skill.
I may be interpreting the expected behavior wrong, but registering and then querying a URL fails. It appears this is because registering fails in the local registry and lookup then tries the local registry; see below.
I note that if I change the registry to just `registries = "https://hash-archive.org"`, registering works fine.

Created on 2021-04-17 by the reprex package (v2.0.0)
Session info
``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.0.5 (2021-03-31)
#>  os       macOS Big Sur 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2021-04-17
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source
#>  askpass       1.1     2019-01-13 [1] CRAN (R 4.0.2)
#>  backports     1.2.1   2020-12-09 [1] CRAN (R 4.0.2)
#>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.0.2)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.0.2)
#>  cli           2.4.0   2021-04-05 [1] CRAN (R 4.0.5)
#>  contentid     0.0.10  2021-04-16 [1] Github (cboettig/contentid@5591307)
#>  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.0.2)
#>  curl          4.3     2019-12-02 [1] CRAN (R 4.0.1)
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
#>  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.2)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.1)
#>  fansi         0.4.2   2021-01-15 [1] CRAN (R 4.0.2)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
#>  highr         0.8     2019-03-20 [1] CRAN (R 4.0.2)
#>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
#>  httr          1.4.2   2020-07-20 [1] CRAN (R 4.0.2)
#>  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.0.3)
#>  knitr         1.32    2021-04-14 [1] CRAN (R 4.0.5)
#>  lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.0.2)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
#>  openssl       1.4.3   2020-09-18 [1] CRAN (R 4.0.2)
#>  pillar        1.6.0   2021-04-13 [1] CRAN (R 4.0.5)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.2)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.2)
#>  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
#>  reprex        2.0.0   2021-04-02 [1] CRAN (R 4.0.5)
#>  rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.2)
#>  rmarkdown     2.7     2021-02-19 [1] CRAN (R 4.0.2)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
#>  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
#>  styler        1.4.1   2021-03-30 [1] CRAN (R 4.0.2)
#>  tibble        3.1.0   2021-02-25 [1] CRAN (R 4.0.2)
#>  tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.2)
#>  utf8          1.2.1   2021-03-12 [1] CRAN (R 4.0.2)
#>  vctrs         0.3.7   2021-03-29 [1] CRAN (R 4.0.2)
#>  vroom         1.4.0   2021-02-01 [1] CRAN (R 4.0.2)
#>  withr         2.4.1   2021-01-26 [1] CRAN (R 4.0.2)
#>  xfun          0.22    2021-03-11 [1] CRAN (R 4.0.4)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
```