cboettig / contentid

:package: R package for working with Content Identifiers
http://cboettig.github.io/contentid

Registering URL fails for default registries #70

Closed noamross closed 3 years ago

noamross commented 3 years ago

I may be interpreting the expected behavior wrong, but registering and then querying a URL fails. It appears that registration fails in the local registry, and the subsequent lookup consults that same local registry; see below.

I note that if I change the registry to just registries = "https://hash-archive.org", registering works fine.
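For instance (using the hd_pop_geotiff_zip URL defined in the reprex below), a call of roughly this form succeeds:

``` r
# Registering against hash-archive.org only, bypassing the local tsv
# registry; hd_pop_geotiff_zip is the URL from the reprex below.
hash <- contentid::register(
  hd_pop_geotiff_zip,
  registries = "https://hash-archive.org"
)
```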

``` r
hd_pop_geotiff_zip = "https://data.humdata.org/dataset/81932f5a-4aa5-4c72-b085-3bde5fef349c/resource/51687557-2934-4d90-98a6-f604a856fdaa/download/population_cod_2018-10-01.zip"
hash <- contentid::register(hd_pop_geotiff_zip)
#> Warning: Client error: (403) Forbidden
#> Warning: https://data.humdata.org/dataset/81932f5a-4aa5-4c72-
#> b085-3bde5fef349c/resource/51687557-2934-4d90-98a6-f604a856fdaa/download/
#> population_cod_2018-10-01.zip had error code 403
contentid::query(hash)
#> [1] source date
#> <0 rows> (or 0-length row.names)
contentid::resolve(hash)
#> Warning in contentid::resolve(hash): No sources found for hash://sha256/
#> d9d414905e7770c762f32af6938ccbf166ddf0fa3daaa2eae15fff05f8cc0408
#> [1] NA
contentid:::default_tsv()
#> [1] "content_id/registry.tsv"
cat(readLines(contentid:::default_tsv()), sep = "\n")
#> identifier   source  date    size    status  md5 sha1    sha256  sha384  sha512
#> NA   NA  2021-04-17T17:07:57Z    NA  404 NA  NA  NA  NA  NA
#> NA   https://data.humdata.org/dataset/81932f5a-4aa5-4c72-b085-3bde5fef349c/resource/51687557-2934-4d90-98a6-f604a856fdaa/download/population_cod_2018-10-01.zip  2021-04-17T17:07:57Z    NA  404 NA  NA  NA  NA  NA
```

Created on 2021-04-17 by the reprex package (v2.0.0)

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting  value
#> version  R version 4.0.5 (2021-03-31)
#> os       macOS Big Sur 10.16
#> system   x86_64, darwin17.0
#> ui       X11
#> language (EN)
#> collate  en_US.UTF-8
#> ctype    en_US.UTF-8
#> tz       America/New_York
#> date     2021-04-17
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package     * version date       lib source
#> askpass       1.1     2019-01-13 [1] CRAN (R 4.0.2)
#> backports     1.2.1   2020-12-09 [1] CRAN (R 4.0.2)
#> bit           4.0.4   2020-08-04 [1] CRAN (R 4.0.2)
#> bit64         4.0.5   2020-08-30 [1] CRAN (R 4.0.2)
#> cli           2.4.0   2021-04-05 [1] CRAN (R 4.0.5)
#> contentid     0.0.10  2021-04-16 [1] Github (cboettig/contentid@5591307)
#> crayon        1.4.1   2021-02-08 [1] CRAN (R 4.0.2)
#> curl          4.3     2019-12-02 [1] CRAN (R 4.0.1)
#> digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
#> ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.1)
#> fansi         0.4.2   2021-01-15 [1] CRAN (R 4.0.2)
#> fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
#> glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
#> highr         0.8     2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
#> httr          1.4.2   2020-07-20 [1] CRAN (R 4.0.2)
#> jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.0.3)
#> knitr         1.32    2021-04-14 [1] CRAN (R 4.0.5)
#> lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.0.2)
#> magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
#> openssl       1.4.3   2020-09-18 [1] CRAN (R 4.0.2)
#> pillar        1.6.0   2021-04-13 [1] CRAN (R 4.0.5)
#> pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.2)
#> purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.2)
#> R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
#> reprex        2.0.0   2021-04-02 [1] CRAN (R 4.0.5)
#> rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.2)
#> rmarkdown     2.7     2021-02-19 [1] CRAN (R 4.0.2)
#> sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
#> stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
#> stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
#> styler        1.4.1   2021-03-30 [1] CRAN (R 4.0.2)
#> tibble        3.1.0   2021-02-25 [1] CRAN (R 4.0.2)
#> tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.2)
#> utf8          1.2.1   2021-03-12 [1] CRAN (R 4.0.2)
#> vctrs         0.3.7   2021-03-29 [1] CRAN (R 4.0.2)
#> vroom         1.4.0   2021-02-01 [1] CRAN (R 4.0.2)
#> withr         2.4.1   2021-01-26 [1] CRAN (R 4.0.2)
#> xfun          0.22    2021-03-11 [1] CRAN (R 4.0.4)
#> yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
```

cboettig commented 3 years ago

Thanks for reporting. Try now?

A while ago hash-archive.org was getting hammered, which was causing the server to time out on new register() requests, so I switched the default over to my own server. It looks like an attempt to build all versions of our full Rocker stack recently filled up my disk, leading to some unexpected server errors.

Either way, this obviously points to a level of fragility in the hash-archive-style registries, which is perhaps not wholly surprising.

There's a second issue you point out: register() should still be registering in your default tsv registry, but it looks like the NA from hash-archive is somehow being returned instead of the local hash. I'll look into that!

cboettig commented 3 years ago

@noamross

For some reason, httr tells me that a HEAD request to your link returns a 403 error:

``` r
httr::status_code(httr::HEAD("https://data.humdata.org/dataset/81932f5a-4aa5-4c72-b085-3bde5fef349c/resource/51687557-2934-4d90-98a6-f604a856fdaa/download/population_cod_2018-10-01.zip"))
#> [1] 403
```

Created on 2021-04-19 by the reprex package (v1.0.0)

(You can confirm this on the command line with `curl -I -L`.)

This is weird, because curl_download() etc. work fine on the URL. I'm not sure what is up, but it suggests something about your S3 bucket configuration, perhaps? Anyway, that's why the tsv registry you show above just lists the URL with status 404 instead of successfully computing the identifier locally.

I'm not sure what the right solution is here. Currently content_id() checks the status code of a HEAD request to the URL before attempting to stream the file, either to download it or to hash-and-discard (streaming lets you register large files without consuming disk space). We do this because URLs can obviously throw any number of errors, and we'd rather record the error and move on than get stuck midway through streaming a potentially large file.
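In rough outline, the check-then-stream pattern looks something like this (a minimal sketch of the idea, not the package's actual code; the stream_hash name and the error handling are illustrative):

``` r
# Sketch of check-then-stream: probe the URL's status first, then hash the
# body as a stream so the file never has to touch disk.
library(curl)
library(openssl)

stream_hash <- function(url) {
  probe <- curl_fetch_memory(url, handle = new_handle(nobody = TRUE))
  if (probe$status_code >= 400) {
    warning(sprintf("%s had error code %s", url, probe$status_code))
    return(NA_character_)
  }
  con <- curl(url, open = "rb")
  on.exit(close(con))
  digest <- sha256(con)  # openssl hashes connections in streaming fashion
  paste0("hash://sha256/", paste(as.character(digest), collapse = ""))
}
```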

On the other hand, it's obviously annoying to fail here when we could have registered successfully had we just jumped in and tried to stream the file.

Also, we record this as a 404 when it is clearly a 403. That's obviously wrong, but in extensive testing last year I discovered that some data repositories return very unexpected, non-standard status codes (negative integers, text, curl errors), so rather than parse these, the status currently just defaults to 404. That should be fixed to record the right error code when available, and to handle the non-standard errors separately.
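One possible shape for that fix (a hypothetical helper, not code from the package): keep any value that parses as a valid HTTP status, and fall back to 404 only for the genuinely non-standard responses.

``` r
# Hypothetical helper: preserve real HTTP status codes; fall back to 404
# only for non-standard values (negative integers, text, curl errors).
normalize_status <- function(status) {
  code <- suppressWarnings(as.integer(status))
  if (length(code) != 1 || is.na(code) || code < 100 || code > 599) {
    return(404L)
  }
  code
}

normalize_status(403)     #> 403
normalize_status("oops")  #> 404
```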

Any thoughts you have on the best option here would be great.

noamross commented 3 years ago

This isn't my data and I don't know what the back-end hosting is, but in my experience lots of web servers and services do not support HEAD (I think this includes S3). However, you can get the header by using GET and not downloading the body with:

``` r
curl::curl_fetch_memory(url, handle = curl::new_handle(nobody = TRUE))
```

noamross commented 3 years ago

Ah, not quite, you need:

``` r
curl::curl_fetch_memory(url, handle = curl::new_handle(nobody = TRUE, customrequest = "GET"))
```

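For what it's worth, a quick usage sketch (the URL here is a hypothetical placeholder): the probe returns the response metadata, so the status code is available without transferring the body.

``` r
# Hypothetical placeholder URL; only headers come back, not the body.
url <- "https://example.org/data.zip"
res <- curl::curl_fetch_memory(
  url,
  handle = curl::new_handle(nobody = TRUE, customrequest = "GET")
)
res$status_code          # HTTP status of the GET request
length(res$content) == 0 # typically TRUE: no body was downloaded
```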
cboettig commented 3 years ago

@noamross thanks! Avoiding the HEAD request this way makes sense; really appreciate it.

I haven't had trouble with S3 registration before, e.g. with my MinIO server or with GitHub-hosted assets (which route to S3 as well), so I'm not ruling out a config issue here, but it's still a good point that some services lack a HEAD method.

I'll still rely on curl returning a good status code; the trick is to avoid returning the content hash of an error message when the server throws a 403 (or some other error). I'll do some more testing with this.

jhpoelen commented 3 years ago

@noamross @cboettig great to see these real-world issues with content registration! I think that reporting and teasing out these issues will lead to a more robust registration process. One thought that came to mind is to create some sort of "pool" of content registries with client-side failover, so that content can still be registered if one (of many) content registries happens to be unresponsive or otherwise unavailable.
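A minimal sketch of that failover idea, assuming register() errors (or returns NA) when a registry is unavailable; the function name and the second registry URL here are hypothetical:

``` r
# Hypothetical client-side failover across a pool of registries:
# try each registry in turn, return the first successful registration.
register_with_failover <- function(url, registries) {
  for (reg in registries) {
    hash <- tryCatch(
      contentid::register(url, registries = reg),
      error = function(e) NA_character_
    )
    if (length(hash) == 1 && !is.na(hash)) {
      return(hash)
    }
  }
  warning("All registries failed for: ", url)
  NA_character_
}

# e.g.
# register_with_failover(url, c("https://hash-archive.org",
#                               "https://example-registry.org"))
```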

jhpoelen commented 3 years ago

Another part of this would be for registries to act as mirrors of one another, or to somehow synchronize or import each other's content registration entries.

cboettig commented 3 years ago

@jhpoelen definitely agreed! The hash-archive backend stores its data in LevelDB (https://github.com/btrask/hash-archive/blob/master/src/db.c#L25); unfortunately I've never figured out a convenient way to read data out of the LevelDB dump. hash-archive used to provide SQL dumps, but no longer does: https://github.com/btrask/hash-archive/issues/1

If we had a utility that could read out and sync across the LevelDB files, we could at least sync up self-hosted versions. Ideally this would be a feature added to the hash-archive source code, but the C code there is well beyond my skill.