UrbanInstitute / nccs

NCCS data platform powered by Jekyll
https://urbaninstitute.github.io/nccs/

make_archive_urls test for valid URL fails #11

Open lecy opened 6 months ago

lecy commented 6 months ago

In the make_archive_urls() function in build-catalog-functions.R, the test for a valid URL is failing.

For example,

x <- "https://urbaninstitute.github.io/nccs-legacy/dictionary/soi/soi_archive_html/SOI-MICRODATA-2002-501C3-CHARITIES-PC"
(RCurl::url.exists(x))
[1] FALSE

The URL works fine:

https://urbaninstitute.github.io/nccs-legacy/dictionary/soi/soi_archive_html/SOI-MICRODATA-2002-501C3-CHARITIES-PC

Any ideas?
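One way to narrow this down might be to look at the raw HTTP status the request actually returns, rather than the TRUE/FALSE from RCurl::url.exists(). A minimal diagnostic sketch using base R's curlGetHeaders() (from utils, so no extra packages; the 10-second timeout is an arbitrary choice):

```r
# Inspect the HTTP status behind the boolean that url.exists() returns.
x <- "https://urbaninstitute.github.io/nccs-legacy/dictionary/soi/soi_archive_html/SOI-MICRODATA-2002-501C3-CHARITIES-PC"

h <- tryCatch(curlGetHeaders(x, timeout = 10), error = function(e) e)

if (inherits(h, "error")) {
  # A timeout or DNS failure lands here, even if the URL is valid.
  message("Request failed: ", conditionMessage(h))
} else {
  # A 200 status means the page exists; url.exists() should agree.
  message("HTTP status: ", attr(h, "status"))
}
```

If this reports a 200 while url.exists() returns FALSE, the problem is in how RCurl issues or interprets the request (or a transient server issue), not in the URL itself.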

lecy commented 6 months ago

Here's some reproducible code to test with the SOI dataset. It is currently returning https://urbaninstitute.github.io/nccs/catalogs/dd_unavailable.html for everything:

library( dplyr )
library( knitr )
library( kableExtra )
library( stringr )
library( flextable )
library( pander )

GH.RAW <- "https://raw.githubusercontent.com/UrbanInstitute/nccs/main/catalogs/"
d <- read.csv( paste0( GH.RAW, "AWS-NCCSDATA.csv" ) )
source( paste0( GH.RAW, "build-catalog-functions.R" ) )

series <- "soi"

paths <- get_file_paths(series = "soi",
                        paths = d$Key,
                        tscope = "CHARITIES",
                        fscope = "PC" )

profile_urls <- make_archive_urls( series = "soi", paths = paths )  

make_archive_urls <- function(series,
                              paths){

  base_url = sprintf("https://urbaninstitute.github.io/nccs-legacy/dictionary/%s/%s_archive_html/",
                     series,
                     series)

  expr_dic = list("core" = "legacy/core/",
                  "bmf" = "legacy/bmf/",
                  "misc" = "legacy/misc/",
                  "soi" = "legacy/soi-micro/[0-9]{4}/")

  unavail_url <- "https://urbaninstitute.github.io/nccs/catalogs/dd_unavailable.html"

  matches <- gsub(expr_dic[[series]], "", paths)
  matches <- gsub("\\.csv", "", matches)

  archive_urls <- paste0(base_url, matches)
  archive_urls <- lapply(archive_urls, 
                         function(x) if (RCurl::url.exists(x)) x else unavail_url)

  return(archive_urls)  
}
lecy commented 5 months ago

For the time being, I just commented out the validation line:

  # archive_urls <- lapply(archive_urls, 
  #                        function(x) if (RCurl::url.exists(x)) x else unavail_url)

Worst case, the user gets a 404 instead of a "dictionary unavailable" message. Will look into an alternative URL validation function.
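One possible alternative, sketched with the httr package (an assumption, not currently a dependency here): retry the HEAD request a few times, and "fail open" on any network error so a transient hiccup keeps the link rather than dropping a valid dictionary page:

```r
# Sketch of a more tolerant URL check (assumes the httr package is available).
# Retries the HEAD request and, on any network error, keeps the link
# ("fail open") instead of replacing it with the unavailable page.
url_ok <- function(url, times = 3) {
  tryCatch({
    resp <- httr::RETRY("HEAD", url, times = times, quiet = TRUE)
    httr::status_code(resp) < 400
  },
  error = function(e) TRUE)  # transient failure: assume the URL is fine
}

# Hypothetical drop-in use inside make_archive_urls():
# archive_urls <- lapply(archive_urls,
#                        function(x) if (url_ok(x)) x else unavail_url)
```

With this, only a definitive 4xx/5xx response swaps in the dd_unavailable.html link; a slow or flaky server just leaves the original URL in place.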

lecy commented 5 months ago

I saw your note that you could not replicate the behavior. Same here when I retry the same example:

> x <- "https://urbaninstitute.github.io/nccs-legacy/dictionary/soi/soi_archive_html/SOI-MICRODATA-2002-501C3-CHARITIES-PC"
> (RCurl::url.exists(x))
[1] TRUE

It could have just been a slow server, or perhaps those pages are generated dynamically on request, which introduces a delay. Whatever the case, there are many instances where the RCurl check fails even though the URLs are actually valid.

Unless we have a function we can trust, it's probably better not to remove the links when the test fails, because doing so produces the kind of file the user mentioned: none of the data dictionary buttons had associated URLs on the download page for the SOI Microdata files (all of the valid ones were dropped when the file was rendered).

If the URL is added but does not actually exist, the user just gets a 404. That seems like the lesser of the two problems.