OCHA-DAP / hdx-signals

HDX Signals
https://un-ocha-centre-for-humanitarian.gitbook.io/hdx-signals/
GNU General Public License v3.0

Validation of links and sources #72

Open caldwellst opened 6 months ago

caldwellst commented 6 months ago

When generating campaign content, we curate a variety of links for shocks. These are:

We should be able to simply use httr2 to perform a HEAD request, I think, but my understanding of all of this is relatively limited, so it would be good to explore this when implementing. Something like the below:

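# Returns TRUE if a HEAD request to `url` responds with HTTP 200, FALSE otherwise
url_valid <- function(url) {
  tryCatch(
    {
      resp_status <- httr2::request(url) |>
        httr2::req_method("HEAD") |>
        httr2::req_perform() |>
        httr2::resp_status()

      resp_status == 200
    },
    # req_perform() raises an error for 4xx/5xx responses and for
    # connection failures, so all of those are treated as invalid
    error = \(e) FALSE
  )
}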

url_valid("a")
url_valid("https://data.humdata.org")
url_valid("https://data.humdatas.org")

This would be used in generate_campaign_content.R for validation, but also could potentially be used when curating URLs. For instance, IDMC often has 5+ sources that we are choosing from, so this could be a useful filter for only selecting valid URLs.
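As a rough illustration of the filtering idea above, here is a minimal sketch. The helper name `keep_valid_urls()` and its `validator` argument are hypothetical and do not exist in the repository; the validator is passed in explicitly so that `url_valid()` from above, or any stricter or more lenient check, can be plugged in.

```r
# Hypothetical helper (not part of hdx-signals): keep only the URLs
# for which `validator` returns TRUE.
# `urls` is a character vector; `validator` is a function that takes a
# single URL and returns TRUE or FALSE.
keep_valid_urls <- function(urls, validator) {
  # vapply() enforces exactly one logical value per URL
  is_valid <- vapply(urls, validator, logical(1), USE.NAMES = FALSE)
  urls[is_valid]
}
```

For the IDMC case, this might be called as `keep_valid_urls(source_urls, url_valid)`, where `source_urls` is an assumed name for the vector of candidate links.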

hannahker commented 6 months ago

@caldwellst tested this function and it seems to be working well! Not super sure where to apply it in generate_campaign_content though.

caldwellst commented 6 months ago

Nice! Maybe in generate_info() we validate all the URLs there after generation? Then we can also use it to filter out URLs in IDMC. I don't think we would need to validate the URLs returned from Mailchimp in the other functions for the plots or campaigns, but what do you think?

caldwellst commented 6 months ago

I think this is going to be very difficult to implement in practice. I haven't explored how to get around these issues, but I have encountered the following on sites that are perfectly valid in browsers:

# 401 authorization errors
url_valid("https://www.reuters.com/world/africa/hunger-grips-southern-africa-zimbabwe-declares-drought-disaster-2024-04-03/")

# 403 forbidden errors
url_valid("https://www.herald.co.zw/locusts-destroy-nearly-8-000-hectares-of-crops/")

# 404 not found errors, that still redirect correctly
url_valid("https://www.wmolc.org/seasonPmmeUI/view?winName=PlotView1638350381355")

# SSL certificate errors, often for PDFs that require you to manually say you trust the site
url_valid("http://www.fsnau.org/downloads/FEWS-NET-FSNAU-Somalia-Food-Security-Outlook-Report-for-June-2019-to-Jan-2020.pdf")
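To make cases like the ones above easier to tell apart, one option is to return the status code instead of collapsing everything to FALSE. The sketch below is only an assumption about how that could look: `url_status()` and `classify_status()` are hypothetical names, and the category labels are illustrative. It does not make the blocked sites reachable; it only separates "the server refused a script" from "the link is broken".

```r
# Hypothetical helper: return the HTTP status of a HEAD request,
# or NA when no response is received at all.
url_status <- function(url, timeout = 10) {
  tryCatch(
    httr2::request(url) |>
      httr2::req_method("HEAD") |>
      httr2::req_timeout(timeout) |>
      # Do not raise on 4xx/5xx, so the status code can be inspected
      httr2::req_error(is_error = \(resp) FALSE) |>
      httr2::req_perform() |>
      httr2::resp_status(),
    # Connection-level failures (DNS, SSL, malformed URL) have no status
    error = \(e) NA_integer_
  )
}

# Hypothetical policy: map a status code to a coarse category.
classify_status <- function(status) {
  if (is.na(status)) {
    "unreachable"   # no HTTP response, e.g. SSL certificate errors
  } else if (status >= 200 && status < 400) {
    "valid"
  } else if (status %in% c(401, 403)) {
    "unverifiable"  # server refused the request; may still work in a browser
  } else {
    "invalid"
  }
}
```

With these, `classify_status(url_status(url))` would flag the Reuters and Herald links as "unverifiable" rather than invalid, assuming they respond as described above. The 404 case would still be classed as "invalid" even though the page loads in a browser.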

There are numerous instances of this in the JRC ASAP data. Given the other issues with their URL formatting, we will at least need them to format URLs consistently within href tags before we include them in our data.

caldwellst commented 5 months ago

Definitely not possible in any easy way, so this will not be part of the public release. Related to #104, where we face similar restrictions accessing links from GitHub Actions.