Open caldwellst opened 6 months ago
@caldwellst tested this function and it seems to be working well! Not super sure where to apply it in `generate_campaign_content` though.
Nice! Maybe in `generate_info()` we validate all the URLs there after generation? Then we can also use it to filter out URLs in IDMC. I don't think we would need to validate the URLs returned from Mailchimp in the other functions for the plots or campaigns, but what do you think?
I think this is going to be very difficult to implement in practice. I haven't explored how to bypass or work around these issues, but I have encountered them on sites that load perfectly well in a browser:
```r
# 401 authorization errors
url_valid("https://www.reuters.com/world/africa/hunger-grips-southern-africa-zimbabwe-declares-drought-disaster-2024-04-03/")

# 403 forbidden errors
url_valid("https://www.herald.co.zw/locusts-destroy-nearly-8-000-hectares-of-crops/")

# 404 not found errors, that still redirect correctly
url_valid("https://www.wmolc.org/seasonPmmeUI/view?winName=PlotView1638350381355")

# SSL certificate errors, often for PDFs that require you to manually say you trust the site
url_valid("http://www.fsnau.org/downloads/FEWS-NET-FSNAU-Somalia-Food-Security-Outlook-Report-for-June-2019-to-Jan-2020.pdf")
```
There are numerous instances of this in the JRC ASAP data and, given other issues with their URL formatting, we will at least need them to consistently format URLs within `href` tags prior to inclusion in our data.
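One way to reason about the 401 and 403 cases above is to separate "the server refused an automated request" from "the link is dead". The sketch below is only an illustration of that idea and has not been run against these sites. `classify_status()` and its labels are hypothetical names, and treating 401 and 403 as "blocked" is an assumption drawn from the examples in this thread. It does not help with the 404 case that still loads in a browser, and SSL failures never produce a status code at all, so they are represented here as `NA`.

```r
# Hypothetical helper: map an HTTP status code to a coarse label.
# `status` is NA when no response was received (e.g. SSL or connection error).
# `blocked` is an assumption: codes that often mean "refused automated access".
classify_status <- function(status, blocked = c(401L, 403L)) {
  ifelse(
    is.na(status), "unreachable",
    ifelse(
      status < 400, "ok",
      ifelse(status %in% blocked, "blocked", "broken")
    )
  )
}

classify_status(c(200, 403, 404, NA))
# "ok" "blocked" "broken" "unreachable"
```

With something like this, "blocked" links could be kept but flagged for a manual check rather than dropped, which may matter for sources such as Reuters that reject non-browser requests.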
Definitely not possible in any easy way, so this will not be part of the public release. Related to #104, where we face similar restrictions accessing links from GitHub Actions.
When generating campaign content, we curate a variety of links for shocks. These are:

- `hdx_url`: a link to the HDX dataset for the indicator
- `source_url`: the main URL provided as a source for the signals
- `other_urls`: a concatenated string of all other URLs used in the campaign. For instance, for IDMC, these are the 3 most recent source documents cited in their data

We should be able to simply use `httr2` to perform a `HEAD` request, I think, but my understanding of all of this is relatively limited, so it would be good to explore when implementing.

This would be used in `generate_campaign_content.R` for validation, but could also potentially be used when curating URLs. For instance, IDMC often has 5+ sources that we are choosing from, so this could be a useful filter for only selecting valid URLs.
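A minimal sketch of the `HEAD`-request idea, assuming `httr2` is installed. This is not the function from the repository: the body of `url_valid()`, the `timeout` argument, and the "status below 400 counts as valid" rule are all assumptions made for illustration.

```r
library(httr2)

# Hypothetical implementation: TRUE if a HEAD request to `url` completes
# with a status code below 400, FALSE otherwise.
url_valid <- function(url, timeout = 10) {
  # Reject anything that is not a single non-empty string
  if (!is.character(url) || length(url) != 1 || is.na(url) || !nzchar(url)) {
    return(FALSE)
  }

  resp <- tryCatch(
    {
      request(url) |>
        req_method("HEAD") |>
        req_timeout(timeout) |>
        # Return 4xx/5xx responses instead of turning them into R errors
        req_error(is_error = function(resp) FALSE) |>
        req_perform()
    },
    # Connection-level failures (DNS, SSL, timeout) still raise, so catch them
    error = function(e) NULL
  )

  !is.null(resp) && resp_status(resp) < 400
}
```

For the curation use case, a character vector of candidate links could be filtered with `urls[vapply(urls, url_valid, logical(1))]`, keeping only those that pass the check.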