Open arunge opened 4 months ago
isodb specs:
CPU: 4x Single Core (4-Die): AMD EPYC 7443 type: MCM SMP speed: 2850 MHz
Drives: Local Storage: total: 550.00 GiB used: 131.67 GiB (23.9%)
Memory: 23.44 GiB used: 3.75 GiB (16.0%)
@SarahWagner
tidyRSS::tidyfeed()
Use the `tidyfeed()` function from the tidyRSS R package to read RSS sources.
# Feed CNN
feed1 <- tidyRSS::tidyfeed("http://rss.cnn.com/rss/edition_world.rss")
# Extract a link from the feed, here the first (tidyfeed stores item links in the `item_link` column)
link <- feed1[["item_link"]][[1]]
# Fetch the content of the link
page <- httr::GET(link, httr::timeout(30))
page_content <- httr::content(page, as = "text")
# Parse the HTML content
html <- rvest::read_html(page_content)
# Extract the main text content
# You might need to adjust the CSS selector based on the specific structure of the web page
library(magrittr) # provides %>%, which the rvest:: prefix alone does not attach
text <- html %>%
  rvest::html_nodes("p") %>% # Adjust this selector to target the main content
  rvest::html_text() %>%
  stringr::str_squish() %>%
  paste(collapse = " ") # Combine all paragraph texts into a single string
To set up the configuration, a configuration R file is used to:
- List the RSS sources to be checked
- Map the fields from each RSS source to the database table
We discussed defining a YAML file, which might look as follows:
inst/config.yaml
# RSS sources
source:
  Example:
    url: "https://www.example.com/feed"
    name: "Example Feed"
    description: "This is an example feed"
    category: "Example"
# Mapping for all columns of the target table on the database
# Please, add columns as needed (e.g. `date_last_updated`, `id`, ...)!
fieldMapping:
  Example:
    source: "feed_title"
    title: "item_title"
    link: "item_link"
    text: "item_text"
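As a rough sketch of how a `fieldMapping` entry like the one above could be applied in R: the helper `apply_field_mapping()` is hypothetical (it is not part of tidyRSS), and the stand-in feed below uses invented values so the example runs offline. In practice the mapping would presumably be read from the file, e.g. with `yaml::read_yaml("inst/config.yaml")`, and the feed would come from `tidyRSS::tidyfeed()`.

```r
# Hypothetical helper: select the mapped feed columns and rename them to the
# database column names. `mapping` is a named list whose names are database
# columns and whose values are feed columns.
apply_field_mapping <- function(feed, mapping) {
  feed_cols <- unlist(mapping, use.names = FALSE)
  missing_cols <- setdiff(feed_cols, names(feed))
  if (length(missing_cols) > 0) {
    stop("Feed is missing mapped columns: ", paste(missing_cols, collapse = ", "))
  }
  out <- feed[, feed_cols, drop = FALSE]
  names(out) <- names(mapping)
  out
}

# The mapping as it might look after parsing the YAML file (assumption:
# config$fieldMapping$Example would yield a named list like this one).
mapping <- list(
  source = "feed_title",
  title  = "item_title",
  link   = "item_link",
  text   = "item_text"
)

# Stand-in for a fetched feed; all values are made up for illustration.
feed <- data.frame(
  feed_title = "Example Feed",
  item_title = c("First item", "Second item"),
  item_link  = c("https://www.example.com/1", "https://www.example.com/2"),
  item_text  = c("Text of the first item", "Text of the second item"),
  stringsAsFactors = FALSE
)

db_rows <- apply_field_mapping(feed, mapping)
```

Stopping early on a missing column is a design choice made here so that a source whose feed lacks a mapped field is reported instead of silently producing an incomplete row.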
Feel free to open issues as needed! Please let me know if you have questions! :blush:
Example:
Let’s assume that our database has three fields: “Publication date”, “Title”, and “Text”. In the configuration file we define the news sources and which RSS fields match the database fields. This is exemplified here for CNN World and BBC.
CNN
- address: http://rss.cnn.com/rss/edition_world.rss
- Publication date: 5
- Title: 9
- Text: 11
BBC
- address: http://feeds.bbci.co.uk/news/world/rss.xml
- Publication date: …
- Title: …
- Text: …
Function to fetch data: my_data <- tidyfeed("http://rss.cnn.com/rss/edition_world.rss")
Then we use the numbers above to match the database columns to the RSS feed columns of each source.
For instance, the CNN publication date is stored in column 5 of the table returned by tidyfeed.
my_data[5]
An outer loop would run through each source, and an inner loop would run through the mapped columns within the table of that source.
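The two loops could be sketched roughly as follows. This is only an illustration under assumptions: `collect_sources()`, `sources`, and `fake_fetch()` are hypothetical names, the CNN column positions are the ones quoted above, and BBC is left out because its positions are not given here. The `fetch` argument defaults to `tidyRSS::tidyfeed()` but is replaced by an offline stand-in with invented values so the example runs without network access.

```r
# Hypothetical source configuration: database field -> column position in the feed table.
sources <- list(
  CNN = list(
    address = "http://rss.cnn.com/rss/edition_world.rss",
    columns = c("Publication date" = 5, "Title" = 9, "Text" = 11)
  )
)

# Outer loop: one iteration per source.
# Inner loop: one iteration per database field of that source.
collect_sources <- function(sources, fetch = function(url) tidyRSS::tidyfeed(url)) {
  results <- list()
  for (source_name in names(sources)) {
    src <- sources[[source_name]]
    feed <- fetch(src$address)
    fields <- list()
    for (field in names(src$columns)) {
      # Pick the feed column by its configured position
      fields[[field]] <- feed[[src$columns[[field]]]]
    }
    results[[source_name]] <- fields
  }
  results
}

# Offline stand-in for tidyfeed(): an 11-column table with made-up values.
fake_fetch <- function(url) {
  feed <- as.data.frame(matrix("", nrow = 2, ncol = 11), stringsAsFactors = FALSE)
  feed[[5]]  <- c("2024-01-01", "2024-01-02")
  feed[[9]]  <- c("First title", "Second title")
  feed[[11]] <- c("First text", "Second text")
  feed
}

news <- collect_sources(sources, fetch = fake_fetch)
news$CNN[["Title"]]
```

Positional indices may be fragile if a feed adds or drops columns, so mapping by column name (as in the YAML draft) might turn out to be the more robust variant.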
IMPORTANT: