arunge commented 4 months ago

Save daily RSS texts into a central database table
Use the tidyRSS R package and the function tidyfeed to read RSS sources
The database table will have a fixed column structure
A configuration R file is used to:
- List the RSS sources to be checked
- Map the fields from each RSS source to the database table
- Determine how often a check is made for new entries from each RSS source. Note that RSS sources are often updated more than once per day, so a check is also needed to ensure that entries are not written to the database twice.
An option to download the database -> rss-api
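
The duplicate check mentioned in the list above could look roughly like the following. This is only a minimal sketch in base R: the helper `filter_new_items()`, the `link` column, and the example URLs are assumptions for illustration and are not part of tidyRSS or an existing table. It assumes the item link is a usable unique key.

```r
# Hypothetical helper: keep only fetched rows whose link is not stored yet.
# `new_items` is a data frame with a `link` column (assumed name),
# `stored_links` is a character vector of links already in the database.
filter_new_items <- function(new_items, stored_links) {
  is_dup_in_batch <- duplicated(new_items$link)      # repeated within this fetch
  is_stored <- new_items$link %in% stored_links      # already in the database
  new_items[!is_dup_in_batch & !is_stored, , drop = FALSE]
}

# Made-up example data
fetched <- data.frame(
  title = c("A", "B", "B", "C"),
  link = c(
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/b",
    "https://example.com/c"
  ),
  stringsAsFactors = FALSE
)
stored_links <- c("https://example.com/a")

fresh <- filter_new_items(fetched, stored_links)
print(fresh$title)
```

In a real setup the stored links would be queried from the database before each insert, or the table could carry a unique constraint on the link column instead.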

Example:

Let’s assume that our database has three fields: “Publication date”, “Title”, and “Text”. In the configuration file we define the news sources and which RSS fields match the database fields. This is illustrated here for CNN World and BBC.

CNN

address: http://rss.cnn.com/rss/edition_world.rss
Publication date: 5
Title: 9
Text: 11

BBC

address: http://feeds.bbci.co.uk/news/world/rss.xml
Publication date: …
Title: …
Text: …

Function to fetch data: my_data <- tidyfeed("http://rss.cnn.com/rss/edition_world.rss")

Then we use the numbers above to match the database columns to the RSS feed columns of each source.

So, for instance, the CNN publication date is stored in column 5 of the RSS matrix.

my_data[5]

There would be a loop to run through each source and a second loop to run within the matrix of each source.
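
The two loops described above might be sketched as follows. This is a rough illustration under stated assumptions, not a final design: the source names `SourceA` and `SourceB`, the stand-in data frames, and the database column names are hypothetical, and the mapping uses column names rather than the column numbers from the example. The feed column names (`item_pub_date`, `item_title`, `item_description`) are assumed to match what `tidyfeed` returns for RSS feeds.

```r
# Stand-in feeds so the sketch runs offline; in practice each element
# would come from tidyRSS::tidyfeed(url).
feeds <- list(
  SourceA = data.frame(
    item_title = c("Headline 1", "Headline 2"),
    item_pub_date = c("2024-01-01", "2024-01-02"),
    item_description = c("Text 1", "Text 2"),
    stringsAsFactors = FALSE
  ),
  SourceB = data.frame(
    item_title = "Headline 3",
    item_pub_date = "2024-01-03",
    item_description = "Text 3",
    stringsAsFactors = FALSE
  )
)

# Hypothetical configuration: database column -> feed column, per source
field_mapping <- list(
  SourceA = c(publication_date = "item_pub_date", title = "item_title",
              text = "item_description"),
  SourceB = c(publication_date = "item_pub_date", title = "item_title",
              text = "item_description")
)

rows <- list()
for (source_name in names(feeds)) {            # outer loop: sources
  feed <- feeds[[source_name]]
  mapping <- field_mapping[[source_name]]
  mapped <- data.frame(source = rep(source_name, nrow(feed)),
                       stringsAsFactors = FALSE)
  for (db_column in names(mapping)) {          # inner loop: database columns
    mapped[[db_column]] <- feed[[mapping[[db_column]]]]
  }
  rows[[source_name]] <- mapped
}

# One table with the fixed column structure
combined <- do.call(rbind, unname(rows))
print(combined)
```

Mapping by name instead of by position may be less fragile if a feed adds or reorders fields, but either approach fits the configuration idea.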

IMPORTANT:

There should be some error control, since some sources may be “down”.
Checking for new texts should be done a certain number of times per day, as defined in the configuration file.
Our IT will set up a new server. The database system to be used should be able to hold a very large amount of content.
The download app interface should be very simple.
ONE new thing: in some cases the text is not given in the RSS, only a URL link. For these, the configuration file should instruct that the text is to be fetched from the source (e.g. using GET).
In the future: new tasks to analyse the data
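
The error control for sources that are “down” could be sketched with `tryCatch`, as below. This is a minimal sketch under assumptions: `safe_fetch()` and `fake_fetch()` are hypothetical helpers, and `fake_fetch()` only stands in for `tidyRSS::tidyfeed` so that the example runs without network access.

```r
# Hypothetical wrapper: returns NULL instead of stopping when a source fails.
# `fetch_fun` stands in for tidyRSS::tidyfeed.
safe_fetch <- function(url, fetch_fun) {
  tryCatch(
    fetch_fun(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in fetcher that fails for one made-up URL
fake_fetch <- function(url) {
  if (grepl("down", url, fixed = TRUE)) stop("could not connect")
  data.frame(item_title = "Example headline", stringsAsFactors = FALSE)
}

urls <- c(ok = "https://example.com/feed", broken = "https://example.com/down")
results <- lapply(urls, safe_fetch, fetch_fun = fake_fetch)

# Drop failed sources and continue with the rest
results <- Filter(Negate(is.null), results)
print(names(results))
```

With this pattern one failing source is logged and skipped, and the remaining sources are still processed in the same run.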

jan-abel-inwt commented 4 months ago

isodb specs:

CPU:       4x Single Core (4-Die): AMD EPYC 7443 type: MCM SMP speed: 2850 MHz
Drives:    Local Storage: total: 550.00 GiB used: 131.67 GiB (23.9%)
Memory: 23.44 GiB used: 3.75 GiB (16.0%)

arunge commented 4 months ago

@SarahWagner

You may check the iso-data repository as a draft for an ETL that we are already using, or use another draft.
As mentioned above, we can use tidyRSS::tidyfeed().

Use the tidyRSS R package and the function tidyfeed to read RSS sources

# Feed CNN
feed1 <- tidyRSS::tidyfeed("http://rss.cnn.com/rss/edition_world.rss")

# Extract a link from the feed, here the first
# (assuming "item_link" is the tidyfeed column that holds the item URL)
link <- feed1[["item_link"]][[1]]

# Fetch the content of the link
page <- httr::GET(link, httr::timeout(30))
page_content <- httr::content(page, as = "text")

# Parse the HTML content
html <- rvest::read_html(page_content)

# Extract the main text content
# You might need to adjust the CSS selector based on the specific structure of the web page
library(magrittr) # provides the %>% pipe used below
article_text <- html %>%
  rvest::html_nodes("p") %>% # Adjust this selector to target the main content
  rvest::html_text() %>%
  stringr::str_squish() %>%
  paste(collapse = " ") # Combine all text nodes into a single string

In order to set up the configuration
A configuration R file is used to:
- List the RSS sources to be checked
- Map the fields from each RSS source to the database table
we discussed defining a YAML file, which might look as follows:

inst/config.yaml

# RSS sources
source:
  Example:
    url: "https://www.example.com/feed"
    name: "Example Feed"
    description: "This is an example feed"
    category: "Example"

# Mapping for all columns of the target table on the database
# Please, add columns as needed (e.g. `date_last_updated`, `id`, ...)!
fieldMapping:
  Example:
    source: "feed_title"
    title: "item_title"
    link: "item_link"
    text: "item_text"
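
Reading such a file from R could look roughly like this. It is only a sketch: it assumes the `yaml` package is installed, and it parses an inline copy of the example with `yaml::yaml.load()` so that it runs without the file. For the real file, `yaml::read_yaml("inst/config.yaml")` would be the equivalent call.

```r
# Inline copy of the example configuration (shortened)
config_text <- '
source:
  Example:
    url: "https://www.example.com/feed"
    name: "Example Feed"
fieldMapping:
  Example:
    source: "feed_title"
    title: "item_title"
    link: "item_link"
    text: "item_text"
'

config <- yaml::yaml.load(config_text)

# Walk through the sources and look up the matching field mapping
for (source_id in names(config$source)) {
  url <- config$source[[source_id]]$url
  mapping <- unlist(config$fieldMapping[[source_id]])  # named character vector
  message(source_id, ": ", url)
  print(mapping)
}
```

The names of `mapping` are the database columns and its values are the feed columns, so it can be passed directly to whatever function renames the feed fields.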

Feel free to open issues as needed! Please let me know if you have questions! :blush:

Pandora-IsoMemo / rss-data

Saving RSS fetched text in a central database #1
