Pandora-IsoMemo / rss-data

ETL for RSS fetched texts

Saving RSS fetched text in a central database #1

Open arunge opened 3 months ago

arunge commented 3 months ago

Example:

Let’s assume that our database has three fields: “Publication date”, “Title”, and “Text”. In the configuration file we define the news sources and which RSS fields match the database fields, exemplified here for CNN World and BBC.

CNN

address: http://rss.cnn.com/rss/edition_world.rss
Publication date: 5
Title: 9
Text: 11

BBC

address: http://feeds.bbci.co.uk/news/world/rss.xml
Publication date: …
Title: …
Text: …

Function to fetch data: my_data <- tidyRSS::tidyfeed("http://rss.cnn.com/rss/edition_world.rss")

Then we use the numbers above to match the database columns to the RSS feed columns of each source.

So, for instance, the CNN publication date is stored in column 5 of the RSS matrix.

my_data[5]

There would be an outer loop that runs through each source and an inner loop that runs through the mapped columns within the matrix of each source.
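The two loops described above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the project's implementation: `sources`, `match_columns()` and `mock_fetch()` are hypothetical names introduced for illustration, and the BBC entry is left out because its indices are not given above. In real use, `fetch` would be `tidyRSS::tidyfeed`; the mock stands in so the example runs without network access.

```r
# Hypothetical configuration: one entry per source, with the feed address and
# the column index of each database field (CNN values taken from the example).
sources <- list(
  CNN = list(
    address = "http://rss.cnn.com/rss/edition_world.rss",
    columns = c("Publication date" = 5, "Title" = 9, "Text" = 11)
  )
)

# Build one database-shaped data frame per source.
# `fetch` is whatever returns the feed table, e.g. tidyRSS::tidyfeed.
match_columns <- function(sources, fetch) {
  result <- list()
  for (name in names(sources)) {        # outer loop: sources
    feed <- fetch(sources[[name]]$address)
    columns <- sources[[name]]$columns
    matched <- list()
    for (field in names(columns)) {     # inner loop: database fields
      matched[[field]] <- feed[[columns[[field]]]]
    }
    result[[name]] <- as.data.frame(matched, check.names = FALSE,
                                    stringsAsFactors = FALSE)
  }
  result
}

# Mock feed with 11 columns standing in for a real tidyfeed() result.
mock_fetch <- function(address) {
  feed <- as.data.frame(matrix("", nrow = 2, ncol = 11),
                        stringsAsFactors = FALSE)
  feed[[5]] <- c("2024-01-01", "2024-01-02")
  feed[[9]] <- c("First title", "Second title")
  feed[[11]] <- c("First text", "Second text")
  feed
}

db_rows <- match_columns(sources, fetch = mock_fetch)
db_rows$CNN
```

Positional indices break as soon as a feed adds or reorders a column, so matching by column name may turn out to be the more robust choice.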

IMPORTANT:

jan-abel-inwt commented 3 months ago

isodb specs:

CPU:       4x Single Core (4-Die): AMD EPYC 7443 type: MCM SMP speed: 2850 MHz
Drives:    Local Storage: total: 550.00 GiB used: 131.67 GiB (23.9%)
Memory:    23.44 GiB used: 3.75 GiB (16.0%)

arunge commented 3 months ago

@SarahWagner

# Provides the %>% pipe used below
library(magrittr)

# Feed CNN
feed1 <- tidyRSS::tidyfeed("http://rss.cnn.com/rss/edition_world.rss")

# Extract a link from the feed, here the first.
# mapping() is not defined in this snippet; it is assumed to be a project helper
# that returns the feed column name for each database field.
link <- feed1[[mapping()[["Link"]]]][[1]]

# Fetch the content of the link and stop on HTTP errors
page <- httr::GET(link, httr::timeout(30))
httr::stop_for_status(page)
page_content <- httr::content(page, as = "text", encoding = "UTF-8")

# Parse the HTML content
html <- rvest::read_html(page_content)

# Extract the main text content
# You might need to adjust the CSS selector based on the specific structure of the web page
text <- html %>%
  rvest::html_nodes("p") %>% # Adjust this selector to target the main content
  rvest::html_text() %>%
  stringr::str_squish() %>%
  paste(collapse = " ") # Combine all text nodes into a single string
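The snippet above handles only the first link of the feed. One possible way to extend it to every item is sketched below, assuming the fetch-and-parse steps are wrapped in a single-link function. `fetch_all_texts()` and `fake_fetch_text()` are hypothetical names used for illustration; the stand-in extractor replaces the httr/rvest calls so that the example runs offline.

```r
# Apply a single-link extractor to every link, recording NA where a page
# cannot be fetched or parsed instead of aborting the whole run.
fetch_all_texts <- function(links, fetch_text) {
  vapply(
    links,
    function(link) {
      tryCatch(
        fetch_text(link),
        error = function(e) NA_character_
      )
    },
    character(1),
    USE.NAMES = FALSE
  )
}

# Stand-in extractor for demonstration; a real one would wrap the
# httr/rvest steps shown above and return one string per link.
fake_fetch_text <- function(link) {
  if (grepl("broken", link)) stop("request failed")
  paste("text of", link)
}

texts <- fetch_all_texts(
  c("https://www.example.com/a", "https://www.example.com/broken"),
  fake_fetch_text
)
```

Returning `NA` for failed pages keeps the result aligned with the feed rows, so it can be attached as a text column before writing to the database.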

inst/config.yaml

# RSS sources
source:
  Example:
    url: "https://www.example.com/feed"
    name: "Example Feed"
    description: "This is an example feed"
    category: "Example"

# Mapping for all columns of the target table on the database
# Please, add columns as needed (e.g. `date_last_updated`, `id`, ...)!
fieldMapping:
  Example:
    source: "feed_title"
    title: "item_title"
    link: "item_link"
    text: "item_text"
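As a rough illustration of how the `fieldMapping` section might be applied, the sketch below selects the mapped feed columns and renames them to the database column names. `apply_mapping()` and the mock `feed` are hypothetical and only for illustration; the mapping is written out as an R list here, whereas in practice it could presumably be read with `yaml::read_yaml("inst/config.yaml")`. The sketch also assumes that an `item_text` column has already been added to the feed table, for example from a page-scraping step.

```r
# Parsed form of the `fieldMapping` section of the config (written out by hand
# here so the example does not depend on the yaml package).
field_mapping <- list(
  Example = list(
    source = "feed_title",
    title  = "item_title",
    link   = "item_link",
    text   = "item_text"
  )
)

# Select the mapped feed columns and rename them to the database columns.
apply_mapping <- function(feed, mapping) {
  feed_columns <- unlist(mapping)  # values: feed columns, names: database columns
  missing <- setdiff(feed_columns, names(feed))
  if (length(missing) > 0) {
    stop("Feed is missing columns: ", paste(missing, collapse = ", "))
  }
  out <- feed[, feed_columns, drop = FALSE]
  names(out) <- names(feed_columns)
  out
}

# Mock feed table standing in for a fetched and enriched feed.
feed <- data.frame(
  feed_title = "Example Feed",
  item_title = c("A", "B"),
  item_link  = c("https://www.example.com/a", "https://www.example.com/b"),
  item_text  = c("Text A", "Text B"),
  item_guid  = c("1", "2"),
  stringsAsFactors = FALSE
)

db_table <- apply_mapping(feed, field_mapping$Example)
```

Failing early on missing columns makes a misconfigured mapping visible before anything is written to the target table.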

Feel free to open issues as needed! Please let me know if you have questions! :blush: