Pandora-IsoMemo / rss-data

ETL for RSS fetched texts

List of actual RSS sources for a new database server #3

Open arunge opened 3 months ago

arunge commented 3 months ago

From https://github.com/Pandora-IsoMemo/rss-data/issues/1#issue-2386464673:

Our IT will set up a new server. The database system to be used should allow for very large content.

@SarahWagner you suggested to use MongoDB in the future, which is better for large datasets than our current database. I will discuss options with @isomemo in our next meeting.

Currently, we only have example data and write it into the existing database. In order to understand how large our data will be, we need at least some of the RSS sources that we will use in the future for analysis.

@isomemo Could you provide a list of RSS sources?

Then we can check the sources and make suggestions regarding the specifications for the new server.

isomemo commented 2 months ago

@arunge I have an RSS source list: News Project RSS Feeds.docx

However, the selected fields for each source are still missing. Jian will collaborate on this project and will help with this selection. Can you please add him as a project member so that he can comment?

arunge commented 2 months ago

@isomemo I invited Jian to this repository.

@policybot2020 Currently we use example RSS feeds in our ETL. You can find the config here: https://github.com/Pandora-IsoMemo/rss-data/blob/main/config.yaml

Could you update the config file with respect to the list provided by @isomemo (see above)?

Please let me know if you have any questions! Thanks in advance! :blush:

policybot2020 commented 2 months ago

@arunge @isomemo Is this the NLP text analysis project? I am not aware of this task.

arunge commented 1 month ago

@SarahWagner Please, check the document above for a list of sources. Let me know when you can roughly estimate how much storage will be needed if we want to save news feeds daily for a year for all the sources given in the document.

In the next step we can discuss options of storage and cleaning up. Thanks! :pray:
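
A back-of-the-envelope version of this estimate could look like the sketch below. It is written in Python for illustration only (the ETL itself is in R). All inputs are placeholders, not measurements: the number of sources, the items fetched per source per day, and the average stored size per item would have to be taken from the real feeds. The function name `estimate_storage_bytes` is hypothetical.

```python
def estimate_storage_bytes(n_sources, items_per_source_per_day,
                           avg_bytes_per_item, days=365):
    """Upper-bound estimate: assumes every fetched item is stored once,
    with no deduplication, compression, or index overhead."""
    return n_sources * items_per_source_per_day * avg_bytes_per_item * days


# Placeholder inputs (assumptions, to be replaced by measured values).
estimate = estimate_storage_bytes(
    n_sources=16,
    items_per_source_per_day=30,
    avg_bytes_per_item=5_000,
)
print(f"{estimate / 1e9:.2f} GB per year")
```

Because feeds repeat items between fetches, the real figure depends heavily on whether duplicates are dropped before storage.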

Please see:

arunge commented 1 month ago

@policybot2020

@isomemo mentioned that you will help to

Map the fields from each RSS source to the database table (https://github.com/Pandora-IsoMemo/rss-data/issues/1#issue-2386464673)

The list of RSS sources is given above in a Word document.

We now require a complete list of RSS feed sources in the format below (compare with our config.yaml):

# RSS sources
sources:
  Source_1:
    url: "http://rss.cnn.com/rss/edition_world.rss"
    name: "CNN" # not used
    description: "Edition World" # not used
    category: "Example 1" # not used
    id: 1
  Source_2:
    url: "http://feeds.bbci.co.uk/news/world/rss.xml"
    name: "BBC"
    description: "News World"
    category: "Example 2"
    id: 2

The fields url and name are required, description and category are optional.
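
A list in this format can be checked before it goes into the config. The following is a minimal sketch in Python (for illustration only; the ETL itself is in R) that works on the already-parsed `sources` mapping as a plain dictionary. The function name `validate_sources` is hypothetical, and the check for duplicate `id` values is an assumption: the thread does not state that ids must be unique.

```python
REQUIRED_FIELDS = ("url", "name")  # per the comment above; description/category are optional


def validate_sources(sources):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    seen_ids = {}  # id -> key of the first source that used it
    for key, entry in sources.items():
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"{key}: missing required field '{field}'")
        source_id = entry.get("id")
        if source_id is not None:
            # Assumption: ids are meant to be unique across sources.
            if source_id in seen_ids:
                problems.append(
                    f"{key}: id {source_id} already used by {seen_ids[source_id]}"
                )
            else:
                seen_ids[source_id] = key
    return problems


sources = {
    "Source_1": {"url": "http://rss.cnn.com/rss/edition_world.rss", "name": "CNN", "id": 1},
    "Source_2": {"url": "http://feeds.bbci.co.uk/news/world/rss.xml", "name": "BBC", "id": 2},
}
print(validate_sources(sources))
```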


Based on this config file each source will be read using: tidyRSS::tidyfeed(<url>)

You can test output with, e.g.

> tidyRSS::tidyfeed("http://rss.cnn.com/rss/edition_world.rss")
GET request successful. Parsing...

# A tibble: 29 × 14
   feed_title                feed_link feed_description feed_language feed_pub_date       feed_last_build_date feed_generator feed_ttl item_title item_link item_description
   <chr>                     <chr>     <chr>            <chr>         <dttm>              <dttm>               <chr>          <chr>    <chr>      <chr>     <chr>           
 1 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US         2023-04-24 22:55:36 2024-08-22 15:32:30  coredev-bumbl… 10       "Markets … https://…  NA             
 2 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US         2023-04-24 22:55:36 2024-08-22 15:32:30  coredev-bumbl… 10       "Still ha… https://… "So far this ta…
 3 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US         2023-04-24 22:55:36 2024-08-22 15:32:30  coredev-bumbl… 10       "Retail s… https://… "Spending at US…
 4 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US         2023-04-24 22:55:36 2024-08-22 15:32:30  coredev-bumbl… 10       "Analysis… https://… "This is it."   
 5 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US         2023-04-24 22:55:36 2024-08-22 15:32:30  coredev-bumbl… 10       "Silicon … https://… "When customers…
 6 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US         2023-04-24 22:55:36 2024-08-22 15:32:30  coredev-bumbl… 10       "Not only… https://… "Lake Powell, t…
 7 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US         2023-04-24 22:55:36 2024-08-22 15:32:30  coredev-bumbl… 10       "These we… https://… "Air pollution …
 8 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US         2023-04-24 22:55:36 2024-08-22 15:32:30  coredev-bumbl… 10       "Big-box … https://… "As the US atte…
 9 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US         2023-04-24 22:55:36 2024-08-22 15:32:30  coredev-bumbl… 10       "Look of … https://… "Bringing the s…
10 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US         2023-04-24 22:55:36 2024-08-22 15:32:30  coredev-bumbl… 10       "Scientis… https://… "\"Old Masters\…
# ℹ 19 more rows
# ℹ 3 more variables: item_pub_date <dttm>, item_guid <chr>, item_category <list>
# ℹ Use `print(n = ...)` to see more rows

For our examples tidyfeed() always returned the same columns:

c("feed_title", "feed_link", "feed_description", "feed_language", 
"feed_pub_date", "feed_last_build_date", "feed_generator", "feed_ttl", 
"item_title", "item_link", "item_description", "item_pub_date", 
"item_guid", "item_category")

Therefore, we currently have one mapping for all sources and use the following columns:

# Mapping for all fields of the rss sources
# Add columns as needed
# Only change the values (right from the colon), not the keys!
rssMapping:
  source_id: "source_id" # generated here
  source: "feed_title"
  title: "item_title"
  link: "item_link"
  text: "item_text"
  timestamp_feed_updated: "feed_last_build_date"
  timestamp_item_published: "item_pub_date"

If fields are missing, our ETL catches this and issues a warning:

https://github.com/Pandora-IsoMemo/rss-data/blob/cec72249157ab5fa863656a88d1183d6891bddd7/R/02_read_rss.R#L19-L33

However, at least "item_link" or "item_text" should be available.
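
The behaviour described here (apply one mapping, warn on missing columns, insist on a link or a text) can be sketched as follows. This is an illustration in Python, not the code in R/02_read_rss.R. `map_item` is a hypothetical name, `source_id` is left out because it is generated elsewhere, and raising an error when both link and text are absent is an assumed choice; the thread only says one of them "should be available".

```python
import warnings

# Target column -> feed column, copied from the rssMapping above
# (source_id omitted: it is generated by the ETL, not read from the feed).
RSS_MAPPING = {
    "source": "feed_title",
    "title": "item_title",
    "link": "item_link",
    "text": "item_text",
    "timestamp_feed_updated": "feed_last_build_date",
    "timestamp_item_published": "item_pub_date",
}


def map_item(item, mapping=RSS_MAPPING):
    """Rename feed columns to database columns, warning about missing ones."""
    row = {}
    for target, feed_column in mapping.items():
        value = item.get(feed_column)
        if value is None:
            warnings.warn(f"column '{feed_column}' is missing; '{target}' set to None")
        row[target] = value
    # Assumed policy: an item with neither a link nor a text is unusable.
    if row["link"] is None and row["text"] is None:
        raise ValueError("item has neither 'item_link' nor 'item_text'")
    return row


row = map_item({
    "feed_title": "CNN.com - RSS Channel",
    "item_title": "Example headline",
    "item_link": "https://example.org/article",
    "item_text": "Example body text",
    "feed_last_build_date": "2024-08-22 15:32:30",
    "item_pub_date": "2023-04-24 22:55:36",
})
print(row["title"], row["link"])
```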

Could you please

Let me know if you have questions! Thanks in advance! :pray:

policybot2020 commented 2 weeks ago

@isomemo @arunge Admin: can you add me as a team member for this rss-data repo again? The time for me to accept the invitation has expired. Once I am added, I can make a pull request.

Questions on SOURCES:

  1. MEHRNEWS: https://en.mehrnews.com/rss-help is NOT an RSS feed; it is a page listing RSS feeds. Are you trying to get the world news feed? If yes, then it is this URL: https://en.mehrnews.com/rss/tp/561
  2. RT news: https://www.feedspot.com/infiniterss.php?_src=feed_title&followfeedid=4371835&q=site:https%3A%2F%2Fwww.rt.com%2Frss%2F does not work. Possible fix: https://www.rt.com/rss/
  3. DW news: https://corporate.dw.com/en/rss/s-31500 returns an error. Possible fix: https://rss.dw.com/xml/rss-en-all
  4. China Daily: http://www.chinadaily.com.cn/rss/index.html is NOT an RSS feed. Possible fix: http://www.chinadaily.com.cn/rss/world_rss.xml
  5. SCMP (South China Morning Post): https://www.scmp.com/rss has the same issue; we need to pick a specific news section.
  6. France 24: https://www.france24.com/en/rss-feeds has the same issue; we need to pick a specific news section.

policybot2020 commented 2 weeks ago

PROPOSED NEW RSS URLS:

Source_0:
  url: "http://rss.cnn.com/rss/edition_world.rss"
  name: "CNN" # not used
  description: "Edition World" # not used
  category: "Example 1" # not used
  id: 1
Source_1:
  url: "https://rss.nytimes.com/services/xml/rss/nyt/World.xml"
  name: "New York Times"
  description: "World News"
  category: "International"
  id: 1
Source_2:
  url: "https://moxie.foxnews.com/google-publisher/world.xml"
  name: "Fox News"
  description: "World News"
  category: "International"
  id: 2
Source_3:
  url: "http://feeds.bbci.co.uk/news/world/rss.xml"
  name: "BBC News"
  description: "World News"
  category: "International"
  id: 3
Source_4:
  url: "https://www.theguardian.com/world/rss"
  name: "The Guardian"
  description: "World News"
  category: "International"
  id: 4
Source_5:
  url: "https://www.aljazeera.com/xml/rss/all.xml"
  name: "Al Jazeera"
  description: "All News"
  category: "International"
  id: 5
Source_6:
  url: "https://timesofindia.indiatimes.com/rssfeeds/296589292.cms"
  name: "Times of India"
  description: "World News"
  category: "International"
  id: 6
Source_7:
  url: "https://www.rt.com/rss/"
  name: "RT"
  description: "All News"
  category: "International"
  id: 7
Source_8:
  url: "https://rss.dw.com/xml/rss-en-all"
  name: "DW News"
  description: "All News"
  category: "International"
  id: 8
Source_9:
  url: "https://en.mehrnews.com/rss/tp/561"
  name: "Mehr News"
  description: "Top Stories"
  category: "International"
  id: 9
Source_10:
  url: "http://www.chinadaily.com.cn/rss/world_rss.xml"
  name: "China Daily"
  description: "World News"
  category: "International"
  id: 10
Source_11:
  url: "https://www.scmp.com/rss/91/feed"
  name: "South China Morning Post"
  description: "World News"
  category: "International"
  id: 11
Source_12:
  url: "https://www.telesurenglish.net/pages/rss.html"
  name: "Telesur"
  description: "World News"
  category: "International"
  id: 12
Source_13:
  url: "https://www.france24.com/en/rss"
  name: "France 24"
  description: "Top Stories"
  category: "International"
  id: 13
Source_14:
  url: "https://www.news24.com/World/Rss"
  name: "News24"
  description: "World News"
  category: "International"
  id: 14
Source_15:
  url: "https://www.sowetanlive.co.za/rss/?section=news"
  name: "Sowetan"
  description: "News"
  category: "International"
  id: 15
Source_16:
  url: "https://punchng.com/feed/"
  name: "The Punch"
  description: "All News"
  category: "International"
  id: 16

@isomemo @arunge let me know if these links work for you.
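
Several of the problems listed earlier were URLs that return an HTML page rather than a feed. One way to check whether a link works is to look at the root element of the returned document. The sketch below is in Python (for illustration only; the ETL reads feeds with tidyRSS in R) and uses only the standard library. `looks_like_feed` is a hypothetical helper; fetching the document is left out so the example runs offline.

```python
import xml.etree.ElementTree as ET

FEED_ROOTS = {"rss", "feed", "rdf"}  # RSS 2.0, Atom, RSS 1.0 (RDF)


def looks_like_feed(document):
    """True if the document parses as XML and its root is a known feed element."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        # Typical for HTML pages, which are rarely well-formed XML.
        return False
    # Drop a namespace prefix such as "{http://www.w3.org/2005/Atom}feed".
    local_name = root.tag.rsplit("}", 1)[-1].lower()
    return local_name in FEED_ROOTS


rss_sample = '<rss version="2.0"><channel><title>World</title></channel></rss>'
html_sample = "<html><body><p>List of RSS pages<br></body></html>"
print(looks_like_feed(rss_sample), looks_like_feed(html_sample))
```

This only tells a feed from a non-feed; it does not confirm that the feed contains the columns the mapping expects.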

policybot2020 commented 2 weeks ago

New Sources depending on research question:

Sometimes domestic news and propaganda (e.g. in Russia and China) reflect different sentiments and views from the English version. I have some other potential sources in the native language that could be included, depending on our research question. @isomemo Should we have a call about the research question?

arunge commented 2 weeks ago

@policybot2020 regarding:

Admin: can you add me as a team member for this rss-data repo again? The time for me to accept the invitation has expired. Once I am added, I can make a pull request.

I re-invited you, please check your email! :slightly_smiling_face:

arunge commented 2 weeks ago

@policybot2020 Regarding your list of sources, could you integrate your list directly into the config file in the following format and open a pull request?

https://github.com/Pandora-IsoMemo/rss-data/blob/cec72249157ab5fa863656a88d1183d6891bddd7/config.yaml#L9-L14

policybot2020 commented 1 week ago

@arunge I can do that.

I also want to add "author name" into the mapping if that's possible.

Jian

arunge commented 1 week ago

@SarahWagner Could you please give an example of how to add a new column "author_name" to the config file?