Open arunge opened 3 months ago
@arunge I have an RSS source list: News Project RSS Feeds.docx
However, the selected fields for each source are still missing. Jian will collaborate on this project and will help with this selection. Can you please add him as a project member so that he can comment?
@isomemo I invited Jian to this repository.
@policybot2020 Currently we use example RSS feeds in our ETL. You can find the config here: https://github.com/Pandora-IsoMemo/rss-data/blob/main/config.yaml
Could you update the config file with respect to the list provided by @isomemo (see above)?
Please let me know if you have any questions! Thanks in advance! :blush:
@arunge @isomemo Is this the NLP text analysis project? I am not aware of this task.
@SarahWagner Please check the document above for the list of sources. Let me know once you can roughly estimate how much storage will be needed if we save the news feeds daily for a year for all sources given in the document.
In the next step we can discuss options of storage and cleaning up. Thanks! :pray:
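A rough way to frame the storage question is a back-of-the-envelope product of sources, new items per day, and size per item. The sketch below is in Python purely for illustration (the ETL itself is in R); the function name and every input value are hypothetical placeholders, not measurements from the actual feeds.

```python
def estimate_storage_bytes(n_sources, new_items_per_source_per_day, bytes_per_item, days=365):
    """Return raw storage in bytes, before indexes, compression, or deduplication."""
    return n_sources * new_items_per_source_per_day * bytes_per_item * days


# All three inputs are assumptions chosen only to show the arithmetic.
n_sources = 20          # hypothetical number of feeds
new_items_per_day = 30  # hypothetical count of new items per feed per day
bytes_per_item = 2_000  # hypothetical size of one stored row (title, link, text, timestamps)

total = estimate_storage_bytes(n_sources, new_items_per_day, bytes_per_item)
print(f"{total / 1e9:.2f} GB per year")  # prints "0.44 GB per year"
```

Replacing the placeholders with values measured from a few days of real feed output would give a more reliable figure. Items that reappear in a feed on consecutive days should be counted only once.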
Please see:
@policybot2020
@isomemo mentioned that you will help to map the fields from each RSS source to the database table (https://github.com/Pandora-IsoMemo/rss-data/issues/1#issue-2386464673).
The list of RSS sources is given above in a Word document.
We now require a complete list of RSS feed sources in the format below (compare with our config.yaml):
# RSS sources
sources:
  Source_1:
    url: "http://rss.cnn.com/rss/edition_world.rss"
    name: "CNN" # not used
    description: "Edition World" # not used
    category: "Example 1" # not used
    id: 1
  Source_2:
    url: "http://feeds.bbci.co.uk/news/world/rss.xml"
    name: "BBC"
    description: "News World"
    category: "Example 2"
    id: 2
The fields url and name are required; description and category are optional.
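Before opening a pull request, the source entries could be checked against these rules. The following is a minimal sketch in Python (the ETL itself is in R), assuming the sources block has already been loaded into a plain dictionary. The helper name validate_sources is hypothetical, and the uniqueness check on id is an assumption, since the thread does not state whether ids must be unique.

```python
REQUIRED_FIELDS = ("url", "name")  # required according to the comment above


def validate_sources(sources):
    """Return a list of problems found; an empty list means the entries look fine."""
    problems = []
    seen_ids = {}  # maps each id to the first source key that used it
    for key, entry in sources.items():
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"{key}: missing required field '{field}'")
        source_id = entry.get("id")
        if source_id in seen_ids:
            # Assumption: ids are meant to be unique across sources.
            problems.append(f"{key}: id {source_id} already used by {seen_ids[source_id]}")
        elif source_id is not None:
            seen_ids[source_id] = key
    return problems


example = {
    "Source_1": {"url": "http://rss.cnn.com/rss/edition_world.rss", "name": "CNN", "id": 1},
    "Source_2": {"url": "http://feeds.bbci.co.uk/news/world/rss.xml", "id": 1},
}
for problem in validate_sources(example):
    print(problem)
```

In this example, Source_2 is reported twice: once for the missing name and once for reusing id 1.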
Based on this config file each source will be read using: tidyRSS::tidyfeed(<url>)
You can test output with, e.g.
> tidyRSS::tidyfeed("http://rss.cnn.com/rss/edition_world.rss")
GET request successful. Parsing...
# A tibble: 29 × 14
feed_title feed_link feed_description feed_language feed_pub_date feed_last_build_date feed_generator feed_ttl item_title item_link item_description
<chr> <chr> <chr> <chr> <dttm> <dttm> <chr> <chr> <chr> <chr> <chr>
1 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US 2023-04-24 22:55:36 2024-08-22 15:32:30 coredev-bumbl… 10 "Markets … https://… NA
2 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US 2023-04-24 22:55:36 2024-08-22 15:32:30 coredev-bumbl… 10 "Still ha… https://… "So far this ta…
3 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US 2023-04-24 22:55:36 2024-08-22 15:32:30 coredev-bumbl… 10 "Retail s… https://… "Spending at US…
4 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US 2023-04-24 22:55:36 2024-08-22 15:32:30 coredev-bumbl… 10 "Analysis… https://… "This is it."
5 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US 2023-04-24 22:55:36 2024-08-22 15:32:30 coredev-bumbl… 10 "Silicon … https://… "When customers…
6 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US 2023-04-24 22:55:36 2024-08-22 15:32:30 coredev-bumbl… 10 "Not only… https://… "Lake Powell, t…
7 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US 2023-04-24 22:55:36 2024-08-22 15:32:30 coredev-bumbl… 10 "These we… https://… "Air pollution …
8 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US 2023-04-24 22:55:36 2024-08-22 15:32:30 coredev-bumbl… 10 "Big-box … https://… "As the US atte…
9 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US 2023-04-24 22:55:36 2024-08-22 15:32:30 coredev-bumbl… 10 "Look of … https://… "Bringing the s…
10 CNN.com - RSS Channel - … https://… CNN.com deliver… en-US 2023-04-24 22:55:36 2024-08-22 15:32:30 coredev-bumbl… 10 "Scientis… https://… "\"Old Masters\…
# ℹ 19 more rows
# ℹ 3 more variables: item_pub_date <dttm>, item_guid <chr>, item_category <list>
# ℹ Use `print(n = ...)` to see more rows
For our examples, tidyfeed() always returned the same columns:
c("feed_title", "feed_link", "feed_description", "feed_language",
"feed_pub_date", "feed_last_build_date", "feed_generator", "feed_ttl",
"item_title", "item_link", "item_description", "item_pub_date",
"item_guid", "item_category")
Therefore, we currently have a single mapping for all sources and use the following columns:
# Mapping for all fields of the rss sources
# Add columns as needed
# Only change the values (right from the colon), not the keys!
rssMapping:
  source_id: "source_id" # generated here
  source: "feed_title"
  title: "item_title"
  link: "item_link"
  text: "item_text"
  timestamp_feed_updated: "feed_last_build_date"
  timestamp_item_published: "item_pub_date"
If fields are missing, our ETL catches this and gives a warning.
However, at least "item_link" or "item_text" should be available.
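The mapping and the missing-field rule can be illustrated with a small sketch. It is written in Python for illustration only and is not the project's R implementation. The function name map_item is hypothetical, source_id is left out because it is generated by the ETL, and raising an error when both "item_link" and "item_text" are absent is an assumption about how strictly the rule should be enforced.

```python
import warnings

# Table column -> feed column, copied from rssMapping above (without source_id).
RSS_MAPPING = {
    "source": "feed_title",
    "title": "item_title",
    "link": "item_link",
    "text": "item_text",
    "timestamp_feed_updated": "feed_last_build_date",
    "timestamp_item_published": "item_pub_date",
}


def map_item(item, mapping=RSS_MAPPING):
    """Rename feed columns to table columns, warning about any that are missing."""
    missing = [col for col in mapping.values() if item.get(col) is None]
    if "item_link" in missing and "item_text" in missing:
        # Assumption: an item without link and text is unusable, so fail loudly.
        raise ValueError("neither 'item_link' nor 'item_text' is available")
    if missing:
        warnings.warn(f"missing feed columns: {', '.join(missing)}")
    return {target: item.get(col) for target, col in mapping.items()}


row = map_item({
    "feed_title": "Example feed",
    "item_title": "Example headline",
    "item_link": "https://example.org/article",
})
print(row["link"], row["text"])
```

Here the item has a link but no text, so it is kept with a warning and the missing columns are stored as empty values.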
Could you please update the list of sources in config.yaml and check for which sources "item_link" and "item_text" are missing?
Let me know if you have questions! Thanks in advance! :pray:
@isomemo @arunge
Admin: can you add me as a team member for this rss-data repo again? The time for me to accept the invitation has expired. Once I am added, I can make a pull request.
Questions on SOURCES:
https://en.mehrnews.com/rss-help
is NOT an RSS feed; it is a list of RSS pages. Are you trying to get the world news page? If yes, then it is this URL: https://en.mehrnews.com/rss/tp/561
https://www.feedspot.com/infiniterss.php?_src=feed_title&followfeedid=4371835&q=site:https%3A%2F%2Fwww.rt.com%2Frss%2F
does not work. Possible fix: https://www.rt.com/rss/
https://corporate.dw.com/en/rss/s-31500
returns an error. Possible fix: https://rss.dw.com/xml/rss-en-all
http://www.chinadaily.com.cn/rss/index.html
is NOT an RSS feed. Possible fix: http://www.chinadaily.com.cn/rss/world_rss.xml
https://www.france24.com/en/rss-feeds
has the same issue: we need to pick a specific section of the news.
Source_0:
  url: "http://rss.cnn.com/rss/edition_world.rss"
  name: "CNN" # not used
  description: "Edition World" # not used
  category: "Example 1" # not used
  id: 1
Source_1:
  url: "https://rss.nytimes.com/services/xml/rss/nyt/World.xml"
  name: "New York Times"
  description: "World News"
  category: "International"
  id: 1
Source_2:
  url: "https://moxie.foxnews.com/google-publisher/world.xml"
  name: "Fox News"
  description: "World News"
  category: "International"
  id: 2
Source_3:
  url: "http://feeds.bbci.co.uk/news/world/rss.xml"
  name: "BBC News"
  description: "World News"
  category: "International"
  id: 3
Source_4:
  url: "https://www.theguardian.com/world/rss"
  name: "The Guardian"
  description: "World News"
  category: "International"
  id: 4
Source_5:
  url: "https://www.aljazeera.com/xml/rss/all.xml"
  name: "Al Jazeera"
  description: "All News"
  category: "International"
  id: 5
Source_6:
  url: "https://timesofindia.indiatimes.com/rssfeeds/296589292.cms"
  name: "Times of India"
  description: "World News"
  category: "International"
  id: 6
Source_7:
  url: "https://www.rt.com/rss/"
  name: "RT"
  description: "All News"
  category: "International"
  id: 7
Source_8:
  url: "https://rss.dw.com/xml/rss-en-all"
  name: "DW News"
  description: "All News"
  category: "International"
  id: 8
Source_9:
  url: "https://en.mehrnews.com/rss/tp/561"
  name: "Mehr News"
  description: "Top Stories"
  category: "International"
  id: 9
Source_10:
  url: "http://www.chinadaily.com.cn/rss/world_rss.xml"
  name: "China Daily"
  description: "World News"
  category: "International"
  id: 10
Source_11:
  url: "https://www.scmp.com/rss/91/feed"
  name: "South China Morning Post"
  description: "World News"
  category: "International"
  id: 11
Source_12:
  url: "https://www.telesurenglish.net/pages/rss.html"
  name: "Telesur"
  description: "World News"
  category: "International"
  id: 12
Source_13:
  url: "https://www.france24.com/en/rss"
  name: "France 24"
  description: "Top Stories"
  category: "International"
  id: 13
Source_14:
  url: "https://www.news24.com/World/Rss"
  name: "News24"
  description: "World News"
  category: "International"
  id: 14
Source_15:
  url: "https://www.sowetanlive.co.za/rss/?section=news"
  name: "Sowetan"
  description: "News"
  category: "International"
  id: 15
Source_16:
  url: "https://punchng.com/feed/"
  name: "The Punch"
  description: "All News"
  category: "International"
  id: 16
@isomemo @arunge let me know if these links work for you.
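Several of the questions above come down to whether a URL returns an actual feed or an HTML page that merely lists feeds. One way to tell the two apart, sketched here in Python as an illustration rather than as part of the R ETL, is to parse the response body as XML and look at the root element. The function name looks_like_feed is hypothetical, and downloading the document is deliberately left out so the example runs offline on two inline samples.

```python
import xml.etree.ElementTree as ET


def looks_like_feed(document):
    """Return True if the text parses as XML with an RSS, Atom, or RDF root element."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        return False  # not well-formed XML, e.g. a typical HTML page
    local_name = root.tag.rsplit("}", 1)[-1].lower()  # drop any XML namespace prefix
    return local_name in {"rss", "feed", "rdf"}


rss_sample = '<rss version="2.0"><channel><title>Example</title></channel></rss>'
html_sample = "<html><body><p>List of RSS pages</p></body></html>"
print(looks_like_feed(rss_sample), looks_like_feed(html_sample))  # prints "True False"
```

A passing check only shows that the document has the shape of a feed. Whether tidyfeed() can read it and which columns it returns would still need to be tested in R.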
Sometimes the domestic-language news and propaganda (e.g. in Russia and China) reflect different sentiments and views from the English version. I have some other potential sources in the outlets' native languages that could be included, depending on our research question. @isomemo Should we have a call about the research question?
@policybot2020 regarding:
Admin: can you add me as a team member for this rss-data repo again? The time for me to accept the invitation has expired. Once I am added, I can make a pull request.
I re-invited you, please check your email! :slightly_smiling_face:
@policybot2020 Regarding your list of sources, could you integrate your list directly into the config file in the following format and open a pull request?
@arunge I can do that.
I also want to add "author name" into the mapping if that's possible.
Jian
From https://github.com/Pandora-IsoMemo/rss-data/issues/1#issue-2386464673:
@SarahWagner you suggested to use MongoDB in the future, which is better for large datasets than our current database. I will discuss options with @isomemo in our next meeting.
Currently, we only have example data and write it into the existing database. In order to understand how large our data will be, we need at least some of the RSS sources that we will use in the future for analysis.
@isomemo Could you provide a list of RSS sources?
Then we can check the sources and make suggestions regarding the specifications for the new server.