RobertMyles / tidyRSS

An R package for extracting 'tidy' data frames from RSS, Atom and JSON feeds
https://robertmyles.github.io/tidyRSS/

'entry_link' field repeated for all entries? #70

Closed Arf9999 closed 1 year ago

Arf9999 commented 1 year ago

I'm using tidyRSS on a Google Alerts RSS feed, as follows:

library(tidyverse)
library(tidyRSS)

testfeed <- tidyfeed("https://www.google.com/alerts/feeds/10423121203014744224/5457667855038883028")
> head(testfeed)
# A tibble: 6 × 11
  feed_title           feed_url         feed_last_updated   feed_…¹ entry…² entry…³ entry_last_updated  entry…⁴ entry…⁵ entry_cate…⁶ entry_published    
  <chr>                <chr>            <dttm>              <chr>   <chr>   <chr>   <dttm>              <chr>   <chr>   <list>       <dttm>             
1 Google Alert - Eskom tag:google.com,… 2023-02-28 08:55:58 https:… DA tab… tag:go… 2023-02-28 08:55:58 Ex-CEO… https:… <named list> 2023-02-28 08:55:58
2 Google Alert - Eskom tag:google.com,… 2023-02-28 08:55:58 https:… Renald… tag:go… 2023-02-28 08:54:00 &#39;T… https:… <named list> 2023-02-28 08:54:00
3 Google Alert - Eskom tag:google.com,… 2023-02-28 08:55:58 https:… ANC&#3… tag:go… 2023-02-28 08:36:04 This i… https:… <named list> 2023-02-28 08:36:04
4 Google Alert - Eskom tag:google.com,… 2023-02-28 08:55:58 https:… <b>Esk… tag:go… 2023-02-28 08:33:52 Mantsh… https:… <named list> 2023-02-28 08:33:52
5 Google Alert - Eskom tag:google.com,… 2023-02-28 08:55:58 https:… Sikona… tag:go… 2023-02-28 08:19:49 Eskom … https:… <named list> 2023-02-28 08:19:49
6 Google Alert - Eskom tag:google.com,… 2023-02-28 08:55:58 https:… Learne… tag:go… 2023-02-28 07:55:38 Grade … https:… <named list> 2023-02-28 07:55:38
# … with abbreviated variable names ¹​feed_link, ²​entry_title, ³​entry_url, ⁴​entry_content, ⁵​entry_link, ⁶​entry_category

All looks good, except...

> head(testfeed$entry_link)
[1] "https://www.google.com/url?rct=j&sa=t&url=https://www.da.org.za/2023/02/da-tables-urgent-establishment-of-ad-hoc-parliamentary-committee-to-investigate-corruption-at-eskom&ct=ga&cd=CAIyHTg0MDE1ZWRmMjE0M2Y0MmU6Y29tOmVuOlpBOlJM&usg=AOvVaw20IiW5cUdIt7JR8frfATTj"
[2] "https://www.google.com/url?rct=j&sa=t&url=https://www.da.org.za/2023/02/da-tables-urgent-establishment-of-ad-hoc-parliamentary-committee-to-investigate-corruption-at-eskom&ct=ga&cd=CAIyHTg0MDE1ZWRmMjE0M2Y0MmU6Y29tOmVuOlpBOlJM&usg=AOvVaw20IiW5cUdIt7JR8frfATTj"
[3] "https://www.google.com/url?rct=j&sa=t&url=https://www.da.org.za/2023/02/da-tables-urgent-establishment-of-ad-hoc-parliamentary-committee-to-investigate-corruption-at-eskom&ct=ga&cd=CAIyHTg0MDE1ZWRmMjE0M2Y0MmU6Y29tOmVuOlpBOlJM&usg=AOvVaw20IiW5cUdIt7JR8frfATTj"
[4] "https://www.google.com/url?rct=j&sa=t&url=https://www.da.org.za/2023/02/da-tables-urgent-establishment-of-ad-hoc-parliamentary-committee-to-investigate-corruption-at-eskom&ct=ga&cd=CAIyHTg0MDE1ZWRmMjE0M2Y0MmU6Y29tOmVuOlpBOlJM&usg=AOvVaw20IiW5cUdIt7JR8frfATTj"
[5] "https://www.google.com/url?rct=j&sa=t&url=https://www.da.org.za/2023/02/da-tables-urgent-establishment-of-ad-hoc-parliamentary-committee-to-investigate-corruption-at-eskom&ct=ga&cd=CAIyHTg0MDE1ZWRmMjE0M2Y0MmU6Y29tOmVuOlpBOlJM&usg=AOvVaw20IiW5cUdIt7JR8frfATTj"
[6] "https://www.google.com/url?rct=j&sa=t&url=https://www.da.org.za/2023/02/da-tables-urgent-establishment-of-ad-hoc-parliamentary-committee-to-investigate-corruption-at-eskom&ct=ga&cd=CAIyHTg0MDE1ZWRmMjE0M2Y0MmU6Y29tOmVuOlpBOlJM&usg=AOvVaw20IiW5cUdIt7JR8frfATTj"

All entry_link URLs are identical. The raw feed from Google shows a different link for each entry:

<feed xmlns="http://www.w3.org/2005/Atom" xmlns:idx="urn:atom-extension:indexing">
<id>tag:google.com,2005:reader/user/10423121203014744224/state/com.google/alerts/5457667855038883028</id>
<title>Google Alert - Eskom</title>
<link href="https://www.google.com/alerts/feeds/10423121203014744224/5457667855038883028" rel="self"/>
<updated>2023-02-28T08:55:58Z</updated>
<entry>
<id>tag:google.com,2013:googlealerts/feed:15717783206228794606</id>
<title type="html">DA tables urgent establishment of ad hoc parliamentary committee to investigate corruption at <b>Eskom</b></title>
<link href="https://www.google.com/url?rct=j&sa=t&url=https://www.da.org.za/2023/02/da-tables-urgent-establishment-of-ad-hoc-parliamentary-committee-to-investigate-corruption-at-eskom&ct=ga&cd=CAIyHTg0MDE1ZWRmMjE0M2Y0MmU6Y29tOmVuOlpBOlJM&usg=AOvVaw20IiW5cUdIt7JR8frfATTj"/>
<published>2023-02-28T08:55:58Z</published>
<updated>2023-02-28T08:55:58Z</updated>
<content type="html">Ex-CEO Andre De Ruyter&#39;s eNCA interview last week strongly suggests that <b>Eskom</b> power stations, the heartbeat of the nation&#39;s economy,&nbsp;...</content>
<author>
<name/>
</author>
</entry>
<entry>
<id>tag:google.com,2013:googlealerts/feed:15977829568103720142</id>
<title type="html">Renaldo Gouws&#39; solar panel &#39;excrement&#39; upsets <b>Eskom</b> - The Citizen</title>
<link href="https://www.google.com/url?rct=j&sa=t&url=https://www.citizen.co.za/news/renaldo-gouws-solar-panel-excrement-upsets-eskom/&ct=ga&cd=CAIyHTg0MDE1ZWRmMjE0M2Y0MmU6Y29tOmVuOlpBOlJM&usg=AOvVaw2Qe_n8GcZWLmOVLts9e2pI"/>
<published>2023-02-28T08:54:00Z</published>
<updated>2023-02-28T08:54:00Z</updated>
<content type="html">&#39;That was a demonstration of a microgrid <b>Eskom</b> is rolling out across the country to communities far away from grid connections.&#39;</content>
<author>
<name/>
</author>
</entry>
<entry>
<id>tag:google.com,2013:googlealerts/feed:7459851617020428847</id>
<title type="html">ANC&#39;s reaction to <b>Eskom</b> revelations exposes a party in... - Daily Maverick</title>
<link href="https://www.google.com/url?rct=j&sa=t&url=https://www.dailymaverick.co.za/article/2023-02-27-ancs-reaction-to-eskom-revelations-exposes-a-party-in-denial-of-reality-and-in-a-deep-ethical-crisis/&ct=ga&cd=CAIyHTg0MDE1ZWRmMjE0M2Y0MmU6Y29tOmVuOlpBOlJM&usg=AOvVaw1PM6p9T631rbNH7WkLQJuQ"/>
<published>2023-02-28T08:36:04Z</published>
<updated>2023-02-28T08:36:04Z</updated>
<content type="html">This is what makes its response to the claims of corruption at <b>Eskom</b> so startling — there appears to be no understanding of how vulnerable the&nbsp;...</content>
<author>
<name/>
</author>
</entry>
<entry>
<id>tag:google.com,2013:googlealerts/feed:7129261029279881155</id>
<title type="html"><b>Eskom</b> spokesman Sikonathi Mantshantsha to step down - TechCentral</title>
<link href="https://www.google.com/url?rct=j&sa=t&url=https://techcentral.co.za/eskom-spokesman-sikonathi-mantshantsha-to-step-down/222698/&ct=ga&cd=CAIyHTg0MDE1ZWRmMjE0M2Y0MmU6Y29tOmVuOlpBOlJM&usg=AOvVaw0IJEFPWb7vJtyRFEPZybWP"/>
<published>2023-02-28T08:33:52Z</published>
<updated>2023-02-28T08:33:52Z</updated>
<content type="html">Mantshantsha joined <b>Eskom</b> after a career in journalism that included writing critically about the state-owned electricity utility for publications&nbsp;...</content>
<author>
<name/>
</author>
</entry>

...etc....

Is this an issue with Google, or something that I can adjust in the settings?
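
One way to tell a feed problem from a parsing problem is to read the links straight out of the XML. The sketch below does that with xml2 on a trimmed-down, made-up Atom document (the titles and example.com links are placeholders, not the real feed). It assumes xml2 is installed and relies on xml2 assigning the prefix d1 to the default namespace.

```r
library(xml2)

# Placeholder Atom document; the real Google Alerts feed is not reproduced here
atom <- '<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry><title>First</title><link href="https://example.com/first"/></entry>
  <entry><title>Second</title><link href="https://example.com/second"/></entry>
</feed>'

doc <- read_xml(atom)
ns <- xml_ns(doc)  # the default Atom namespace is exposed under the prefix d1

# One node per <entry>, then the first <link> inside each entry
entries <- xml_find_all(doc, ".//d1:entry", ns)
links <- xml_attr(xml_find_first(entries, "./d1:link", ns), "href")
links
```

If the links differ here but not in the tidyfeed() output, the feed is fine and the repetition is introduced during parsing.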

Arf9999 commented 1 year ago

Update: I've found the issue. It is in the atom_parse function.

If I adjust the following code (lines 32 - 47) of atom_parse.R it works correctly for my usage:

  e_link <- xml_find_first(res_entry, glue("{ns_entry}:link")) %>%
    xml_attr("href")

  # optional
  entries <- tibble(
    entry_title = safe_run(res_entry, "all", glue("{ns_entry}:title")),
    entry_url = safe_run(res_entry, "all", glue("{ns_entry}:id")),
    entry_last_updated = safe_run(res_entry, "all", glue("{ns_entry}:updated")),
    entry_author = safe_run(res_entry, "all", glue("{ns_entry}:author")),
    entry_content = safe_run(res_entry, "all", glue("{ns_entry}:content")),
    entry_link = ifelse(!is.null(e_link), e_link, def),
    entry_summary = safe_run(res_entry, "all", glue("{ns_entry}:summary")),
    entry_category = list(NA),
    entry_published = safe_run(res_entry, "all", glue("{ns_entry}:published")),
    entry_rights = safe_run(res_entry, "all", glue("{ns_entry}:rights"))
  )

to:

  # optional
  entries <- tibble(
    entry_title = safe_run(res_entry, "all", glue("{ns_entry}:title")),
    entry_url = safe_run(res_entry, "all", glue("{ns_entry}:id")),
    entry_last_updated = safe_run(res_entry, "all", glue("{ns_entry}:updated")),
    entry_author = safe_run(res_entry, "all", glue("{ns_entry}:author")),
    entry_content = safe_run(res_entry, "all", glue("{ns_entry}:content")),
    entry_link = xml_attr(xml_find_first(res_entry, glue("{ns_entry}:link")), "href"),
    entry_summary = safe_run(res_entry, "all", glue("{ns_entry}:summary")),
    entry_category = list(NA),
    entry_published = safe_run(res_entry, "all", glue("{ns_entry}:published")),
    entry_rights = safe_run(res_entry, "all", glue("{ns_entry}:rights"))
  )

Lines 32 and 33 are deleted and line 42 is modified.

I'm not sure if this is generalisable, or if it is simply my particular usage.
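
A likely reason the original code repeats one link: ifelse() returns a result the same length as its test, and !is.null(e_link) is a single TRUE, so only the first href survives and is then recycled down the column. A minimal base R sketch of that behaviour, using made-up links and assuming def is NA_character_ (the package's actual default is not shown in this thread):

```r
# Stand-in for the per-entry hrefs that xml_attr() would return
e_link <- c("https://example.com/a", "https://example.com/b", "https://example.com/c")
def <- NA_character_  # assumption: the real `def` in tidyRSS may differ

# The test is one TRUE, so ifelse() keeps only the first element of e_link
collapsed <- ifelse(!is.null(e_link), e_link, def)
length(collapsed)

# A length-1 column is recycled to every row, which repeats the first link
entries <- data.frame(
  entry_title = c("A", "B", "C"),
  entry_link = collapsed,
  stringsAsFactors = FALSE
)
entries$entry_link
```

Passing the href vector directly, as in the modified line, avoids the collapse because nothing shortens it before it reaches the column.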

RobertMyles commented 1 year ago

Thanks Andrew, I'll have a look at this asap.

Arf9999 commented 1 year ago

I've removed an ifelse() test which I guess was there for a reason, so my mucking about may cause other issues, but it seems OK for me with my particular feed.

RobertMyles commented 1 year ago

I've added this in. Thanks Andrew, I'll put you down as a contributor.

Arf9999 commented 1 year ago

Yeah... about that...

I figured out the reason for your initial 'if else'. If there are no entries for a feed, the script fails with an error. Hence the original test for is.null (it was just in the wrong place).

Sorry to do this, but could you replace my bad fix with this good one for line 52 of your revised code:

entry_link = ifelse(!is.null(xml_attr(xml_find_first(res_entry, glue("{ns_entry}:link")), "href")),
                    xml_attr(xml_find_first(res_entry, glue("{ns_entry}:link")), "href"), def),

This now includes the test for null entries that I cavalierly deleted from your original code.
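
One thing worth checking before adopting the line above: because the ifelse() test is still a single TRUE or FALSE, the result is length 1 again, which may bring the repeated link back. A hedged alternative is a plain if/else on the length of the href vector. The helper below is hypothetical (pick_links is not part of tidyRSS) and assumes that an empty feed yields a zero-length character vector rather than NULL, and that def is NA_character_.

```r
# Hypothetical helper, not part of tidyRSS: keep every href when at least one
# entry exists, otherwise fall back to a default value.
pick_links <- function(hrefs, def = NA_character_) {
  if (length(hrefs) > 0) hrefs else def
}

# Two made-up links: both are kept
kept <- pick_links(c("https://example.com/a", "https://example.com/b"))

# No entries: the default is returned instead
empty <- pick_links(character(0))

# For comparison, ifelse() with a single TRUE test keeps only the first link
first_only <- ifelse(TRUE, c("https://example.com/a", "https://example.com/b"), NA_character_)
```

Whether this guard is sufficient depends on what the other columns return for an empty feed, so it would need testing against a feed with no entries.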

(I really should do a PR but honestly don't know how to)

Thanks for the package!

RobertMyles commented 1 year ago

Hi Andrew, to do a PR, just fork the repo: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork

I've just pushed a new version of this to CRAN so this new change will have to wait for a little while, but I'd be happy to incorporate it into the package.