RobertMyles / tidyRSS

An R package for extracting 'tidy' data frames from RSS, Atom and JSON feeds
https://robertmyles.github.io/tidyRSS/

Tibble parsing error (Error: Tibble columns must have consistent lengths, only values of length one are recycled) #40

Closed alastairrushworth closed 4 years ago

alastairrushworth commented 4 years ago

Hi Rob!

The new version of tidyRSS is great :)

I noticed that some feeds that parsed fine with a previous tidyRSS version now fail. I've attached a single example here. The failure seems to occur somewhere in the parsing of the feed into a tibble.

This is using the most up-to-date version:

devtools::install_github('RobertMyles/tidyRSS')
library(tidyRSS)
tidyfeed("http://bigcomputing.blogspot.com/feeds/posts/default")

GET request successful. Parsing...

Error: Tibble columns must have consistent lengths, only values of length one are recycled:
* Length 25: Columns `entry_title`, `entry_url`, `entry_last_updated`, `entry_content`, `entry_published`
* Length 26: Column `entry_author`
Run `rlang::last_error()` to see where the error occurred.

Using a slightly older version (I think this commit was in January):

devtools::install_github('RobertMyles/tidyRSS', ref = '35bcbb7e15be1c0edc1ca07cc33de64923a55a32')
# RESTART R FIRST
library(tidyRSS)
tidyfeed("http://bigcomputing.blogspot.com/feeds/posts/default")

# A tibble: 25 x 8
   feed_title feed_link feed_author feed_last_updat… item_title item_date_updated  
   <chr>      <chr>     <chr>       <chr>            <chr>      <dttm>             
 1 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… Setting u… 2016-02-24 00:21:50
 2 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… A Machine… 2016-02-23 21:35:30
 3 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… The Guess… 2016-02-16 12:58:40
 4 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… Official … 2016-02-14 22:24:35
 5 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… The Five … 2016-02-08 20:14:09
 6 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… A Controv… 2015-08-05 15:51:26
 7 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… Performan… 2015-07-17 18:31:18
 8 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… An exampl… 2015-07-14 16:55:36
 9 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… Fastest w… 2015-07-13 17:03:02
10 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… The R Con… 2015-07-02 14:15:57
# … with 15 more rows, and 2 more variables: item_link <chr>, item_content <chr>

Thanks,

Alastair

RobertMyles commented 4 years ago

Hi Alastair, thanks for reporting this. I thought I'd made the package a bit more bug-proof, but obviously not. And this feed had some surprises for me.

Anyway, there is now a fix in the 'namespace' branch (remotes::install_github("robertmyles/tidyrss@namespace")). The problem was the XPath used to find the entries. This could potentially be a problem for other feeds too, so I appreciate the other issue you opened linking to those feeds, as I can use them for testing.
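
For anyone wondering how an XPath can produce the 25-versus-26 length mismatch in the error above, here is a minimal sketch of one plausible cause. It is not the actual tidyRSS code: the feed below is made up, the variable names are hypothetical, and it assumes the extra author came from a feed-level <author> element being matched alongside the per-entry ones. It uses only xml2 and base R.

```r
library(xml2)

# A tiny, made-up Atom feed: one feed-level <author> plus two entries.
feed_xml <- '<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <author><name>Feed Author</name></author>
  <entry>
    <title>First post</title>
    <author><name>Alice</name></author>
  </entry>
  <entry>
    <title>Second post</title>
    <author><name>Bob</name></author>
  </entry>
</feed>'

doc <- read_xml(feed_xml)
xml_ns_strip(doc)  # drop the default Atom namespace so plain element names match

# Too broad: "//author/name" also matches the feed-level author,
# so this vector is one element longer than the vector of titles.
all_authors <- xml_text(xml_find_all(doc, "//author/name"))
all_titles  <- xml_text(xml_find_all(doc, "//entry/title"))
length(all_authors)  # 3
length(all_titles)   # 2

# Scoped: find the entries first, then look inside each one.
# xml_find_first() returns exactly one result per entry (NA if missing),
# so every column ends up with the same length.
entries <- xml_find_all(doc, "//entry")
entry_df <- data.frame(
  entry_title  = xml_text(xml_find_first(entries, "./title")),
  entry_author = xml_text(xml_find_first(entries, "./author/name")),
  stringsAsFactors = FALSE
)
entry_df
```

Building the columns per entry rather than per document is the design choice that matters here: a missing or extra element then shows up as an NA or is ignored, instead of shifting the length of one column.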

I'll play with this a bit more over the next week and merge it into the master branch asap. Here's how it looks now:

> tidyfeed("http://bigcomputing.blogspot.com/feeds/posts/default") 
GET request successful. Parsing...

# A tibble: 25 x 15
   feed_title feed_url feed_last_updated   feed_author feed_link feed_category feed_generator
   <chr>      <chr>    <dttm>              <chr>       <chr>     <list>        <chr>         
 1 Big Compu… tag:blo… 2020-03-14 03:08:05 nphardhttp… http://b… <chr [1]>     Blogger       
 2 Big Compu… tag:blo… 2020-03-14 03:08:05 nphardhttp… http://b… <chr [1]>     Blogger       
 3 Big Compu… tag:blo… 2020-03-14 03:08:05 nphardhttp… http://b… <chr [1]>     Blogger       
 4 Big Compu… tag:blo… 2020-03-14 03:08:05 nphardhttp… http://b… <chr [1]>     Blogger       
 5 Big Compu… tag:blo… 2020-03-14 03:08:05 nphardhttp… http://b… <chr [1]>     Blogger       
 6 Big Compu… tag:blo… 2020-03-14 03:08:05 nphardhttp… http://b… <chr [1]>     Blogger       
 7 Big Compu… tag:blo… 2020-03-14 03:08:05 nphardhttp… http://b… <chr [1]>     Blogger       
 8 Big Compu… tag:blo… 2020-03-14 03:08:05 nphardhttp… http://b… <chr [1]>     Blogger       
 9 Big Compu… tag:blo… 2020-03-14 03:08:05 nphardhttp… http://b… <chr [1]>     Blogger       
10 Big Compu… tag:blo… 2020-03-14 03:08:05 nphardhttp… http://b… <chr [1]>     Blogger       
# … with 15 more rows, and 8 more variables: entry_title <chr>, entry_url <chr>,
#   entry_last_updated <dttm>, entry_author <chr>, entry_content <chr>, entry_link <chr>,
#   entry_category <list>, entry_published <dttm>