RobertMyles / tidyRSS

An R package for extracting 'tidy' data frames from RSS, Atom and JSON feeds
https://robertmyles.github.io/tidyRSS/

tidyfeed fails when description contains HTML comments #57

Closed WilDoane closed 2 years ago

WilDoane commented 2 years ago

Reprex

tidyRSS::tidyfeed("https://www.sciencedaily.com/rss/top/science.xml")

Explanation

Some CMSs include the HTML comment <!-- more --> in the RSS item description field to mark the split between content above and below the fold.

<description>On April 28, 2021, NASA&#039;s Parker Solar Probe reached the sun&#039;s extended solar atmosphere, known as the corona, and spent five hours there. The spacecraft is the first to enter the outer boundaries of our sun. <!-- more --></description>

This causes tidyRSS:::rss_parse to return two entries for each description (120 values for 60 items in this feed), so the column lengths no longer match when the function builds its result with tibble::tibble().
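
To make the mismatch concrete, here is a minimal base-R sketch. It does not use the real tidyRSS internals: the data are invented stand-ins for parsed items, and `collapse_description()` is a hypothetical helper. It only assumes, as described above, that each description ends up as two pieces instead of one.

```r
# Hypothetical stand-in for parsed feed items: each description arrives as
# two pieces (the text and the trailing HTML comment).
item_titles <- c("Probe reaches the corona", "Second headline")
item_descriptions <- list(
  list("NASA's Parker Solar Probe reached the corona. ", "<!-- more -->"),
  list("Another summary. ", "<!-- more -->")
)

# Flattening across items yields 4 values for 2 items, the same kind of
# length mismatch the error reports (60 items, 120 descriptions).
flattened <- unlist(item_descriptions)
length(flattened)

# collapse_description() is a hypothetical helper: drop the comment pieces,
# then paste what is left into one string so each item yields one value.
collapse_description <- function(pieces) {
  pieces <- unlist(pieces)
  pieces <- pieces[!grepl("^<!--.*-->$", pieces)]
  trimws(paste(pieces, collapse = ""))
}

descriptions <- vapply(item_descriptions, collapse_description, character(1))

# The lengths now agree, so the columns can be combined into one table.
feed <- data.frame(
  item_title = item_titles,
  item_description = descriptions,
  stringsAsFactors = FALSE
)
```

Collapsing per item, rather than flattening across all items, keeps one description per row whatever extra child nodes a feed happens to include. The fix shipped in the package may well take a different approach.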

See https://github.com/RobertMyles/tidyRSS/blob/d5b223b9328dc2aaa26f87945801bfa364fc5fc3/R/rss_parse.R#L41

Example feed URL: https://www.sciencedaily.com/rss/top/science.xml

> tidyRSS::tidyfeed("https://www.sciencedaily.com/rss/top/science.xml")
GET request successful. Parsing...

Error: Tibble columns must have compatible sizes.
* Size 60: Existing data.
* Size 120: Column `item_description`.
ℹ Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.

> rlang::last_error()
<error/tibble_error_incompatible_size>
Tibble columns must have compatible sizes.
* Size 60: Existing data.
* Size 120: Column `item_description`.
ℹ Only values of size one are recycled.
Backtrace:
 1. tidyRSS::tidyfeed("https://www.sciencedaily.com/rss/top/science.xml")
 2. tidyRSS:::rss_parse(response, list, clean_tags, parse_dates)
 3. tibble::tibble(...)
 4. tibble:::tibble_quos(xs, .rows, .name_repair)
 5. tibble:::vectbl_recycle_rows(res, first_size, j, given_col_names[[j]])

Observed Under

 version  R version 4.1.2 (2021-11-01)
 os       macOS Monterey 12.0.1
 system   x86_64, darwin17.0
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2021-12-16
 rstudio  2022.02.0-daily+324 Prairie Trillium (desktop)
 pandoc   NA

 tibble        3.1.6   2021-11-07 [2] CRAN (R 4.1.0)
 tidyRSS     * 2.0.4   2021-10-07 [2] CRAN (R 4.1.0)

RobertMyles commented 2 years ago

Hi William, apologies, not sure how this slipped under my radar! I'll have a look at this asap and see what I can adjust.

RobertMyles commented 2 years ago

Closed; fixed in v2.0.5.

RobertMyles commented 2 years ago

v2.0.6 is on its way to CRAN -- thanks for the help, @chainsawriot!