RobertMyles / tidyRSS

An R package for extracting 'tidy' data frames from RSS, Atom and JSON feeds
https://robertmyles.github.io/tidyRSS/

reading Date for French #37

Closed boussouf closed 4 years ago

boussouf commented 4 years ago

Hello,

pubDate values in French format are sometimes not parsed by tidyRSS. Unfortunately, the returned column is NA, so this information is lost.

Example: View(tidyfeed("https://www.valdemarne.fr/rss.xml"))

Thank you

RobertMyles commented 4 years ago

Thanks for the bug report, @boussouf. That's an interesting problem, see here. I've been using anytime() to parse dates, whereas before v2 I used lubridate. I might go back to lubridate, although this feed would have failed under previous versions of tidyRSS too, since the French locale has to be passed explicitly.

library(xml2)
library(httr)
library(magrittr)
library(anytime)

rss <- "https://www.valdemarne.fr/rss.xml"

GET(rss) %>% 
  read_xml() %>% 
  xml_find_all("channel") %>% 
  xml_find_all("item") %>% 
  xml_find_all("pubDate") %>% 
  xml_text()
#>  [1] "Vendredi, 13 Mars, 2020 - 12:22"    "Lundi, 24 Février, 2020 - 15:42"   
#>  [3] "Vendredi, 21 Février, 2020 - 14:17" "Vendredi, 21 Février, 2020 - 11:45"
#>  [5] "Mardi, 18 Février, 2020 - 11:00"    "Vendredi, 14 Février, 2020 - 13:22"
#>  [7] "Mardi, 11 Février, 2020 - 16:24"    "Mardi, 11 Février, 2020 - 16:01"   
#>  [9] "Mardi, 11 Février, 2020 - 14:20"    "Mardi, 11 Février, 2020 - 11:40"

GET(rss) %>% 
  read_xml() %>% 
  xml_find_all("channel") %>% 
  xml_find_all("item") %>% 
  xml_find_all("pubDate") %>% 
  xml_text() %>% 
  anytime()
#>  [1] NA NA NA NA NA NA NA NA NA NA

GET(rss) %>% 
  read_xml() %>% 
  xml_find_all("channel") %>% 
  xml_find_all("item") %>% 
  xml_find_all("pubDate") %>% 
  xml_text() %>% 
  lubridate::parse_date_time("dmy hm", locale = "fr_FR.UTF-8")
#> Warning: hms, hm and ms usage is deprecated, please use HMS, HM or MS instead.
#> Deprecated in version '1.5.6'.
#>  [1] "2020-03-13 12:22:00 UTC" "2020-02-24 15:42:00 UTC"
#>  [3] "2020-02-21 14:17:00 UTC" "2020-02-21 11:45:00 UTC"
#>  [5] "2020-02-18 11:00:00 UTC" "2020-02-14 13:22:00 UTC"
#>  [7] "2020-02-11 16:24:00 UTC" "2020-02-11 16:01:00 UTC"
#>  [9] "2020-02-11 14:20:00 UTC" "2020-02-11 11:40:00 UTC"

Created on 2020-02-26 by the reprex package (v0.3.0)

For the moment, I don't have a quick fix for this, though I'll have something in version 2.0.1. That will be here on GitHub in the next few weeks, but it will take a while to reach CRAN, as I don't want to spam them with releases.
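
In the meantime, one possible workaround is to parse these strings by hand instead of relying on a French locale being installed on the machine. The sketch below is only that: `parse_french_pubdate()` is a hypothetical helper, not part of tidyRSS, and it assumes every date follows the "Weekday, day Month, year - HH:MM" layout seen in this feed. It uses base R only and returns NA for anything that does not match.

```r
# Hypothetical helper (not part of tidyRSS): parse pubDate strings such as
# "Vendredi, 13 Mars, 2020 - 12:22" without needing a French locale.
parse_french_pubdate <- function(x, tz = "UTC") {
  # French month names; \u escapes keep the source file encoding-independent.
  months_fr <- c("janvier", "f\u00e9vrier", "mars", "avril", "mai", "juin",
                 "juillet", "ao\u00fbt", "septembre", "octobre", "novembre",
                 "d\u00e9cembre")
  # Capture groups: day, month name, year, hour, minute (weekday is ignored).
  pattern <- paste0("^[^,]*,\\s*([0-9]{1,2})\\s+([^ ,]+),\\s*([0-9]{4})",
                    "\\s*-\\s*([0-9]{1,2}):([0-9]{2})\\s*$")
  parts <- regmatches(x, regexec(pattern, x))
  iso <- vapply(parts, function(p) {
    if (length(p) != 6) return(NA_character_)  # layout did not match
    month <- match(tolower(p[3]), months_fr)
    if (is.na(month)) return(NA_character_)    # unknown month name
    sprintf("%s-%02d-%02d %02d:%s", p[4], month, as.integer(p[2]),
            as.integer(p[5]), p[6])
  }, character(1))
  # Assumption: the feed gives no time zone, so UTC is used by default.
  as.POSIXct(iso, format = "%Y-%m-%d %H:%M", tz = tz)
}

dates <- parse_french_pubdate(c("Vendredi, 13 Mars, 2020 - 12:22",
                                "Lundi, 24 F\u00e9vrier, 2020 - 15:42"))
```

Applied to the `xml_text()` output above, this should give a POSIXct vector comparable to the lubridate result. Feeds that drop the accents ("Fevrier") or use another layout would still come back as NA, so this is not a general fix.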

RobertMyles commented 4 years ago

Tracking this here: https://github.com/RobertMyles/tidyRSS/projects/3