RobertMyles / tidyRSS

An R package for extracting 'tidy' data frames from RSS, Atom and JSON feeds
https://robertmyles.github.io/tidyRSS/

Identification of feed type #76

Closed tombroekel closed 7 months ago

tombroekel commented 11 months ago

Hi,

Currently, type_check() determines the feed type entirely from the content-type response header (response$headers$`content-type`). For some feeds this is misleading, e.g. https://www.tagesschau.de/infoservices/alle-meldungen-100~atom.xml, which is wrongly classified as RSS. A workaround is to also consider the information contained in the URL (see below), but that is a rather specific solution, so I didn't add it directly. Maybe you have a better idea.

    # case_when() comes from dplyr
    content_type <- response$headers$`content-type`
    url_type <- response$url
    typ <- case_when(
      grepl(x = url_type, pattern = "atom") ~ "atom",  # new: check the URL first
      grepl(x = content_type, pattern = "xml") ~ "rss",
      grepl(x = content_type, pattern = "html") ~ "rss",
      grepl(x = content_type, pattern = "atom") ~ "atom",
      grepl(x = content_type, pattern = "rss") ~ "rss",
      grepl(x = content_type, pattern = "json") ~ "json",
      TRUE ~ "unknown"
    )

RobertMyles commented 7 months ago

Hi Tom, thank you, but this does seem too specific. I'm open to improving type_check(), but checking the content-type seems pretty standard. I'll close for now, but if you have a general solution, I'm definitely open to exploring that.
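
One possible direction for a more general check, offered only as a rough sketch and not as anything tidyRSS implements: look at the document body instead of the header or the URL. Atom documents have a `feed` root element, RSS 2.0 uses `rss`, RSS 1.0 uses `rdf:RDF`, and JSON Feed documents are JSON objects. The function name `guess_feed_type()` is hypothetical, it uses base R only, and the regex-based root detection is a simplification that a real implementation would probably replace with a proper XML parser.

```r
# Hypothetical helper (not part of tidyRSS): guess the feed type from the
# response body rather than from the content-type header or the URL.
guess_feed_type <- function(body) {
  txt <- trimws(body)

  # JSON Feed documents are JSON objects, so they start with "{"
  if (startsWith(txt, "{")) return("json")

  # Remove processing instructions (e.g. the XML declaration) and comments,
  # so that tags mentioned inside them are not mistaken for the root element
  txt <- gsub("(?s)<\\?.*?\\?>|<!--.*?-->", "", txt, perl = TRUE)

  # The first tag whose name starts with a letter is taken as the root element
  root <- regmatches(txt, regexpr("<[A-Za-z_][^\\s>/]*", txt, perl = TRUE))
  if (length(root) == 0) return("unknown")

  root <- tolower(sub("^<", "", root))
  root <- sub("^.*:", "", root)  # drop a namespace prefix, e.g. rdf:RDF -> rdf

  switch(root,
    feed = "atom",  # Atom
    rss  = "rss",   # RSS 2.0
    rdf  = "rss",   # RSS 1.0 (rdf:RDF root)
    "unknown"
  )
}
```

Assuming the response comes from httr, the body could be passed in as `guess_feed_type(httr::content(response, as = "text", encoding = "UTF-8"))`, with the content-type check kept as a fallback when the result is "unknown".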