RobertMyles / tidyRSS

An R package for extracting 'tidy' data frames from RSS, Atom and JSON feeds
https://robertmyles.github.io/tidyRSS/

tidyRSS fails to parse feeds: "xmlXPathEval: evaluation failed" #31

Closed by alastairrushworth 4 years ago

alastairrushworth commented 4 years ago

Hi @RobertMyles

Thanks for the amazing tidyRSS package; I find it very useful indeed! I'm filing a quick issue because I've noticed that quite a number of feeds don't parse correctly.

For example:

# tested with v1.2.11
library(tidyRSS)
tidyfeed("http://abigailsee.com/feed.xml")

Returns the error:

Error in xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = 1) : 
  xmlXPathEval: evaluation failed

I think the feed itself is fine, and tidyfeed seems to fetch it without trouble, but something goes awry during parsing. I've noticed the same issue with several other feeds, listed below:

feed_vec <- 
  c("http://abigailsee.com/feed.xml",
    "https://adamgoodkind.com/feed.xml",
    "http://adomingues.github.io/feed.xml",
    "http://aebou.rbind.io/index.xml",
    "http://agrarianresearch.org/blog/?feed=rss2",
    "http://akosm.netlify.com/index.xml",
    "http://alburez.me/feed.xml",
    "http://alexmorley.me/feed.xml",
    "https://alexwhan.com/index.xml",
    "http://allthingsr.blogspot.com/feeds/posts/default?alt=rss",
    "http://allthiswasfield.blogspot.com/feeds/posts/default?alt=rss",
    "http://almostrandom.netlify.com/index.xml",
    "http://altran-data-analytics.netlify.com/index.xml",
    "https://www.amitkohli.com/index.xml",
    "http://analisisydecision.es/feed/",
    "http://andysouth.github.io/feed.xml",
    "http://annakrystalli.me/index.xml",
    "http://annarborrusergroup.github.io/feed.xml",
    "http://anotherblogaboutr.blogspot.com/feeds/posts/default?alt=rss",
    "http://anpefi.eu/index.xml",
    "https://fishandwhistle.net/index.xml",
    "https://www.ardata.fr/index.xml",
    "http://arnab.org/blog/atom.xml",
    "http://arunatma.blogspot.com/feeds/posts/default?alt=rss",
    "http://asbcllc.com/feed.xml",
    "http://ashiklom.github.io/feed.xml",
    "http://aurielfournier.github.io/feed.xml",
    "http://austinwehrwein.com/index.xml")

I'm working on a side project at the moment that involves about 3K RSS feeds. I'm happy to share the list once I've tidied it up a bit; it might help with identifying other edge cases, as I know how finicky RSS feeds can be! I'm also happy to help with this issue if you can point me in the right direction.

Thanks,

Alastair

RobertMyles commented 4 years ago

Hi Alastair,

Yeah, RSS feeds can be a pain, and I've seen this error a few times. I'm not sure off the top of my head where exactly it pops up. I'll have a look as soon as I can, but if you're interested in contributing, it's probably happening in one of the *_parse functions. I'm trying to clean up a lot of little things in the package for a 1.3 version, so your list will help a lot. In the meantime, I'll leave this open until I can figure out the source of the error.

Rob

RobertMyles commented 4 years ago

I had a quick chance to look at this today and with the dev version I'm getting:

> tidyfeed("http://abigailsee.com/feed.xml")
# A tibble: 5 x 5
  feed_link    item_title          item_date_published item_description                 item_link             
  <chr>        <chr>               <dttm>              <chr>                            <chr>                 
1 http://abig… What makes a good … 2019-08-13 00:00:00 "<!--excerpt.start-->\n<p><em>T… http://abigailsee.com…
2 http://abig… Deep Learning, Str… 2018-02-21 00:00:00 "<!--excerpt.start-->\n<html>\n… http://abigailsee.com…
3 http://abig… Four deep learning… 2017-08-30 00:00:00 "<head>\n<script src=\"https://… http://abigailsee.com…
4 http://abig… Four deep learning… 2017-08-30 00:00:00 "<head>\n<script src=\"https://… http://abigailsee.com…
5 http://abig… Taming Recurrent N… 2017-04-16 00:00:00 "<!--excerpt.start-->\n<p><em>T… http://abigailsee.com…

With your vector of feeds, I get:

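library(tidyRSS)

# Wrap tidyfeed() once so that errors are captured instead of stopping the loop
stfeed <- purrr::safely(tidyfeed)

# walk() is used because only the printed status matters, not a return value
purrr::walk(feed_vec, ~ {
  ret <- stfeed(.x)
  if (is.null(ret$error)) {
    print("Feed OK")
  } else {
    # any error lands here, whether the feed is unreachable or fails to parse
    print("Feed unavailable")
  }
})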

[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed unavailable"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed unavailable"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed unavailable"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
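
The check above only reports whether each feed parsed. A minimal sketch of a variant that also keeps the error message for every failing feed, which may help tell dead links apart from parsing failures, could look like this. Note that `check_feeds` and its `parser` argument are hypothetical names introduced here for illustration and are not part of tidyRSS; `parser` simply defaults to `tidyRSS::tidyfeed`.

```r
# Hypothetical helper (not part of tidyRSS): run a parser over each feed URL
# and record whether it succeeded and, if it failed, the error message.
check_feeds <- function(feeds, parser = tidyRSS::tidyfeed) {
  results <- lapply(feeds, function(url) {
    tryCatch(
      {
        parser(url)
        data.frame(feed = url, ok = TRUE, error = NA_character_,
                   stringsAsFactors = FALSE)
      },
      error = function(e) {
        # Keep the message so failures can be grouped by cause afterwards
        data.frame(feed = url, ok = FALSE, error = conditionMessage(e),
                   stringsAsFactors = FALSE)
      }
    )
  })
  # One row per feed
  do.call(rbind, results)
}
```

With the vector from the original report, `status <- check_feeds(feed_vec)` followed by `status[!status$ok, ]` would list only the failing feeds, and `table(status$error)` would group them by message.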

I'll try to get 1.3 finished as soon as possible; as you can see, it fixes most of these problems. In the meantime, the dev version should help if you'd like to try it.
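
For anyone following along, one way to try the dev version is sketched below. It assumes the development code is what currently sits on the default branch of the GitHub repository and that the `remotes` package is acceptable for installing it; neither detail is stated in this thread.

```r
# Assumption: the dev version is the default branch of the GitHub repo.
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("RobertMyles/tidyRSS")

# Confirm which version is now installed
packageVersion("tidyRSS")
```

Restarting the R session after installing helps make sure the newly installed version is the one that gets loaded.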

alastairrushworth commented 4 years ago

Hi Rob - that's perfect, I think that fixes it completely for me. Thanks a lot for that, I'll stick to dev until 1.3.

I'll drop you a note when I've got the long list of feeds tidied up, in case it can help.

Cheers!

RobertMyles commented 4 years ago

That would be a help, thanks Alastair.