RobertMyles / tidyRSS

An R package for extracting 'tidy' data frames from RSS, Atom and JSON feeds
https://robertmyles.github.io/tidyRSS/

Error: Tibble columns must have compatible sizes. #53

Closed werkstattcodes closed 4 years ago

werkstattcodes commented 4 years ago

Hi Robert,

Many thanks for this very helpful package! I read that you welcome users to submit feeds that don't work with tidyRSS.

I am currently struggling to get the feed below working. (I might be mistaken, but maybe it's related to the xml:space="preserve" attribute on the description element?) Here's the link to the page offering the RSS feed.

Many thanks!

library(tidyRSS)

rss_link <- "https://www.parlament.gv.at/PAKT/VHG/XXVII/ME/ME_00055/filter.psp?view=RSS&jsMode=&xdocumentUri=&filterJq=&view=&GP=XXVII&ITYP=ME&INR=55&SUCH=&listeId=142&FBEZ=FP_142"
tidyRSS::tidyfeed(rss_link)
#> GET request successful. Parsing...
#> Error: Tibble columns must have compatible sizes.
#> * Size 6623: Existing data.
#> * Size 19869: Column `item_description`.
#> i Only values of size one are recycled.

Created on 2020-09-23 by the reprex package (v0.3.0)

RobertMyles commented 4 years ago

Hi Roland,

Thanks for the issue, it's helpful to see what problems people run into in the wild. Appreciate the links and reprex too.

I haven't come across xml:space="preserve" before, and it seems to be widely ignored in practice, but you could be right that it's causing the issue. The item descriptions look quite regular from what I can see, so tidyRSS should be able to parse them. I'll see if I can strip the attribute using xml2 when I get a chance, hopefully this weekend. Until then, I'll leave this open.

Thanks, Rob

RobertMyles commented 4 years ago

So this is happening because xml:space="preserve" results in a list of descriptions where each real entry is followed by whitespace-only entries, like this:

 [995] "\n\n\nAktualisierung:\n18.09.2020\n<br />\nArt:\nMinisterialentwurf Gesetz\n<br />\nNr.:\n6341/SN-55/ME\n<br />\n\n\n\n\n"
 [996] "\n"
 [997] "\n"
 [998] "\n\n\nAktualisierung:\n18.09.2020\n<br />\nArt:\nMinisterialentwurf Gesetz\n<br />\nNr.:\n6340/SN-55/ME\n<br />\n\n\n\n\n"
 [999] "\n"
[1000] "\n"

So I guess one option is to write a custom function that looks for list entries of length 1 and removes them if they contain only "\n". That seems like a very special-case fix for an XML option I've never seen before. I'll have a think and see if there's a more generic way to deal with it.
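As a rough illustration of that option, here is a minimal base R sketch. The helper name drop_whitespace_only is hypothetical and is not part of tidyRSS, and the sample vector only mimics the output above. It treats any entry that is empty after trimming whitespace as noise, which is slightly broader than matching "\n" alone. Note that 19869 is exactly 3 x 6623, which is consistent with each real description being followed by two whitespace-only entries.

```r
# Hypothetical helper (not part of tidyRSS): drop entries that contain
# nothing but whitespace, such as the stray "\n" strings shown above.
drop_whitespace_only <- function(x) {
  x <- unlist(x, use.names = FALSE)   # accept a list or a character vector
  x[nzchar(trimws(x))]                # keep entries with some visible text
}

# Sample data mimicking the parsed item descriptions
descriptions <- list(
  "\n\n\nAktualisierung:\n18.09.2020\n<br />\nNr.:\n6341/SN-55/ME\n<br />\n\n",
  "\n",
  "\n",
  "\n\n\nAktualisierung:\n18.09.2020\n<br />\nNr.:\n6340/SN-55/ME\n<br />\n\n",
  "\n",
  "\n"
)

cleaned <- drop_whitespace_only(descriptions)
length(descriptions)  # 6 raw entries
length(cleaned)       # 2 entries remain, one per feed item
```

Whether a filter like this is safe in general depends on the feed: an item with a genuinely empty description would also be dropped, which would again leave the column lengths out of step.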

werkstattcodes commented 4 years ago

Many thanks for looking into this! For my own purposes, it would be totally fine to have an option to simply exclude the specified element (description).

RobertMyles commented 4 years ago

Hi Roland, there's a fix for this in the desc_fix branch. The fix I have messes up other things in tidyRSS, so I'm not going to include it in the package until I have a better way of dealing with it. Until then, you can clone that branch and use that version of the package. It should clear up this problem; if it doesn't, let me know.
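As an alternative to cloning by hand, a branch can usually be installed straight from GitHub. This is only a sketch: it assumes the remotes package is installed, that the repository is RobertMyles/tidyRSS, and that the desc_fix branch still exists.

```r
# Assumes the remotes package is available: install.packages("remotes")
# Installs tidyRSS from the desc_fix branch rather than the default branch.
remotes::install_github("RobertMyles/tidyRSS", ref = "desc_fix")

library(tidyRSS)
# rss_link is the feed URL from the reprex at the top of this issue
tidyfeed(rss_link)
```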

Closing because this seems to be a very rare problem.

werkstattcodes commented 4 years ago

Many thanks! Really appreciated!