datawookie / feedeR

Handle RSS and Atom feeds from R

Unable to parse date #3

Closed LucianoSP closed 8 years ago

LucianoSP commented 8 years ago

http://www.valor.com.br/financas/mercados/rss

datawookie commented 8 years ago

Hi Luciano, did you have a problem with that feed?

LucianoSP commented 8 years ago

Yes, and with other feeds I tested here (Brazilian Portuguese). All of them give the error: Unable to parse date.

Other feeds:
http://feeds.folha.uol.com.br/mercado/rss091.xml
http://rss.uol.com.br/feed/economia.xml

datawookie commented 8 years ago

Aha! Okay, I'm already working on this issue. Have a look at Issue #2, which involves a German locale.

LucianoSP commented 8 years ago

Thanks! Your package is very helpful!

datawookie commented 8 years ago

No problem. I should have these date issues resolved soon. I am going to bed now, sorry, but I'll be back to fix this in 8 hours.

datawookie commented 8 years ago

Hi Luciano,

Okay, I am working on a local fix. Please try this out: feedeR_0.0.2.tar.gz.

Let me know how that goes.

Thanks, Andrew.

datawookie commented 8 years ago

Hi Luciano, I have resolved Issue #2. I think that you might find that the changes to the master branch will also resolve your problems. Let me know. Thanks, Andrew.

LucianoSP commented 8 years ago

Thanks Andrew. It worked for most of the feeds. I still found a problem in one of them:

feed.extract("http://feeds.folha.uol.com.br/mercado/rss091.xml")
Error: Unable to parse date.
In addition: Warning message:
All formats failed to parse. No formats found.

Also, for the feeds that worked, I'm having a problem with encoding (the feed is in Brazilian Portuguese). Do you think it's possible to resolve that?

Here is the sessionInfo():

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux stretch/sid

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
 [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.6 lubridate_1.5.6 XML_3.98-1.4 digest_0.6.10 dplyr_0.5.0.9000 withr_1.0.2
 [7] assertthat_0.1 bitops_1.0-6 R6_2.1.2 DBI_0.5 git2r_0.15.0 magrittr_1.5
[13] httr_1.2.1.9000 stringi_1.1.1 curl_1.1 devtools_1.12.0 tools_3.3.1 stringr_1.0.0
[19] RCurl_1.95-4.8 feedeR_0.0.3 memoise_1.0.0 tibble_1.1

datawookie commented 8 years ago

Hi Luciano,

I have added another date/time format to deal with that feed. Please install again from GitHub and confirm that this resolves your problem.
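For anyone following along, the general idea of trying several date/time formats in turn can be sketched in base R roughly as below. This is only an illustration, not the code in feedeR: the helper name parse_feed_date and the list of formats are made up for this example. Note that %a and %b match weekday and month names according to LC_TIME, which is one way a locale can get in the way.

```r
# Hypothetical helper (not part of feedeR): try a list of candidate
# formats and return the first successful parse as POSIXct in UTC.
parse_feed_date <- function(x,
                            formats = c("%a, %d %b %Y %H:%M:%S",  # RFC 822 style
                                        "%d %b %Y %H:%M:%S",      # no weekday
                                        "%Y-%m-%dT%H:%M:%S")) {   # ISO 8601 style
  for (fmt in formats) {
    parsed <- as.POSIXct(strptime(x, format = fmt, tz = "UTC"))
    # strptime() yields NA when the string does not match the format
    if (!is.na(parsed)) {
      return(parsed)
    }
  }
  stop("Unable to parse date.")
}

# Example calls with made-up date strings
parse_feed_date("2016-09-05T10:30:00")
parse_feed_date("05 Sep 2016 10:30:00")
```

Adding support for a new feed then amounts to appending one more format string to the list.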

Unfortunately the encoding of these feeds lies outside of the scope of this package at present. I'm really just focusing on accessing the data from the feeds. It'd be tricky to try and cater for all possible encodings. I think that in the first instance you'd need to try and handle this on a feed-by-feed basis.

If, however, you have an idea for how this might be incorporated into the package, please let me know and I'll see what I can do.

Thanks, Andrew.

LucianoSP commented 8 years ago

Thanks Andrew. The date parsing works perfectly now.

Regarding the encoding, maybe you could add a parameter to the function call (feed.extract) and pass it on to the XML package? I'm currently using a workaround along these lines: xmlParse(getURL(url, .encoding = "ISO-8859-2"))

If you could do something like (feed.extract(url, encoding)) maybe it could help.
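As a rough illustration of what an encoding parameter has to accomplish, the sketch below re-encodes text from a declared source encoding to UTF-8 with base R's iconv(). It does not use feedeR, XML, or RCurl; the helper name to_utf8 and the sample bytes are assumptions made up for this example.

```r
# Hypothetical helper (not part of feedeR): convert text from a
# declared source encoding to UTF-8 using base R's iconv().
to_utf8 <- function(x, from) {
  iconv(x, from = from, to = "UTF-8")
}

# The bytes of "São Paulo" as an ISO-8859-1 feed would send them:
# 0xE3 is the single-byte code for "ã" in that encoding.
raw_bytes <- as.raw(c(0x53, 0xE3, 0x6F, 0x20, 0x50, 0x61, 0x75, 0x6C, 0x6F))
latin1_text <- rawToChar(raw_bytes)

utf8_text <- to_utf8(latin1_text, from = "ISO-8859-1")

# In UTF-8 the "ã" becomes the two bytes 0xC3 0xA3.
charToRaw(utf8_text)
```

If the wrong source encoding is declared, the conversion usually still succeeds but produces the wrong characters, which is why the encoding has to be chosen per feed.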

datawookie commented 8 years ago

Okay, have a look at the repository now. I've added an encoding argument to feed.extract().

LucianoSP commented 8 years ago

Working perfectly now! Thanks a lot Andrew. Your package will be very useful!!

datawookie commented 8 years ago

Cool. Let me know if there are any other suggestions.