Open patperry opened 7 years ago
Sounds like a good idea
Do you want me to take care of it? Python has `html.unescape`. I'm not sure what the equivalent command is in R, but it'd be easy to hack together a regex to handle `&(amp|gt|lt);`, the only entities that show up in `useR2017`.
There's probably something already, perhaps in `htmltools` or in `rmarkdown`.
Up to you. I might revisit this to cover the storm from JSM2017.
I'll let you handle it then. I didn't see an "unescape" function in `htmltools` or `rmarkdown`. Feel free to use or adapt the following:
unescape <- function(x)
{
    # Decode "&amp;" last so that input like "&amp;lt;" becomes "&lt;", not "<".
    x <- gsub("&gt;", ">", x)
    x <- gsub("&lt;", "<", x)
    x <- gsub("&amp;", "&", x)
    x
}
Example usage:
> unescape(c("&lt;html&gt;", "nothing", "a, b, &amp; c"))
[1] "<html>"    "nothing"   "a, b, & c"
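For what it's worth, the single-regex idea mentioned earlier in the thread could look roughly like the sketch below. It is base R only and assumes the three entities `&amp;`, `&gt;`, and `&lt;` are the only ones that need decoding. The name `unescape_once` is made up for illustration; it is not a function from `tweetstorm`, `htmltools`, or `rmarkdown`.

```r
# Hypothetical helper: decode &amp;, &gt; and &lt; in a single pass.
unescape_once <- function(x)
{
    # Lookup table from entity to literal character (assumed to be complete).
    entities <- c("&amp;" = "&", "&gt;" = ">", "&lt;" = "<")

    # Locate every entity in every element of x.
    m <- gregexpr("&(amp|gt|lt);", x)

    # Replace each match with its decoded character; elements without a
    # match get an empty replacement vector and are left unchanged.
    regmatches(x, m) <- lapply(regmatches(x, m),
                               function(hit) unname(entities[hit]))
    x
}
```

Because each match is replaced exactly once, the result does not depend on the order of the substitutions: a doubly escaped input such as `"&amp;lt;"` decodes one level, to `"&lt;"`.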
There are lots of HTML entities in the `useR2017` data. For example, `useR2017[11]` has two instances of `&amp;`. There are some other examples below.
If I want to analyze this data, I need to decode the entities (e.g., replace `&amp;` and `&gt;` with `&` and `>`). Do you think it makes sense for the `tweetstorm` package to do the decoding so that the end-user doesn't have to worry about it?