html entities

patperry commented 7 years ago

There are lots of HTML entities in the useR2017 data. For example, useR2017[11] has two instances of &amp;. There are some other examples below.

If I want to analyze this data, I need to decode the entities (e.g., replace &amp; and &gt; with & and >). Do you think it makes sense for the tweetstorm package to do the decoding so that the end-user doesn't have to worry about it?

> corpus::text_locate(tweetstorm::useR2017, "&")
    text term before                       instance                        after
1   11   &    …ost #useR2017...very happy     &     amp; grateful I met so many…
2   11   &    …eful I met so many smiling     &     amp; talented persons.Time …
3   64   &                    @nj_tierney     &     lt;- c(🎉, 🍻, 🎤, 🎶) #use…
4   68   &     FYI #useR2017 participants     &     amp; @EANBoard @teachepi @m…
5   93   &    …grade after @useR_Brussels     &     amp; @pydataberlin thanks a…
6   101  &    …good, esp for one built in     &     lt;24 hrs. Kudos to Romain …
7   101  &    …lt;24 hrs. Kudos to Romain     &     amp; Shiny… https://t.co/lG…
8   109  &    …a lot from my twitter feed     &     amp; @romain_francois's exc…
9   123  &     After #useR2017 in Brussel     &     amp; Brisbane in 2018, we w…
10  124  &    …f DataCamp, RDocumentation     &     amp; translating… https://t…
11  125  &    …on processes to help sales     &     amp; customer success with …
12  132  &          .@RLadiesGlobal emoji     &     amp; 😻 collage. #RLadies #…
13  134  &                    🎉🌎 slides     &     amp; tutorial: "GeoSpatial …
14  142  &    …ay homage to the original %    &     gt;% #user2017 https://t.co…
15  163  &    …atthias Verbeke, @ZShkedy,     &     amp; @HeathrTurnr for all o…
16  185  &    …rab from the CRAN archives     &     amp; compile yourself!\n #u…
17  224  &    …nk about where we're going     &     amp; where we've been! Love…
18  276  &    …e you can make a messenger     &     amp; planner app using #rst…
⋮
(139 rows total)

romainfrancois commented 7 years ago

Sounds like a good idea

patperry commented 7 years ago

Do you want me to take care of it? Python has html.unescape. I'm not sure what the equivalent command is in R, but it'd be easy to hack together a regex to handle &(amp|gt|lt);, the only entities that show up in useR2017.
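
One way to check which entities actually occur before settling on a fixed list is to tally every entity-shaped token. The following is a minimal base-R sketch, not something from the package: the `tweets` vector is a made-up stand-in for the real text, and the pattern is an assumption that covers named, decimal, and hexadecimal entity forms.

```r
# Hypothetical sample standing in for the tweet text; not the real data.
tweets <- c(
    "very happy &amp; grateful",
    "x &lt;- c(1, 2)",
    "the original %&gt;% pipe &amp; friends",
    "no entities here"
)

# Match named (&amp;), decimal (&#38;) and hexadecimal (&#x26;) forms.
pattern <- "&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);"

# Collect every match across all strings into one character vector.
hits <- unlist(regmatches(tweets, gregexpr(pattern, tweets)))

# Frequency of each distinct entity, most common first.
counts <- sort(table(hits), decreasing = TRUE)
print(counts)
```

If the tally on the real data shows only &amp;, &lt; and &gt;, a three-entity regex is enough; anything else in the table would need its own replacement.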

romainfrancois commented 7 years ago

There's probably something already, perhaps in htmltools or in rmarkdown. Up to you. I might revisit this to cover the storm from JSM2017.

patperry commented 7 years ago

I'll let you handle it then. I didn't see an "unescape" function in htmltools or rmarkdown. Feel free to use or adapt the following:

unescape <- function(x)
{
    # Decode &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<".
    x <- gsub("&gt;", ">", x)
    x <- gsub("&lt;", "<", x)
    x <- gsub("&amp;", "&", x)
    x
}

Example usage:

> unescape(c("&lt;html&gt;", "nothing", "a, b, &amp; c"))
[1] "<html>"    "nothing"   "a, b, & c"
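
The order of the gsub calls in unescape matters: replacing &amp; last keeps a doubly escaped string such as "&amp;lt;" from being decoded twice. Below is a sketch of a single-pass alternative that does not depend on ordering, next to a deliberately wrong variant for comparison. Both `unescape_once` and `unescape_amp_first` are hypothetical names introduced here for illustration, and only the three entities discussed in this thread are handled.

```r
# Single-pass decoder: each match is looked up once in a table,
# so decoded output is never scanned again.
unescape_once <- function(x)
{
    entities <- c("&amp;" = "&", "&lt;" = "<", "&gt;" = ">")
    m <- gregexpr("&(amp|gt|lt);", x)
    regmatches(x, m) <- lapply(regmatches(x, m),
                               function(hit) unname(entities[hit]))
    x
}

# Counter-example: decoding &amp; first lets the result be decoded again.
unescape_amp_first <- function(x)
{
    x <- gsub("&amp;", "&", x)
    x <- gsub("&lt;", "<", x)
    x <- gsub("&gt;", ">", x)
    x
}

s <- "use &amp;lt; to write a literal &lt;"
unescape_once(s)       # "use &lt; to write a literal <"
unescape_amp_first(s)  # "use < to write a literal <"
```

On the example inputs above, unescape_once returns the same results as unescape; the two only need to be distinguished from the wrong-order variant when the text contains doubly escaped entities.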

ThinkR-open / tweetstorm

html entities #8