EmilHvitfeldt / textdata

Download, parse, store, and load text datasets instead of storing it in packages
https://emilhvitfeldt.github.io/textdata/

Add NRC lexicon to textdata #11

Closed juliasilge closed 5 years ago

juliasilge commented 5 years ago

I worked on adding the NRC emotion lexicon to textdata this evening, to address #10 and other issues hanging out in tidytext and the tidytext book.

BAD NEWS 😩

The links on the NRC site are all http, not https.

The CRAN policies say this:

Downloads of additional software or data as part of package installation or startup should only use secure download mechanisms (e.g., ‘https’ or ‘ftps’).

What do you think our options are? Any ideas?

EmilHvitfeldt commented 5 years ago

The pull request looks great!

I think our best option would be to see if it is possible for them to provide https downloads. It might be a tall order, but the rest of their site is https, so I don't know. Or we could ask them whether some form of redistribution would be okay.

juliasilge commented 5 years ago

Yeah, that would be the best option; I'll write back via email and see if they are open to doing that.

juliasilge commented 5 years ago

Actually, I am reading the CRAN policies again:

Downloads of additional software or data as part of package installation or startup should only use secure download mechanisms (e.g., ‘https’ or ‘ftps’).

(Bolding is mine.) Could it be argued that these downloads are not part of installation or startup? They are prompted by the user instead. Thoughts? I could also loop in some rOpenSci or other experts on this.

EmilHvitfeldt commented 5 years ago

That is a good catch! I think it could be argued that it is not part of installation or startup. Furthermore, it would be trivial to include http/https/ftps information in the download prompt.

EmilHvitfeldt commented 5 years ago

I'm comfortable with including it under the new reading. If you want to loop someone in, I'll wait to merge.

juliasilge commented 5 years ago

I just posted on the rOpenSci Slack to see if anyone has had relevant experience. I imagine people won't see it until tomorrow. Let's wait to see if anybody has run into something similar or has insight, but from a plain reading, this does seem like it is in line with the policies.

The links currently say whether they are http or https to the user, but it may be good to call out the download method info more explicitly in the prompt.
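As a rough illustration of calling out the download mechanism in the prompt, here is a minimal base R sketch. The helpers `url_scheme()` and `download_notice()` are hypothetical names invented for this example and are not part of textdata, and the URL is a placeholder rather than the real NRC link.

```r
# Hypothetical helper, not part of textdata: extract the scheme of a URL.
url_scheme <- function(url) {
  # Capture everything before "://"; return NA when no scheme is present.
  m <- regmatches(url, regexec("^([A-Za-z][A-Za-z0-9+.-]*)://", url))[[1]]
  if (length(m) < 2) NA_character_ else tolower(m[2])
}

# Hypothetical helper: build the text a prompt could show before the user
# agrees to the download, stating the mechanism explicitly.
download_notice <- function(name, url) {
  scheme <- url_scheme(url)
  secure <- !is.na(scheme) && scheme %in% c("https", "ftps")
  paste0(
    "Dataset: ", name, "\n",
    "URL: ", url, "\n",
    "Download mechanism: ", scheme,
    if (secure) " (secure)" else " (not secure)"
  )
}

# Placeholder URL for illustration only.
cat(download_notice("NRC lexicon (placeholder)", "http://example.com/nrc.zip"), "\n")
```

Keeping the scheme check separate from the message text would make it easy to reuse the same notice for every dataset, whatever wording the real prompt ends up using.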

juliasilge commented 5 years ago

Expert-type folks on the rOpenSci Slack seem fairly unanimous that http should be OK in this situation, a download prompted by the user but not part of package installation. From my perspective, this is good to go (merge).

I do think the change suggested in #12 is still a good idea as well.

EmilHvitfeldt commented 5 years ago

Sounds good! I'll merge and get working on #12. Thanks! 🎉

fantasycz commented 5 years ago

@juliasilge Hi Juliasilge, I am wondering what I should change if I want to use get_sentiments("nrc"). So far, when I run this, it still throws the error, Error in match.arg(lexicon): 'arg' should be one of “afinn”, “bing”, “loughran”. Thanks!

juliasilge commented 5 years ago

Thanks for asking this question @fantasycz! 🙌

As of today, the NRC lexicon is available within tidytext again. You will need to install the development versions of both textdata and tidytext, and then all will work as before.

library(remotes)
install_github("EmilHvitfeldt/textdata")
install_github("juliasilge/tidytext")

After these are installed, you can access the NRC lexicon the same way you did previously:

library(tidytext)
get_sentiments("nrc")

We will get both of these updates on CRAN soon. 🎉

fantasycz commented 5 years ago

Hi Julia,

Thank you for your reply and for your work on the NRC lexicon. I tried what you suggested. I installed both packages:

install_github("EmilHvitfeldt/textdata")
install_github("juliasilge/tidytext")

Then I ran:

get_sentiments("nrc")

However, it still showed the error:

Error in match.arg(lexicon): 'arg' should be one of “afinn”, “bing”, “loughran”.

Below is my code, am I missing something?

one_star_word_df <- valence_df_na %>%
  filter(overallRating == '1') %>%
  unnest_tokens(word, headline, token = "words", format = "text") %>%
  inner_join(get_sentiments("nrc")) %>%
  rename(NRC_sentiment = sentiment) %>%
  mutate(NRC_sentiment = as.factor(NRC_sentiment))

The whole error is

Error in match.arg(lexicon): 'arg' should be one of “afinn”, “bing”, “loughran”

Traceback:

  1. valence_df_na %>% filter(overallRating == "1") %>% unnest_tokens(word, . headline, token = "words", format = "text") %>% inner_join(get_sentiments("nrc")) %>% . rename(NRC_sentiment = sentiment) %>% mutate(NRC_sentiment = as.factor(NRC_sentiment))
  2. withVisible(eval(quote(_fseq(_lhs)), env, env))
  3. eval(quote(_fseq(_lhs)), env, env)
  4. eval(quote(_fseq(_lhs)), env, env)
  5. _fseq(_lhs)
  6. freduce(value, _function_list)
  7. function_list[i]
  8. inner_join(., get_sentiments("nrc"))
  9. inner_join.data.frame(., get_sentiments("nrc"))
  10. as.data.frame(inner_join(tbl_df(x), y, by = by, copy = copy, . ...))
  11. inner_join(tbl_df(x), y, by = by, copy = copy, ...)
  12. inner_join.tbl_df(tbl_df(x), y, by = by, copy = copy, ...)
  13. check_valid_names(tbl_vars(y))
  14. tbl_vars(y)
  15. new_sel_vars(tbl_vars_dispatch(x), group_vars(x))
  16. structure(vars, groups = group_vars, class = c("dplyr_sel_vars", . "character"))
  17. tbl_vars_dispatch(x)
  18. get_sentiments("nrc")
  19. match.arg(lexicon)
  20. stop(gettextf("'arg' should be one of %s", paste(dQuote(choices), . collapse = ", ")), domain = NA)

Thank you very much.

Best,

Zhen

juliasilge commented 5 years ago

Hmmmm, sounds like you don't actually have the updated version of tidytext installed, because "nrc" is in fact one of the arguments again. Want to try installing again, with force = TRUE?

remotes::install_github("juliasilge/tidytext", force = TRUE)
remotes::install_github("EmilHvitfeldt/textdata", force = TRUE)
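To confirm which version is actually in use after reinstalling, a quick check along these lines may help. It is only a sketch: `accepted_choices()` is a hypothetical helper written for this example, and it relies on the fact (visible in the traceback above) that `get_sentiments()` validates `lexicon` with `match.arg()`, which reads the allowed values from the argument's default vector. Run it in a fresh R session, since a package that is already loaded stays in memory until R is restarted.

```r
# Hypothetical helper: list the values a function's argument accepts when the
# function validates it with match.arg(), i.e. the argument's default vector.
accepted_choices <- function(fn, arg) {
  eval(formals(fn)[[arg]])
}

# Only inspect tidytext if it is installed.
if (requireNamespace("tidytext", quietly = TRUE)) {
  # Which version does this session see?
  print(packageVersion("tidytext"))
  # Is "nrc" among the accepted lexicon names in that version?
  print(accepted_choices(tidytext::get_sentiments, "lexicon"))
}
```

If "nrc" is missing from the printed choices, the session is still picking up the older release of tidytext.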