Watts-College / paf-514-template

https://watts-college.github.io/paf-514-template/

Lab 4 Clarification and Question #80

Open lindaalvarez opened 1 week ago

lindaalvarez commented 1 week ago

Hello, I have three questions/clarifications that I wanted to run by you @castower

  1. When I start part 2 and run this code:

    library(quanteda)

    # Convert mission statements to lowercase
    dat$mission <- tolower(dat$mission)

    # Create a corpus from the full dataset
    corp <- corpus(dat, text_field = "mission")
    corp

My texts 4 and 5 show up like this, and I am not sure why:

    text4 : " "
    text5 : " "

  2. I wanted to clarify that we will be replacing the 10 terms in my_dictionary with 10 terms that we see fit for the lab, correct?

  3. When I run the tokens code I get an error, and I'm not quite sure how to fix it:

    tokens %>% dfm_stem( stem=F ) %>% topfeatures( )
    tokens %>% dfm( stem=F ) %>% topfeatures( )

    Error in dfm_stem(., stem = F) : could not find function "dfm_stem"

Thank you so much for your time and help

castower commented 6 days ago

Hello @lindaalvarez ,

Thanks for your questions!

  1. Rows 4 and 5 are blank in the original data set so this persists in the corpus. You can remove them if you'd like.

See: dat$mission %>% head()

  2. You will want to add 10 additional terms to my_dictionary and keep the ones from the lab.
  3. You can use the dfm_wordstem() method from the codethrough tutorial. Note that this example finds the top 50 stems from the Anne of Green Gables series; you'll want to find the top 10 from the mission tokens:

anne_corpus_tokens %>% dfm() %>% dfm_wordstem() %>% topfeatures( 50 )
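If you do decide to remove the blank rows from item 1, here is a minimal sketch of one way to do it before building the corpus. The toy data frame and the names `has_text` and `dat_clean` are made up for illustration; only the `mission` column and the `corpus()` call come from the lab code above.

```r
library(quanteda)

# Toy stand-in for the lab data (hypothetical values);
# rows 4 and 5 have blank mission statements, as in the real file.
dat <- data.frame(
  name = paste("org", 1:6),
  mission = c("to feed families", "to house veterans", "to teach children",
              " ", "", "to protect rivers"),
  stringsAsFactors = FALSE
)

# Keep only rows whose mission contains some non-whitespace text.
has_text <- !is.na(dat$mission) & nzchar(trimws(dat$mission))
dat_clean <- dat[has_text, ]

# Build the corpus from the cleaned data, as in the lab code.
corp <- corpus(dat_clean, text_field = "mission")
ndoc(corp)
```

Filtering the data frame first, rather than the corpus, keeps the other columns in step with the texts that remain.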

Let me know if you have any other questions!

lindaalvarez commented 6 days ago

thank you so much for your help! I went ahead and changed it and now it's working (:

lindaalvarez commented 6 days ago

@castower I am now on the P3-Q2 part, where we are supposed to create a random sample of 20 organizations, but I keep getting this error:

    sample <- dplyr::sample_n( d.immigrant, 20 )
    print(sample)

    Warning in gsub("^\s+|\s+$", "", y) :
      unable to translate 'Arts, Culture, and Humanities: Arts, Cultural Organizations\u0080\u0094Multipurpose ' to a wide string
    Error in gsub("^\s+|\s+$", "", y) : input string 20 is invalid

castower commented 6 days ago

@lindaalvarez

You can email me your .RMD file and I'll take a look.
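In the meantime, this kind of "input string is invalid" error often points to text with bytes that are not valid in the current encoding. That is only a guess about the cause here. Under that assumption, a minimal base R sketch is to drop the invalid bytes from every character column before sampling or printing. The toy data frame `d` and its `activity` column are hypothetical stand-ins for d.immigrant, and `iconv()` substitution can behave differently across platforms.

```r
# Toy stand-in for d.immigrant (hypothetical); the second value ends in a
# truncated multibyte sequence, so it is not valid UTF-8.
d <- data.frame(
  name = c("Org A", "Org B"),
  activity = c("Arts, Cultural Organizations", "Multipurpose \xe2\x80"),
  stringsAsFactors = FALSE
)

# Which values are valid UTF-8 before cleaning?
validUTF8(d$activity)

# Re-encode every character column, dropping bytes that cannot be converted.
is_chr <- vapply(d, is.character, logical(1))
d[is_chr] <- lapply(d[is_chr], function(x) {
  iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
})

# The whitespace trim from the error message should now run cleanly.
trimmed <- gsub("^\\s+|\\s+$", "", d$activity)
```

This silently discards the offending characters, so it is worth checking the affected rows afterward to see what was lost.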

swest235 commented 5 days ago

@castower I'm hoping for clarification on part 2 tabulating and stemming chunks. If I am not mistaken, we are to use something like this to tabulate words, is that right?

tokens %>% dfm() %>% dfm_wordstem() %>% topfeatures(10)

My second, related question is about the 'stemming' section. It uses the same code chunk. Is that just to reiterate stemming, or is there supposed to be a different result when we run it?

I'm assuming it was to reiterate how stemming works but wanted to confirm before I turn the lab in.

Thank you in advance.

castower commented 5 days ago

@swest235

  1. Yes, that's correct; it will find the top features.
  2. Yes, that's also correct. You don't need to run both, but if you'd like to contrast both you can find the top features for the non-stems with the following:

tokens %>% dfm() %>% topfeatures(10)
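To see why the two results can differ, here is a small sketch on made-up text (not the lab data). The names `txt`, `toks`, `raw_top`, and `stem_top` are illustrative only, and the steps are written without the pipe so the block does not depend on it being loaded.

```r
library(quanteda)

# Two made-up sentences with several forms of the same words.
txt <- c("helping families and helping a family in need",
         "we help communities; the community helps us")
toks <- tokens(txt, remove_punct = TRUE)

# Document-feature matrix of the raw (unstemmed) tokens.
raw_dfm <- dfm(toks)
raw_top <- topfeatures(raw_dfm, 3)

# Same matrix after stemming: "helping", "help", and "helps"
# are all counted under the single stem "help".
stem_dfm <- dfm_wordstem(raw_dfm)
stem_top <- topfeatures(stem_dfm, 3)

raw_top
stem_top
```

Without stemming, each word form is counted separately; with stemming, related forms are pooled, so the ranking of top features can change.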