DS4PS / cpp-527-fall-2020


Lab 3- Question 2 #17

Open malmufre opened 3 years ago

malmufre commented 3 years ago

Hello @lecy

I am attempting to solve question 2 in lab 3; however, I am not able to find the right function that will give me the most common words in the titles. I tried searching for ways and found the tidytext package. I also found the count function; however, it won't work for text. What do you suggest I use?

lecy commented 3 years ago

You might review the sections on Counting Characters and Counting Words in the assigned reading for the week:


We also covered some examples in the review session this week:


lecy commented 3 years ago

Basically, you need to split the titles into words. Then you can apply table() %>% sort() to the new character vector of individual words to see which occur most frequently.

malmufre commented 3 years ago

I have done the following to split the titles into words.

title.split <-strsplit( cleaned.titles, " " ) 

z <- lapply( title.split, length ) %>% unlist()

Then, to find the most common words, I used the following, but it does not seem to be working:

table(title.split)%>%sort(title.split,decreasing = T)

I am getting:

Error in table(title.split) : all arguments must have the same length

lecy commented 3 years ago

What does your title.split object look like? It's helpful to include output in your examples ( head(title.split) ).

Try something like:

title.words.list <-strsplit( cleaned.titles, " " ) 
title.words.vector <- unlist( title.words.list )
table( title.words.vector ) %>% sort()
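
For context on the error above, here is a minimal sketch of why the unlist() step matters. The three titles are made up for illustration and only stand in for cleaned.titles. strsplit() returns a list with one character vector per title, and table() treats a list as a set of separate factors to cross-tabulate, which is why it complains when the elements differ in length. The sketch nests sort( table() ) instead of using %>% so that it runs without loading a pipe package.

```r
# Hypothetical titles standing in for cleaned.titles
cleaned.titles <- c( "how to use data", "how to plot", "data is fun" )

# strsplit() returns a list: one character vector of words per title
title.words.list <- strsplit( cleaned.titles, " " )
class( title.words.list )
lengths( title.words.list )   # number of words in each title

# unlist() flattens the list into a single character vector of words
title.words.vector <- unlist( title.words.list )

# count each word, then sort the counts (ascending by default)
word.counts <- sort( table( title.words.vector ) )
word.counts
```

Calling table() on title.words.list directly reproduces the "all arguments must have the same length" error, because the titles contain different numbers of words.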

malmufre commented 3 years ago

Thanks, @lecy! For question 2B, I tried the grep function to get all the beginning words of the titles, but it doesn't seem to work. Is there another way to find them?

JasonSills commented 3 years ago

Hi @lecy

I was hoping you could point me to, or remind me of, how to sort a table. When I print out

lower.words <- tolower( cleaned.titles )
title.words.list <- strsplit( lower.words, " " )
title.words.vector <- unlist( title.words.list )
table( title.words.vector ) %>% sort()

I am not seeing the table with counts. I need the number of times each word is used, sorted in descending order. Can you point me in the right direction?

lecy commented 3 years ago

@JasonSills Here is what I am getting (tail is just printing the last 25 cells in the table - note titles have not been cleaned):

> table( title.words.vector ) %>% sort() %>% tail( 25 )
                   be                    my                   are                   can                    an 
                  177                   180                   182                   188                   217 
                 from                    on                     i                  data                  what 
                  219                   250                   284                   286                   287 
                  why               <strong class="markup--strong                  with                    is 
                  328                   339                   339                   413                   459 
                  you                   for                    in                  your                   and 
                  527                   600                   615                   635                   738 
                   of                   how                     a                   the                    to 
                  858                   877                  1109                  1634                  1713 
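
If descending order is preferred over reading the tail of an ascending sort, one option is the decreasing argument of sort(). This is a small sketch with a made-up word vector, not the lab data:

```r
# Hypothetical word vector standing in for title.words.vector
title.words.vector <- c( "the", "to", "the", "a", "to", "the" )

word.counts <- table( title.words.vector )

# decreasing = TRUE puts the most frequent words first
top.words <- sort( word.counts, decreasing = TRUE )

# head() then shows the most common words instead of tail()
head( top.words, 2 )
```

With the real data, head( top.words, 25 ) would show the same 25 words as the tail() call above, in reverse order.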

lecy commented 3 years ago

@malmufre You can approach this in a couple of ways. The easiest would be to write a function that identifies the first word in a title, then apply it to all titles.

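# x is a single title
get_first_word <- function( x ) {
  # split title x into words
  word.list <- strsplit( x, " " )
  # unlist results
  words <- unlist( word.list )
  # select the first word
  first.word <- words[1]
  # return first word
  return( first.word )
}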

# test your function
x <- d$title[1]
get_first_word( x )

[1] "A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model"
[2] "Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric"
[3] "How to Use ggplot2 in Python"                                   

# sapply applies the function to all titles in the vector
# the default prints the original title with the return values

sapply( d2$title, get_first_word )
A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model 
Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric 
                                   How to Use ggplot2 in Python 
# only prints return values 
sapply( d2$title, get_first_word, USE.NAMES=FALSE )
[1] "A"        "Hands-on" "How"

Alternatively, you can use loops. This might be a little more intuitive at this point, but it is less efficient:

results <- NULL
for( i in 1:length(d$title) )
  results[i] <- get_first_word( d$title[i] )
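
A self-contained sketch of the loop approach, under a few assumptions: the titles vector holds two of the titles shown earlier in place of d$title, and the helper body is one possible implementation rather than the official lab solution. It preallocates the result vector and uses seq_along(), which avoids the 1:0 problem that 1:length(x) has when x is empty. The vapply() call at the end is an optional stricter variant of sapply() that checks each return value is a single string.

```r
# One possible implementation of the helper (an assumption, not the official solution)
get_first_word <- function( x ) {
  words <- unlist( strsplit( x, " " ) )
  words[1]
}

# Two titles from the thread, standing in for d$title
titles <- c( "Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric",
             "How to Use ggplot2 in Python" )

# preallocate the result, then fill it one title at a time
results <- character( length( titles ) )
for( i in seq_along( titles ) ) {
  results[i] <- get_first_word( titles[i] )
}
results

# same result without an explicit loop; character(1) declares the expected return type
first.words <- vapply( titles, get_first_word, character(1), USE.NAMES = FALSE )
first.words
```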