malmufre opened this issue 4 years ago
You might review the sections on Counting Characters and Counting Words in the assigned reading for the week:
https://ds4ps.org/cpp-527-spr-2020/lectures/string-processing.html
We also covered some examples in the review session this week.
Basically you need to split the titles into words; then you can apply table() %>% sort() to the new character vector of individual words to see which occur most frequently.
I have done the following to split the titles into words.
title.split <- strsplit( cleaned.titles, " " )
title.split
z <- lapply( title.split, length ) %>% unlist()
z
sum(z)
Then, to find the most common words used, I tried the following, but it does not seem to be working:
table(title.split)%>%sort(title.split,decreasing = T)
I am getting Error in table(title.split) : all arguments must have the same length
What does your title.split object look like? It's helpful to include output in your examples ( head(title.split) ). Note that strsplit() returns a list with one vector of words per title; table() is treating each of those vectors as a separate argument, and they have different lengths, which is why you see that error. You need to unlist() the words into a single vector first.
Try something like:
title.words.list <- strsplit( cleaned.titles, " " )
title.words.vector <- unlist( title.words.list )
table( title.words.vector ) %>% sort()
Thanks! @lecy
For Q2B, I tried the grep() function to get all the beginning words of the titles, but it doesn't seem to work. Is there another way to find them?
Hi @lecy
I was hoping you could point me to / remind me of how to sort a table. When I print out
lower.words <- tolower( cleaned.titles )
title.words.list <- strsplit( lower.words, " " )
title.words.vector <- unlist( title.words.list )
table( title.words.vector ) %>% sort()
I am not seeing the table with counts. I need the number of times each word is used, sorted in descending order. Can you point me in the right direction?
@JasonSills Here is what I am getting (tail() just prints the last 25 cells of the table; note the titles have not been cleaned):
> table( title.words.vector ) %>% sort() %>% tail( 25 )
title.words.vector
 be  my are can  an
177 180 182 188 217
from  on   i data what
 219 250 284  286  287
why <strong class="markup--strong with  is
328     339                   339  413 459
you for  in your and
527 600 615  635 738
 of how    a  the   to
858 877 1109 1634 1713
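Equivalently, you can sort in decreasing order and take the head to get the descending view you described:
table( title.words.vector ) %>% sort( decreasing = TRUE ) %>% head( 25 )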
@malmufre You can approach this in a couple of ways. The easiest would be to write a function that identifies the first word in a title, then apply it to all titles.
# x is a single title
get_first_word <- function( x )
{
  # split title x into words
  # unlist results
  # select the first word
  # return first word
}
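One way to fill in the body, following the comments above (a sketch; the variable names are just suggestions):
get_first_word <- function( x )
{
  words <- strsplit( x, " " )   # split title x into words
  words <- unlist( words )      # unlist results
  first.word <- words[1]        # select the first word
  return( first.word )          # return first word
}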
# test your function on a single title
x <- d2$title[1]
get_first_word( x )
Here d2 is a small example data frame with three titles:
d2$title
[1] "A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model"
[2] "Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric"
[3] "How to Use ggplot2 in Python"
# sapply applies the function to all titles in the vector
# the default prints the original title with the return values
sapply( d2$title, get_first_word )
A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model
                                                            "A"
Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric
                                                     "Hands-on"
How to Use ggplot2 in Python
                       "How"
# only prints return values
sapply( d2$title, get_first_word, USE.NAMES=FALSE )
[1] "A" "Hands-on" "How"
Alternatively, you can use a loop. This might be a little more intuitive at this point, but it is less efficient:
results <- NULL
for( i in seq_along( d2$title ) )
{
  results[i] <- get_first_word( d2$title[i] )
}
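Either way, once results holds one first word per title, you can tabulate it the same way as the full word list, for example:
table( results ) %>% sort( decreasing = TRUE ) %>% head( 10 )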
Hello @lecy
I am attempting to solve question 2 in Lab 3, but I am not able to find the right function that will give me the most common words in the titles. I searched for ways to do it and found the tidytext package. I also found the count() function, but it won't work for raw text. What do you suggest I use?
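For reference, a tidytext version of the word count might look something like this (a sketch; it assumes the titles are stored in a data frame d with a title column):
library( dplyr )
library( tidytext )
d %>%
  unnest_tokens( word, title ) %>%   # one row per word, lowercased by default
  count( word, sort = TRUE )         # count each word, most frequent first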