lghb2005 opened this issue 3 years ago
Mac or PC?
Can you tell me what you get at this point?
library( dplyr )
library( tidytext )
# convert all words in the titles to lowercase
words.lower <- tolower( d$title )
# split titles into single words
words.list <- strsplit( words.lower, " ")
words.vector <- unlist(words.list)
head( words.vector )
Also, words.vector should just be a regular character vector of words.
You might be introducing some problems by trying to cast a vector as a data.frame (usually you combine several vectors into a data frame).
So not sure what all is happening here:
# word counts
data_frame( words.vector ) %>%
unnest_tokens( word, words.vector ) %>%
anti_join( stop_words ) %>%
count( word, sort = TRUE )
If the vector is fine, you would count elements in a vector with table:
table( words.vector )
dplyr is seductive! But those functions only work with data frames, not vectors.
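If you do want the dplyr/tidytext route, a minimal sketch (assuming words.vector is the character vector built above) is to wrap the vector in a data frame column first, since those verbs expect data frames:
library( dplyr )
library( tidytext )
# wrap the word vector in a data frame so dplyr verbs can operate on it
word.df <- data.frame( word = words.vector, stringsAsFactors = FALSE )
word.df %>%
  anti_join( stop_words, by = "word" ) %>%   # drop rows matching the tidytext stop word list
  count( word, sort = TRUE )                 # tally each word and sort by frequency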
Hi Dr. Lecy,
Many thanks for coming to help! I am working on a PC with Windows 10. I believe the vector is fine. I then build a table and convert it to a data frame to count the word frequencies. But I run into another problem: it seems the stop words have not been removed from words.vector.
# convert all words in the titles to lowercase
words.lower <- tolower( d$title )
# split titles into single words
words.list <- strsplit( words.lower, " ")
words.vector <- unlist( words.list )
# remove stopwords
stop.words <- stop_words[ 1: nrow( stop_words ), ]
stop.words.vector <- stop.words$word
words.vector <- gsub( stop.words.vector, "", words.vector )
head( words.vector )
------------------------
# this returns the following
argument 'pattern' has length > 1 and only the first element will be used
[1] "" "beginner’s" "guide" "to" "word" "embedding"
I am now able to count the word frequencies, but some noisy stop words remain. Is something wrong with my code? Thanks!
# count how often each word appears in words.vector
words.vector.table <- table( words.vector )
words.dt <- as.data.frame( words.vector.table )
words.dt[ order( -words.dt$Freq ), ]
# THIS IS THE RETURN
      words.vector  Freq
11250           to  1713
10893          the  1634
1                   1113
4719           how   877
7656            of   858
7115            nd   738
12697         your   635
4995            in   615
3868           for   600
12639          you   527
Also, removing the stop words this way only replaces the matched words with empty strings but leaves the empty entries in place in the vector. So my question is how to remove both the values and their positions (entries) from the vector (or maybe subset it)?
This is the issue here:
words.vector <- gsub( stop.words.vector, "", words.vector )
Warning message:
In gsub(stop.words.vector, "", words.vector) :
argument 'pattern' has length > 1 and only the first element will be used
The function gsub() is not vectorized over the pattern argument, meaning it won't iterate through your stop.words.vector automatically.
As a result, it's only removing the first word in the list and ignoring the rest.
Here is a quick loop solution:
for( i in stop.words.vector )
{
words.vector <- gsub( i, "", words.vector )
}
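An alternative sketch, if you want to avoid the loop, is to collapse all of the stop words into one whole-word regular expression so gsub() only runs once (this assumes none of the stop words contain regex metacharacters):
# build a single pattern that matches any stop word as a whole word
stop.pattern <- paste0( "\\b(", paste( stop.words.vector, collapse = "|" ), ")\\b" )
words.vector <- gsub( stop.pattern, "", words.vector )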
You are splitting titles into a words vector, then replacing stop words. This will result in lots of empty cases, as you have noted.
You could delete stop words, then split titles into word vectors.
Or you can just filter out empty cases after you remove all of the stop words.
words.vector <- words.vector[ words.vector != "" ]
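Another sketch along the same lines: since the words are already split into a vector, you can drop the stop-word elements outright with %in%, which removes both the values and their positions in one step and answers the subsetting question above (assuming words.vector is rebuilt fresh from the titles first):
# keep only the elements that are not stop words
words.vector <- words.vector[ ! words.vector %in% stop.words.vector ]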
You can even remove stop words from the table after it has been created:
t <- table( words.vector )
t <- sort( t, decreasing=T )
d <- as.data.frame( t )
d <- d[ ! d$words.vector %in% stop.words.vector , ]
# d <- dplyr::filter( d, ! d$words.vector %in% stop.words.vector ) # dplyr version
These examples are why we start with data types and logical statements in CPP 526.
Hopefully you can see how data recipes are starting to take shape.
And how you need to be able to convert from one data type to another to build an effective recipe.
Also that there are always six dozen ways to do the same thing. Only a handful of ways to do it efficiently. And usually a couple of ways that are robust to changes in the code earlier in the script or edge cases like special characters.
The learning curve is (1) being able to read someone else's solution and see how it works, (2) creating your own inefficient and clunky solution, (3) creating your own efficient solution, and (4) creating your own robust solution that includes unit tests for certainty.
Every now and then you will see a solution that is 3 lines of bulletproof code for solving a complex data step. You immediately recognize them as tiny masterpieces, like a Shakespearean sonnet.
Hi Dr. Lecy, thanks again for helping me out. I now understand the trick that gsub() is not vectorized over the pattern argument. Also, I much appreciate the lessons about learning to code!
Hi Dr. Lecy,
For the Q2 part of this lab, my word frequency results contain some text codes. I have tried but still have no solution. I wonder if there is something wrong with my code or whether base R functions can handle this. Thanks in advance!
This is the result my code returns.