DS4PS / cpp-527-spr-2021

http://ds4ps.org/cpp-527-spr-2021/

Lab 03: text codes #10

Open · lghb2005 opened this issue 3 years ago

lghb2005 commented 3 years ago

Hi Dr. Lecy,

For the Q2 part of this lab, my word-frequency results contain some text codes. I have tried to fix this but found no solution. I wonder whether something is wrong with my code, or whether base R functions can handle this. Thanks in advance!

library( dplyr )
library( tidytext )

# convert all words in titles to lowercase
words.lower <- tolower( d$title )

# split titles into single words
words.list <- strsplit( words.lower, " " ) 
words.vector <- unlist( words.list )

# word counts
data_frame( words.vector ) %>% 
    unnest_tokens( word, words.vector ) %>% 
    anti_join( stop_words ) %>% 
    count( word, sort = TRUE )

This is the result my code returns.

   word          n
   <chr>     <int>
   <U+62AF>    441
   <U+720F>    406
   data        350
   <U+720C>    323
   <U+7210>    272
   <U+71FC>    267
   <U+7213>    267
   <U+71FE>    243
   <U+7209>    242
   <U+62B0>    232
lecy commented 3 years ago

Mac or PC?

Can you tell me what you get at this point?

library( dplyr )
library( tidytext )

# convert all words in titles to lowercase
words.lower <- tolower( d$title )

# split titles into single words
words.list <- strsplit( words.lower, " " ) 
words.vector <- unlist( words.list )

head( words.vector )
lecy commented 3 years ago

Also, words.vector should just be a regular character vector of words.

You might be introducing some problems by trying to cast a vector as a data.frame (usually you combine several vectors into a data frame).

So not sure what all is happening here:

# word counts
data_frame( words.vector ) %>% 
    unnest_tokens( word, words.vector ) %>% 
    anti_join( stop_words ) %>% 
    count( word, sort = TRUE )

If the vector is fine, you would count elements in a vector with table:

table( words.vector )
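
For example, to pull out the ten most frequent words you can sort the table before printing:

head( sort( table( words.vector ), decreasing = TRUE ), 10 )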

dplyr is seductive! But those functions only work with data frames, not vectors.
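
That said, if you do want the tidytext pipeline, the trick is to tokenize straight from the title column rather than from a pre-split vector. A minimal sketch, assuming d has a title column as in the lab:

library( dplyr )
library( tidytext )

# unnest_tokens() lowercases and splits the titles for you
data.frame( title = d$title ) %>%
  unnest_tokens( word, title ) %>%
  anti_join( stop_words, by = "word" ) %>%
  count( word, sort = TRUE )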

lghb2005 commented 3 years ago

Hi Dr. Lecy,

Many thanks for coming to help! I am working on a PC with Windows 10. I guess the vector is fine. I then built a table and converted it to a data frame to count the word frequencies. But I get another problem: it seems the stop words have not been removed from words.vector.

# convert all words in titles to lowercase
words.lower <- tolower( d$title )

# split titles into single words
words.list <- strsplit( words.lower, " " ) 
words.vector <- unlist( words.list )

# remove stopwords
stop.words <- stop_words[ 1:nrow( stop_words ), ]   # keeps every row; just copies the data frame
stop.words.vector <- stop.words$word
words.vector <- gsub( stop.words.vector, "", words.vector )  

head( words.vector )
------------------------
# this returns the following:
Warning message:
argument 'pattern' has length > 1 and only the first element will be used
[1] ""           "beginner’s" "guide"      "to"         "word"       "embedding"

I am now able to count the word frequencies but still have some noisy stop words. Is something wrong with my code? Thanks!

# count the number of words in words.vector
words.vector.table <- table( words.vector )
words.dt <- as.data.frame( words.vector.table ) 
words.dt[ order( -words.dt$Freq ), ]

# THIS IS THE RETURN
        words.vector   Freq
        <fctr>        <int>
11250   to             1713
10893   the            1634
1                      1113
4719    how             877
7656    of              858
7115    nd              738
12697   your            635
4995    in              615
3868    for             600
12639   you             527

Also, it seems that removing stop words only replaces the matched words with empty strings but leaves the empty entries in the vector. So my question is: how do I remove both the values and their positions (entries) from the vector (maybe by subsetting)?

lecy commented 3 years ago

This is the issue here:

words.vector <- gsub( stop.words.vector, "", words.vector ) 
Warning message:
In gsub(stop.words.vector, "", words.vector) :
  argument 'pattern' has length > 1 and only the first element will be used

The function gsub() is not vectorized over its pattern argument, meaning it won't iterate through your stop.words.vector automatically.

As a result, it's only removing the first word in the list and ignoring the rest.
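
You can see the behavior with a toy example:

gsub( c("a","b"), "", c("cat","cab") )
# Warning: argument 'pattern' has length > 1 and only the first element will be used
# [1] "ct" "cb"    (only the first pattern, "a", was applied)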

Here is a quick loop solution:

for( i in stop.words.vector )
{
  words.vector <- gsub( i, "", words.vector ) 
}
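
One caution: gsub( i, "", words.vector ) matches substrings, so removing "a" also clips "and" down to "nd" (you can see this in your frequency table above). Anchoring the pattern restricts it to whole-element matches; a sketch, assuming none of the stop words contain regex metacharacters:

for( i in stop.words.vector )
{
  # ^ and $ anchor the match so only whole words are replaced
  words.vector <- gsub( paste0( "^", i, "$" ), "", words.vector ) 
}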

You are splitting titles into a words vector, then replacing stop words. This will result in lots of empty cases, as you have noted.

You could delete stop words, then split titles into word vectors.

Or you can just filter out empty cases after you remove all of the stop words.

words.vector <- words.vector[ words.vector != "" ]
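
Or, since each element is already a single word, you can skip gsub() entirely and drop exact matches in one vectorized step, which avoids both the substring clipping and the empty strings:

words.vector <- words.vector[ ! words.vector %in% stop.words.vector ]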

You can even remove stop words from the table after it is created:

t <- table( words.vector )
t <- sort( t, decreasing=T )
d <- as.data.frame( t )
d <- d[ ! d$words.vector %in% stop.words.vector , ]
# d <- dplyr::filter( d, ! d$words.vector %in% stop.words.vector )   # dplyr version
lecy commented 3 years ago

These examples are why we start with data types and logical statements in CPP 526.

Hopefully you can see how data recipes are starting to take shape.

And how you need to be able to convert from one data type to another to build an effective recipe.

Also that there are always six dozen ways to do the same thing. Only a handful of ways to do it efficiently. And usually a couple of ways that are robust to changes in the code earlier in the script or edge cases like special characters.

The learning curve is: (1) being able to read someone else's solution and see how it works, (2) creating your own inefficient and clunky solution, (3) creating your own efficient solution, (4) creating your own robust solution that includes unit tests for certainty.

Every now and then you will see a solution that is 3 lines of bulletproof code for solving a complex data step. You immediately recognize it as a tiny masterpiece, like a Shakespearean sonnet.

lghb2005 commented 3 years ago

Hi Dr. Lecy, thanks again for helping me out. I now know the trick that gsub() is not vectorized over patterns. Also, I much appreciate the learning experience in coding!