DS4PS / cpp-527-fall-2020

http://ds4ps.org/cpp-527-fall-2020/

Lab 3 Q3 word count error #19

Open JasonSills opened 4 years ago

JasonSills commented 4 years ago

Hi @lecy ,

I'm having some difficulty setting up Q3. Following the notes from this week I've created the following code:

lower.words <- tolower( cleaned.titles )
title.words.list <- strsplit( lower.words, " " )
title.word.count <- length( strsplit( lower.words, split="" )[[1]] )

However, title.word.count results in 63. That is the character count of the first title only and is what I would expect from the nchar() function, because it counts each letter, not each word. Where am I going wrong?
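(For reference, the 63 comes from split="": splitting on the empty string breaks a title into individual characters, which is why the result matches nchar(). A minimal sketch with a made-up title, not one from the data:)

```r
# splitting on "" yields characters, not words
x <- "How to Use ggplot2 in Python"
length( strsplit( x, split="" )[[1]] )   # 28, same as nchar(x)
length( strsplit( x, split=" " )[[1]] )  # 6 words
```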

JasonSills commented 4 years ago

Found the answer using stringr.

lecy commented 4 years ago

What you have here is correct:

title.words.list <- strsplit( cleaned.titles, " " )
title.word.count <- length( title.words.list[[1]] )

But as you point out, this only counts words in the first title. To apply this operation to all titles you need to use lapply():

# applies the length function to each element in the list
title.word.count <- lapply( title.words.list, length )

And you can unlist at the end to turn it back into a vector:

d2$title
[1] "A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model"
[2] "Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric"
[3] "How to Use ggplot2 in Python"                                   

title.words.list <- strsplit( d2$title, " " )

length( title.words.list )  # length of the list
[1] 3

length( title.words.list[[1]] )  # num words in first title
[1] 9

lapply( title.words.list, length )  # num words in all titles
[[1]]
[1] 9

[[2]]
[1] 9

[[3]]
[1] 5

title.word.count <- lapply( title.words.list, length ) 
title.word.count <- unlist( title.word.count )
title.word.count
[1] 9 9 5
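As an aside, sapply() collapses the lapply() + unlist() steps into one call, since it simplifies the result to a vector when possible. A small sketch with toy titles (not from the data):

```r
# sapply() returns a vector directly instead of a list
titles <- c( "How to Use ggplot2 in Python", "A Beginner's Guide" )
title.words.list <- strsplit( titles, " " )
sapply( title.words.list, length )  # 6 3
```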
JasonSills commented 4 years ago

@lecy I see, thanks.

Something interesting happened. I used stringr and built this code:

library(stringr)

title.word.count <- str_count( cleaned.titles, boundary("word") )
title.word.count
twcmodel <- lm( d$claps ~ title.word.count, data = dat )
summary( twcmodel )

But the output is recording one extra word in each string compared to the code above. What is causing this?

JasonSills commented 4 years ago

@lecy I printed the titles and counted the first three by hand. The stringr calculation, which gives one more word per title, is correct. This is, of course, changing my linear model. I'm going to stay with the stringr count since it is correct; just want you to know that it won't match a linear model built with the count above.

lecy commented 4 years ago

That is really strange. There are special spaces being used near the end of words:

> first.six.titles <- head( d$title )
> strsplit( first.six.titles, " " )
[[1]]
[1] "A"              "Beginner’s"     "Guide"          "to"             "Word"           "Embedding"      "with"          
[8] "Gensim"         "Word2Vec Model"

[[2]]
[1] "Hands-on"  "Graph"     "Neural"    "Networks"  "with"      "PyTorch"   "&"         "PyTorch"   "Geometric"

[[3]]
[1] "How"       "to"        "Use"       "ggplot2"   "in Python"

[[4]]
 [1] "Databricks:"    "How"            "to"             "Save"           "Files"          "in"            
 [7] "CSV"            "on"             "Your"           "Local Computer"

[[5]]
[1] "A"               "Step-by-Step"    "Implementation"  "of"              "Gradient"        "Descent"        
[7] "and"             "Backpropagation"

[[6]]
[1] "An"           "Easy"         "Introduction" "to"           "SQL"          "for"          "Data"        
[8] "Scientists"  

If you copy the space from one of the words that fail to split apart, like "Word2Vec Model", and compare it to a regular space you get:

" " == " "  # regular spaces
[1] TRUE
" " == " "  # cases that fail to split
[1] FALSE
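One way to see what the odd character actually is: inspect its code point with utf8ToInt(). A sketch assuming the culprit is the hair space U+200A, which matches the character flagged in the original cleaning steps:

```r
# identify the invisible character by its Unicode code point
sp <- "\u200a"                       # hair space (assumed culprit)
utf8ToInt( sp )                      # 8202
sprintf( "U+%04X", utf8ToInt(sp) )   # "U+200A"
sp == " "                            # FALSE, not a regular space
```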

This fixes it, but I am really curious why Medium uses the weird space characters only near the end of titles.

first.six.titles <- gsub( " ", " ", first.six.titles )  # first argument is the special space
strsplit( first.six.titles, " " )
[[1]]
 [1] "A"          "Beginner’s" "Guide"      "to"         "Word"       "Embedding"  "with"       "Gensim"     "Word2Vec"  
[10] "Model"     

[[2]]
[1] "Hands-on"  "Graph"     "Neural"    "Networks"  "with"      "PyTorch"   "&"         "PyTorch"   "Geometric"

[[3]]
[1] "How"     "to"      "Use"     "ggplot2" "in"      "Python" 

[[4]]
 [1] "Databricks:" "How"         "to"          "Save"        "Files"       "in"          "CSV"         "on"         
 [9] "Your"        "Local"       "Computer"   

[[5]]
[1] "A"               "Step-by-Step"    "Implementation"  "of"              "Gradient"        "Descent"        
[7] "and"             "Backpropagation"

[[6]]
[1] "An"           "Easy"         "Introduction" "to"           "SQL"          "for"          "Data"        
[8] "Scientists" 
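A copy-paste-proof version of the fix: write the special space with its Unicode escape rather than pasting the invisible character, assuming it is the hair space U+200A:

```r
# replace hair spaces (U+200A) with regular spaces before splitting
x <- "Word2Vec\u200aModel"
x <- gsub( "\u200a", " ", x, fixed = TRUE )
strsplit( x, " " )[[1]]  # "Word2Vec" "Model"
```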

Thanks for pointing this out!

lecy commented 4 years ago

Also highlights the joys of working with text as data :-)

lecy commented 4 years ago

I suspect these are the same hair spaces that appeared in the original cleaning steps as well.

<U+200A>—<U+200A>