DS4PS / cpp-527-fall-2020

http://ds4ps.org/cpp-527-fall-2020/

Lab 3 - Q2 #20

Open ecking opened 3 years ago

ecking commented 3 years ago

Hi there,

So for question 2, I've created a table that shows the first word of every title. I'm trying to group by the title now and it's not doing anything.

The output still shows the same first word of every title... no grouping.

Any ideas on what could be wrong?

lecy commented 3 years ago

I would need to see code...

But why do you need to group by title?

Have you first created a vector with only the first word in each title? At that point you can use a table function to figure out the frequencies.
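As a rough sketch of that last step, assuming you already have a character vector of first words (the `first.words` vector below is made up for illustration and is not the lab data), `table()` counts how often each value occurs and `sort()` orders the counts:

```r
# hypothetical vector of first words, one per title (not from the lab data)
first.words <- c( "How", "A", "How", "The", "How", "A" )

# table() tallies how many times each unique word appears
word.freq <- table( first.words )

# sort from most to least frequent
word.freq <- sort( word.freq, decreasing=TRUE )

# names() holds the words, the values hold the counts
names( word.freq )[1]        # most common first word
as.integer( word.freq[1] )   # its frequency
head( word.freq, 3 )         # top three words with their counts
```

No grouping step is needed here: the table function does the counting once the first words are in a single vector.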

JayCastro commented 3 years ago

In the homework I used word from the stringr package. Is that fine?

lecy commented 3 years ago

@JayCastro It is OK, but I would always try it without tidyverse functions first.

The problem with packages like dplyr, stringr, and lubridate is they make things TOO easy.

One really important skill to develop in R is thinking about how functions change data structures and why. In this case, character vectors are converted to lists by the strsplit() function.
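A small sketch of that change in data structure, using two of the titles that appear later in this thread (the object names are only for illustration):

```r
titles <- c( "How to Use ggplot2 in Python",
             "Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric" )

class( titles )        # "character"

# splitting on spaces returns a list with one element per title
word.list <- strsplit( titles, " " )

class( word.list )     # "list"
length( word.list )    # 2, one element per title

# double brackets pull out the character vector of words for one title
word.list[[1]]
word.list[[1]][1]      # "How"
```

Each list element can hold a different number of words, which is why the result cannot be a simple vector.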

This forces you to work with different data structures, which will make you a much better R programmer / analyst because you develop a deep understanding of the data and the intuition behind the code.

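The tidyverse packages were designed to make coding more efficient, so they have lots of helper functions that convert results back into the original data type. So these two approaches do roughly the same job, counting the words in each title (the counts can differ when a title contains hyphens or stray punctuation, because boundary("word") applies word-boundary rules instead of splitting on spaces):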

library( stringr )
str_count( titles, boundary("word") )

# core R version:
# lapply applies the length function to each list element and returns a list

word.list <- strsplit( titles, " " )
word.count <- lapply( word.list, length )
word.count <- unlist( word.count ) # convert to a vector

# or more concisely:
# sapply applies the length function to each list element and returns a vector
# (the %>% pipe itself is not core R; it comes from magrittr, which dplyr also loads)

library( magrittr )
titles %>% strsplit( " " ) %>% sapply( length )
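To make the difference in return types concrete, here is a short check written without the pipe. The two titles come from later in this thread and the object names are only for illustration:

```r
titles <- c( "How to Use ggplot2 in Python",
             "Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric" )
word.list <- strsplit( titles, " " )

count.list <- lapply( word.list, length )   # returns a list of counts
count.vec  <- sapply( word.list, length )   # returns an integer vector

class( count.list )   # "list"
class( count.vec )    # "integer"

# unlisting the lapply result gives the same vector that sapply returns
identical( unlist( count.list ), count.vec )   # TRUE
```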

The difference is that with the core R version you learn how to work efficiently with lists, whereas with str_count() you remain blissfully ignorant of any list operations because the tidyverse has abstracted away the underlying data structure.

You can learn to program faster using tidyverse functions because they are very intuitive, but there will be gaps in your understanding of the code and your ability to understand the underlying operations.

Core R is tedious, but the manual nature of breaking a problem into individual steps is helpful in the long run as you encounter issues that don't already have a convenient tidyverse function implemented. Otherwise, as you mature in your career and are given harder problems to work on, you will get stuck more often if you rely too heavily on tidyverse frameworks.

If you learn core R functions, however, it's easy to use tidyverse functions to scale your code quickly since you have a strong understanding of the underlying processes. So you lose nothing by focusing on core R operations when you are first learning to code - it just takes a bit longer to get comfortable.

JayCastro commented 3 years ago

Okay I will continue to work on this. Thank you!

JayCastro commented 3 years ago

I mostly used it because I tried everything to find the first and last word, but I will keep trying.

lecy commented 3 years ago

Copying sample code from another thread to help get you started:

# x is a single title
get_first_word <- function( x )
{
  # split title x into words
  # unlist results
  # select the first word
  # return first word
}

# test your function
x <- d$title[1]
get_first_word( x )

d2$title
[1] "A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model"
[2] "Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric"
[3] "How to Use ggplot2 in Python"                                   

# sapply applies the function to all titles in the vector
# by default the original titles are attached as names on the return values

sapply( d2$title, get_first_word )
A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model 
                                                            "A" 
Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric 
                                                     "Hands-on" 
                                   How to Use ggplot2 in Python 
                                                          "How" 
# only prints return values 
sapply( d2$title, get_first_word, USE.NAMES=FALSE )
[1] "A"        "Hands-on" "How"

Alternatively, you can use loops. This might be a little more intuitive for you right now, but it is a much less efficient approach in the long run:

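# pre-allocate the results vector; seq_along is also safe if d$title is empty
results <- character( length(d$title) )
for( i in seq_along(d$title) )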
{
  results[i] <- get_first_word( d$title[i] )
}
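
For reference, here is one possible way to fill in the get_first_word() skeleton and confirm that the sapply and loop versions agree. This is an illustrative sketch, not the official lab solution: the function body is an assumption based on the comments in the skeleton, and the `titles` vector stands in for d2$title using the three titles printed above (with a plain apostrophe).

```r
# one possible way to fill in the skeleton (illustration only)
get_first_word <- function( x )
{
  word.list <- strsplit( x, " " )   # split title x into words (returns a list)
  words <- unlist( word.list )      # unlist results into a character vector
  first.word <- words[1]            # select the first word
  return( first.word )              # return first word
}

# stand-in for d2$title
titles <- c( "A Beginner's Guide to Word Embedding with Gensim Word2Vec Model",
             "Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric",
             "How to Use ggplot2 in Python" )

# sapply version
first.sapply <- sapply( titles, get_first_word, USE.NAMES=FALSE )

# loop version, pre-allocating the results vector
first.loop <- character( length(titles) )
for( i in seq_along(titles) )
{
  first.loop[i] <- get_first_word( titles[i] )
}

first.sapply                            # "A" "Hands-on" "How"
identical( first.sapply, first.loop )   # TRUE

# frequencies of the first words
table( first.sapply )
```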