Watts-College / cpp-527-fall-2021

A course shell for CPP 527 Foundations of Data Science II
https://watts-college.github.io/cpp-527-fall-2021/

Lab3-Q2-counting 1st words #18

Open Jana-Ajeeb opened 2 years ago

Jana-Ajeeb commented 2 years ago

Hello, I'm trying to run this code to get the 1st word out of each title but it's just returning the 1st word of the 1st title:

{r}

results <- NULL
for( i in 1:length(d$title) )
{
  d$title <- tolower( d$title )
ccv <- d$title[i]
word.list <- strsplit( ccv, " " )   # split title x into words
word.vector <- unlist( word.list )   # unlist results
results[i]<- split(word.vector, " ")[[i]][i]
i <- i+1
return(results)
}
print(results)

and the output: [1] "a"

lecy commented 2 years ago

A few things:

(1) You need to practice breaking open functions and loops and stepping through them line by line. It's the only way you can really understand them. See below.

(2) You will sometimes use counters in while loops, but you generally do not need them in for loops because i increments on its own. You don't need to add i <- i+1 at the end.

(3) You use return() statements inside of functions but not inside of loops.
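A small sketch of points (2) and (3). The vector x and the helper first_element() are made up for illustration and are not part of the lab:

```r
x <- c( "a", "b", "c" )   # toy data, not from the lab

# for loop: i takes each value of 1:length(x) automatically
seen.for <- NULL
for( i in 1:length(x) )
{
  seen.for[i] <- x[i]
}

# while loop: here you do manage the counter yourself
seen.while <- NULL
j <- 1
while( j <= length(x) )
{
  seen.while[j] <- x[j]
  j <- j + 1
}

# return() belongs inside a function (hypothetical helper)
first_element <- function( v )
{
  return( v[1] )
}

seen.for             # "a" "b" "c"
seen.while           # "a" "b" "c"
first_element( x )   # "a"
```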

Breaking the LOOP open:

###

i <- 1
d$title <- tolower( d$title )
head( d$title )
[1] "a beginner’s guide to word embedding with gensim word2vecâ model"   
[2] "hands-on graph neural networks with pytorch & pytorch geometric"      
[3] "how to use ggplot2 inâ python"                                        
[4] "databricks: how to save files in csv on your localâ computer"         
[5] "a step-by-step implementation of gradient descent and backpropagation"
[6] "an easy introduction to sql for data scientists"                      

ccv <- d$title[i]
ccv
[1] "a beginner’s guide to word embedding with gensim word2vecâ model"

word.list <- strsplit( ccv, " " )   # split title x into words
word.list
[[1]]
[1] "a"               "beginner’s"    "guide"           "to"              "word"           
[6] "embedding"       "with"            "gensim"          "word2vecâ model"

word.vector <- unlist( word.list )   # unlist results
word.vector
[1] "a"               "beginner’s"    "guide"           "to"              "word"           
[6] "embedding"       "with"            "gensim"          "word2vecâ model"

word.vector[1]
[1] "a"

results[i] <- word.vector[1]

###

Putting it back together:

results <- NULL
for( i in 1:length(d$title) )
{

    ###

    # i <- 1
    d$title <- tolower( d$title )

    ccv <- d$title[i]

    word.list <- strsplit( ccv, " " )   # split title x into words

    word.vector <- unlist( word.list )   # unlist results

    results[i] <- word.vector[1]

    ###

    ##  results[i] <- split(word.vector, " ")[[i]][i]   ## redundant: it would overwrite the result above

    ##  i <- i+1         ## you don't need counters in for loops
    ##  return(results)  ## you don't return from a loop
}

Jana-Ajeeb commented 2 years ago

I tried this:

{r}

results <- NULL
for( i in 1:length(d$title))
{
  #i <- 1
  d$title <- tolower( d$title )
  ccv <- d$title[i]
  word.list <- strsplit( ccv, " " )   # split title x into words
  word.vector <- unlist( word.list )   # unlist results
  results[i] <- word.vector[i]
  results[i] <- split(word.vector, " ")[[i]][i]
}
return(results)

but also this error appeared: Error in split(word.vector, " ")[[i]] : subscript out of bounds

I understand the concept, but I'm not getting how we can play with "i" so that it returns the 1st word of each sentence.

lecy commented 2 years ago

Getting closer.

Note that unlist() converts the list version of the sentence back into a regular character vector.

word.vector <- unlist( word.list )   # unlist results
word.vector
[1] "a"               "beginner’s"    "guide"           "to"              "word"           
[6] "embedding"       "with"            "gensim"          "word2vecâ model"

You then need to extract the first word from that vector.

You have this:

results[i] <- word.vector[i]

You want the 1st word, not the ith word:

results[i] <- word.vector[1]

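The subscript out of bounds error comes from the split() line, not from word.vector[i]. split(word.vector, " ") returns a list with a single element, so [[i]] fails as soon as i reaches 2. Indexing word.vector[i] past the end of the vector would only return NA.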

I'm not sure why you include the last line - it is redundant with previous steps and will overwrite the results. You can delete it.

results[i] <- split(word.vector, " ")[[i]][i]
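A quick check of how that line behaves as i grows, using a made-up three-word vector in place of a real title:

```r
word.vector <- c( "a", "beginner's", "guide" )   # toy title, already split into words

# split() with the single grouping value " " puts every word into one group,
# so the result is a list of length 1
groups <- split( word.vector, " " )
length( groups )      # 1

groups[[1]][1]        # "a"  (works only when i is 1)

# single brackets past the end of a vector return NA rather than an error
word.vector[10]       # NA

# double brackets past the end of a list raise an error
msg <- tryCatch( groups[[2]], error = function(e) conditionMessage(e) )
msg                   # the error message, captured as a string
```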

You also should move this outside of the loop:

d$title <- tolower( d$title )

It doesn't hurt anything, but you are converting the same titles to lower case 6,500 different times when once will do. It adds run-time to your code.
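Putting the suggestions together, the loop might look like the sketch below. The data frame here is a toy stand-in for the lab's d, with made-up titles, so that the example runs on its own:

```r
# toy stand-in for the lab's data frame d (titles are made up for illustration)
d <- data.frame( title = c( "A Beginner Guide", "Hands-On Graphs", "How to Plot" ),
                 stringsAsFactors = FALSE )

d$title <- tolower( d$title )   # convert once, before the loop

results <- NULL
for( i in 1:length(d$title) )
{
  ccv <- d$title[i]
  word.list <- strsplit( ccv, " " )     # split title i into words
  word.vector <- unlist( word.list )    # list -> character vector
  results[i] <- word.vector[1]          # keep the 1st word only
}

results            # "a" "hands-on" "how"
table( results )   # tally of how often each first word appears
```

With the real data, table( results ) is one way to count the first words once the loop has run.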

Jana-Ajeeb commented 2 years ago

noted thanks a lot!!