DS4PS / cpp-527-fall-2020

http://ds4ps.org/cpp-527-fall-2020/

Lab 03 Q1 #15

Open krbrick opened 4 years ago

krbrick commented 4 years ago

Sorry guys, I have no shame. I'm trying to calculate the mean log clap.score for my cleaned titles that begin with Wh (who, what, when, why, etc.).

the.Ws <- grep("^Wh", last.titles, value = T)
mean(try.combine$clap.score == "the.Ws")
mean(try.combine$clap.score)

This gives me 0. Do I need to take a subset of the clap scores that equal the.Ws? Thanks,

lecy commented 4 years ago

I would suggest using the grepl() function to define your group (it returns a logical vector instead of the positions or values of matches), then using the group to analyze the clap score.

You might revisit some past chapters or labs that analyze data by groups:

http://ds4ps.org/dp4ss-textbook/p-073-group-structure.html

https://ds4ps.org/cpp-526-sum-2020/labs/lab-02-instructions-v2.html

Or consider group_by() + summarize() in dplyr.

malmufre commented 4 years ago

I have a concern about question 1 as well. What I did was create new objects to store my titles by category. For instance:

questions<-grep("\\?",new.titles, value = T) 
questions

I used this to find all titles that have questions in them, but then when I want to find the average performance I am not able to retrieve the questions object since it is not found in my data set. So how is it possible to add the new title categories to my data set and map them to the clap scores in my data set to be able to calculate the average?

krbrick commented 4 years ago

Hello malmufre, I can't answer all of your questions, but I used cbind.data.frame to roll my new titles into the dataset:

try.combine <- cbind.data.frame(last.titles, d$claps, d$reading_time, d$publication, d$date, d$subtitle)

Then I checked the structure to make sure it was right:

str(try.combine)

Best of luck!

krbrick commented 4 years ago

mean (try.combine$clap.score[last.titles == the.Wsss])
mean(try.combine$clap.score)

> I would suggest using the grepl() function to define your group (it returns a logical vector instead of the positions or values of matches), then using the group to analyze the clap score.
>
> You might revisit some past chapters or labs that analyze data by groups:
>
> http://ds4ps.org/dp4ss-textbook/p-073-group-structure.html
>
> https://ds4ps.org/cpp-526-sum-2020/labs/lab-02-instructions-v2.html
>
> Or consider group_by() + summarize() in dplyr.

can you be more specific?

the.Wsss = grep("^Wh",last.titles, value = T)
mean (try.combine$clap.score[last.titles == the.Wsss])
Question.titles <- grep("\\?",last.titles, value = T)
mean(try.combine$clap.score [last.titles == Question.titles])
mean(try.combine$clap.score)
longer object length is not a multiple of shorter object length
[1] 1.157985
longer object length is not a multiple of shorter object length
[1] 2.170262
[1] 2.048821

Are these means ok even with this error?

class(the.Wsss)
class(the.Ws)
class(the.Wss)
[1] "character"
[1] "logical"
[1] "integer"

I can't seem to get a grip on dplyr. Neither filter() nor group_by() works for me:

try.combine %>%
  filter(try.combine$last.titles == the.Wsss)%>%
  summarise(ave = mean(clap.score)) %>%
  arrange(desc(ave))

try.combine %>%
  filter(try.combine$last.titles == the.Ws)%>%
  summarise(ave = mean(clap.score)) %>%
  arrange(desc(ave))

try.combine %>%
  filter(try.combine$last.titles == the.Wss)%>%
  summarise(ave = mean(clap.score)) %>%
  arrange(desc(ave))

try.combine %>%
  group_by(try.combine$last.titles == the.Wsss)%>%
  summarise(ave = mean(clap.score)) %>%
  arrange(desc(ave))%>%
  ungroup()

try.combine %>%
  group_by(try.combine$last.titles == the.Wss)%>%
  summarise(ave = mean(clap.score)) %>%
  arrange(desc(ave))%>%
  ungroup()

try.combine %>%
  group_by(try.combine$last.titles == the.Ws)%>%
  summarise(ave = mean(clap.score)) %>%
  arrange(desc(ave))%>%
  ungroup()

Someone please help me. I have been trying to figure this out for two days and am still on question one, trying not to lose my mind.

lecy commented 4 years ago

@malmufre Recall how you add vectors back to the original data frame:

dat$new.var <- new.var

Your problem is you are using grep() instead of grepl().

grep( ... , value=T ) will return all titles that match your criteria. grepl( ... ) will create a logical vector where each case that meets your criteria will be returned as TRUE, and cases that don't meet the criteria will be returned as FALSE.

Importantly, grepl() will return a vector that is the same length as your original data frame. grep() will return a shorter one. So when using grepl() you can add your newly constructed group back to your original data frame:

questions <- grepl( "\\?", new.titles )
d$questions <- questions

Or use it for analysis:

mean( clap.score[ questions ] )
mean( clap.score[ ! questions ] ) 

The function grep( ... , value=T ) is useful for refining your regular expressions because you can see exactly which titles match your current criteria to ensure your regular expressions are working, but it is not the right data structure for subsequent steps in your analysis.
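A minimal sketch of the length difference, using made-up titles and clap scores. The data frame `toy` and all of its values are hypothetical, not the lab data:

```r
# Hypothetical toy data, not the lab data set
toy <- data.frame(
  title = c("Why R?", "Learning regex", "What is a vector?", "Data frames 101"),
  clap.score = c(2.0, 1.0, 3.0, 5.0),
  stringsAsFactors = FALSE
)

matched.titles <- grep("\\?", toy$title, value = TRUE)  # matching titles only
is.question    <- grepl("\\?", toy$title)               # one TRUE/FALSE per row

length(matched.titles)   # 2: shorter than the data frame
length(is.question)      # 4: same as nrow(toy)

toy$question <- is.question            # fits back into the data frame
mean(toy$clap.score[is.question])      # mean for titles with a question mark
mean(toy$clap.score[!is.question])     # mean for all other titles
```

Because `is.question` has one element per row, it can be stored as a column or used directly as an index, which the shorter `matched.titles` vector cannot.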

I can't emphasize enough how important logical statements and logical vectors are for analysis in data science. The first step in many problems is defining your groups. Once defined, there are many easy ways to efficiently analyze your data:

http://ds4ps.org/dp4ss-textbook/p-050-business-logic.html

http://ds4ps.org/dp4ss-textbook/p-073-group-structure.html
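As a rough illustration of "define the group, then analyze it", here is a sketch in base R so that it runs without extra packages (group_by() + summarize() in dplyr would do the same job). The data frame `toy`, its values, and the column name `starts.wh` are hypothetical:

```r
# Hypothetical toy data, not the lab data set
toy <- data.frame(
  title = c("Why R?", "Learning regex", "What is a vector?", "Data frames 101"),
  clap.score = c(2.0, 1.0, 3.0, 5.0),
  stringsAsFactors = FALSE
)

# Step 1: define the group as a logical vector, one value per row
toy$starts.wh <- grepl("^Wh", toy$title)

# Step 2: summarize the outcome by group
group.means <- tapply(toy$clap.score, toy$starts.wh, mean)
group.means

# The same summary returned as a data frame
aggregate(clap.score ~ starts.wh, data = toy, FUN = mean)
```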

lecy commented 4 years ago

@krbrick Please note the difference between grep() and grepl() (see the discussion above for differences).

The "L" in grepl() is for logical vector.

I still don't know what last.titles represents but I think this would solve your problem:

clap.score <- log( d$claps + 1 )
group.wsss <- grepl( "^Wh", last.titles )   # grepl, not grep
mean( clap.score[ group.wsss ] )
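The + 1 in the first line is presumably there because log(0) is -Inf, so a single title with zero claps would drag the mean to -Inf. A small sketch with made-up clap counts (the vector `claps` is hypothetical):

```r
claps <- c(0, 10, 100)   # hypothetical clap counts, including a zero

mean(log(claps))       # -Inf, because log(0) is -Inf
mean(log(claps + 1))   # finite
log1p(claps)           # base R shorthand for log(x + 1)
```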

Your intuition is correct here but you are using the wrong equals operator:

the.Wsss <- grep( "^Wh", last.titles, value = T )
mean (try.combine$clap.score[last.titles == the.Wsss])

This is a very subtle but very important thing to note. When looking for matches in a vector you need to differentiate cases where you are searching for a single case from cases where you are matching against a set of cases:

vector == "A"
vector %in% c("A","B","C")

The fix to what you were doing before:

the.Wsss <- grep( "^Wh", last.titles, value = T )
mean ( clap.score[ last.titles %in% the.Wsss ] )
# mean (try.combine$clap.score[last.titles == the.Wsss])

See the difference?
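To make the difference concrete, here is a small sketch with made-up titles and scores (all values are hypothetical). It also shows why a mean computed after the "longer object length" warning should not be trusted: `==` lines the two vectors up position by position and recycles the shorter one, so it only catches matches that happen to fall in the right slot.

```r
# Hypothetical toy data, not the lab data set
titles     <- c("Why R?", "Learning regex", "What is a vector?",
                "Data frames 101", "Intro to dplyr")
clap.score <- c(2.0, 1.0, 3.0, 5.0, 4.0)

wh.titles <- grep("^Wh", titles, value = TRUE)   # the 2 matching titles

# == compares position by position, recycling the shorter vector.
# It warns here because 5 is not a multiple of 2; the warning is
# suppressed only to keep the output of this sketch clean.
by.equals <- suppressWarnings(titles == wh.titles)
by.equals    # only position 1 happens to line up

# %in% asks, for each title, whether it appears anywhere in the set
by.in <- titles %in% wh.titles
by.in

mean(clap.score[by.equals])   # mean of an accidental, partial subset
mean(clap.score[by.in])       # mean of all titles starting with "Wh"
```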

krbrick commented 4 years ago

I do. poignantly and for posterity. Thanks for your guidance!