Watts-College / cpp-527-fall-2021

A course shell for CPP 527 Foundations of Data Science II
https://watts-college.github.io/cpp-527-fall-2021/

Lab3 - Q1 #15

Open WSKQ23 opened 3 years ago

WSKQ23 commented 3 years ago

Hello @lecy, I am trying to put together Lab3-Q1, but I am confused about how to get all the titles listed in Q1. I tried to get all values that start with “How” using:

How <- grepl("How", d$title)
How

I get the values containing "How", but there are too many, so I think I need to narrow them down.

lecy commented 3 years ago

The grepl() function is correct.

You need to refine your query a little more. You will use a regular expression operator to identify only titles that BEGIN WITH "how".
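
A minimal sketch of the difference, using a made-up `titles` vector rather than the lab data. It assumes that `^` (the start-of-string anchor) is the operator meant here, and that `ignore.case = TRUE` is an acceptable way to catch both "How" and "how":

```r
# hypothetical titles, not the lab data
titles <- c( "How to learn R", "Show me how", "how regex works", "Why R?" )

# matches "how" anywhere in the title, including inside "Show"
anywhere <- grepl( "how", titles, ignore.case = TRUE )

# ^ anchors the match to the beginning of the string
starts.with.how <- grepl( "^how", titles, ignore.case = TRUE )

titles[ starts.with.how ]
```

Comparing `anywhere` with `starts.with.how` shows which titles the anchor filters out.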

Jana-Ajeeb commented 3 years ago

Hello,

I'm trying to run this:

Ques <- grepl("\\?", d$title)

But this error is showing:

Error in d$title : $ operator is invalid for atomic vectors

lecy commented 3 years ago

What does your d object contain?

class( d )
head( d )
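
For reference, a small sketch of how this error can arise. It assumes `d` was overwritten with a plain character vector; the values below are made up for illustration:

```r
# hypothetical: d has been overwritten with an atomic (character) vector
d <- c( "first title", "second title" )
class( d )   # no longer a data frame

# the $ operator fails on atomic vectors; tryCatch captures the error message
msg <- tryCatch( d$title, error = function( e ) conditionMessage( e ) )
msg

# once d is a data frame again, the same expression works
d <- data.frame( title = c( "Is this a question?", "A statement" ) )
has.question <- grepl( "\\?", d$title )
has.question
```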
Jana-Ajeeb commented 3 years ago

Thanks Dr., but I figured it out. I was using another object called d, so they got mixed up.

lecy commented 3 years ago

Too many objects in the kitchen!

lecy commented 3 years ago

See if you can come up with a more generic solution for HTML tags.

What pattern do they all follow? Can you write an expression to identify that pattern?
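
One possible pattern, as a sketch only: every tag opens with `<`, closes with `>`, and has no `>` in between. The string `x` below is made up, and this is not necessarily the expected lab answer:

```r
# hypothetical title containing an opening and a closing tag
x <- 'A <strong class="markup--strong">bold</strong> claim'

# "<", then one or more characters that are not ">", then ">"
clean <- gsub( "<[^>]+>", "", x )
clean
```

Note that the same pattern would also remove anything else wrapped in angle brackets, so check the results before relying on it.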

These can all be treated like regular words:

4chan/pol, r/Braincels, and r/TheRedPill

They are referencing specific chat rooms on Reddit and 4chan.

lecy commented 3 years ago

Note that + is a regex operator so it needs to be escaped.

Does <U work without the +?
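
A quick way to check, sketched on a made-up string `x` that contains the literal text `<U+200A>`:

```r
x <- "before<U+200A>after"   # hypothetical title text

# unescaped: + means "one or more U", so the literal plus sign is not matched
unescaped <- grepl( "<U+200A>", x )

# escaped: \\+ is a literal plus sign
escaped <- grepl( "<U\\+200A>", x )

# fixed = TRUE turns regular expressions off, so nothing needs escaping
literal <- grepl( "<U+200A>", x, fixed = TRUE )

# without the + there is no operator in the pattern at all
prefix <- grepl( "<U", x )

c( unescaped, escaped, literal, prefix )
```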

lecy commented 3 years ago

Two examples for the HTML tags:

# HTML TAGS THAT CONTAIN QUOTES: 
<strong class=\"markup--strong markup--h3-strong\">

# DOES NOT WORK BECAUSE STRING GETS BROKEN UP 
"<strong class="markup--strong markup--h3-strong">"
"<strong class="       markup--strong markup--h3-strong">"

# SOLUTION 
d$title <- gsub( '<strong class=\"markup--strong markup--h3-strong\">', "", d$title )

# ESCAPE CHARACTERS TO DO A LITERAL SEARCH FOR REGEX OPERATOR +
d$title <- gsub( "<U\\+200A>—<U\\+200A>", "", d$title )
lecy commented 3 years ago

Regarding the other case, how would you define the OTHER group here where OTHER means does not belong to groups A, B, or C?

Hint, you do NOT define it with a regular expression.

df <- data.frame( ID, A, B, C )
df

ID  A   B   C
1   1   0   0
2   0   1   0
3   0   0   1
4   0   0   0
5   0   0   0

Cases 4 and 5 should belong to OTHER group.

lecy commented 3 years ago

grepl() returns a regular logical vector (every element is TRUE or FALSE). Recall how we combine logical vectors:

df <- data.frame( ID, A, B, C )
df

ID  A   B   C
1   T   F   F
2   F   T   F
3   F   F   T
4   F   F   F
5   F   F   F

group <- A | B | C   # belongs to any one of the three
other <- ! ( A | B | C )  # doesn't belong to any 
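
Putting the pieces together as a sketch. The titles and the three patterns below are invented for illustration and are not the lab's actual groups:

```r
# hypothetical titles and group definitions
titles <- c( "How to code", "Why R?", "10 tips", "A story", "Notes" )

A <- grepl( "^How", titles )     # starts with "How"
B <- grepl( "\\?$", titles )     # ends with a question mark
C <- grepl( "^[0-9]", titles )   # starts with a digit

# OTHER is defined from the existing logical vectors, not from a new regex
other <- ! ( A | B | C )

df <- data.frame( titles, A, B, C, other )
df

# same idea when the groups are stored as 0/1 columns
other.01 <- rowSums( df[ , c( "A", "B", "C" ) ] ) == 0
```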
lecy commented 3 years ago

You are getting a NaN (not a number) because you are trying to do mathematical operations with character vectors, I think.

What does this return?

d$title == "power.group"

You are close, but you are forgetting how to combine logical vectors with other vectors.

Your group is already a logical vector, so this comparison is not meaningful for two reasons. First, when you put quotes around "power.group" it becomes a string, not an object name. Second, comparing a character vector to a logical vector does not give a meaningful result.

d$title == "power.group"

# combining character and logical 
# 
# "a" == TRUE
# "b" == TRUE
# "c" == FALSE 

x1 <- c("a","b","c")
x2 <- c(T,T,F)

x1 == x2
[1] FALSE FALSE FALSE   # nothing matches, so the subset is empty and its mean is NaN

Instead use the group vector to subset the clap score directly:

# compare outcomes by group
mean( clap.score[   group.name ] )  # average score for group members
mean( clap.score[ ! group.name ] )  # score for titles not in the group 

You still might need to add the na.rm=TRUE argument to mean(). I don't recall if there are missing values or not.

Make sense?

# equivalent dplyr approach 
library( dplyr )

d$clap.score <- log( d$claps + 1 ) 
d$group.name <- grepl( ... )

d %>% 
  group_by( group.name ) %>% 
  summarize( ave=mean( clap.score ) )
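
To see where a NaN can come from, here is a base R sketch with made-up titles and scores. It assumes the NaN arises from taking the mean of an empty subset, which is what happens when the comparison matches nothing:

```r
# hypothetical data
title       <- c( "How to win", "A story", "How to lose", "Notes" )
clap.score  <- c( 2.1, 3.4, 1.0, 4.2 )
power.group <- grepl( "^How", title )

# quoted, so this looks for titles equal to the string "power.group"
wrong <- title == "power.group"
wrong

# every element is FALSE, so the subset is empty and the mean is NaN
empty.mean <- mean( clap.score[ wrong ] )

# use the logical vector itself to subset
in.group  <- mean( clap.score[   power.group ] )
out.group <- mean( clap.score[ ! power.group ] )

# both group means at once, no packages needed
by.group <- tapply( clap.score, power.group, mean )
by.group
```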
lecy commented 3 years ago

How do you access the last element of a vector?

length( word.vector ) # number of words
word.vector[1] # first word
lecy commented 3 years ago

What do you change it to though? First is easy because all vectors have a first element. But titles are different lengths so the code needs to be dynamic to select the last element.

You could reverse the order and then select the first position again.

Or you can use length to find the last position.

lecy commented 3 years ago

It's actually just:

word.vector[ length(word.vector) ]
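
A short sketch of the options mentioned above, using a made-up title and assuming the words are split on single spaces:

```r
# hypothetical title, split into words
title <- "How to select the last word"
word.vector <- strsplit( title, " " )[[1]]

# use length to find the last position
last.by.length <- word.vector[ length( word.vector ) ]

# reverse the order, then take the first position
last.by.rev <- rev( word.vector )[1]

# tail() returns the last n elements
last.by.tail <- tail( word.vector, 1 )

c( last.by.length, last.by.rev, last.by.tail )
```

All three work regardless of how many words the title contains.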