Watts-College / cpp-527-fall-2021

A course shell for CPP 527 Foundations of Data Science II
https://watts-college.github.io/cpp-527-fall-2021/

P3-Q2 #33

AhmedRashwanASU opened this issue 2 years ago

AhmedRashwanASU commented 2 years ago

P3-Q2. Take a random sample of 20 of the organizations in your sample and verify that your search criteria are identifying nonprofits that serve Black communities.

sample <- dplyr::sample_n( d.immigrant, 20 )

Report your rate of false positives in this sample (how many organizations in your sample do NOT belong there).

Can anyone help with how to report the rate of false positives in this sample (i.e., how many organizations in the sample do NOT belong there)? Should we create a corpus from the dataset we built in the previous question?

AhmedRashwanASU commented 2 years ago

It looks like @voznyuky was also waiting to post this question.

voznyuky commented 2 years ago

@AhmedRashwanASU I'm meeting with the professor later today but I am also stuck on this question. Maybe I'll figure it out on here and not need the meeting after all :)

AhmedRashwanASU commented 2 years ago

I guess this has to do with quanteda.textstats. Let's wait for Prof. @lecy to save us.

mtwelker commented 2 years ago

I interpreted it to mean that I should review my sample of 20 and notice how many didn't belong there. So if, for example, 2 of those 20 were NOT organizations providing services to Black communities, I would report a false positive rate of 10%. But I'll wait to see what Prof. @lecy says.

voznyuky commented 2 years ago

I was thinking it's something simple, but my brain wants to overcomplicate everything. I'm sure it's something like that, @mtwelker.

AhmedRashwanASU commented 2 years ago

criteria.06 <- grepl( "black", sample$mission )
count( criteria.06 )

I'm not sure if I'm constructing it correctly, but the TRUEs were 7 out of 10 samples, which means I have a 30% error rate. Is that acceptable? And how can it be improved to decrease the error rate? Is it all about the keywords we use to identify Black communities? Would using the term "African American" help minimize the error?

lecy commented 2 years ago

Yes, @mtwelker is correct. Similar to last week, you are matching patterns, and those patterns can do a better or worse job of identifying in-group cases. For example, for the power lists:

grep( "^[0-9]", d$title, value=TRUE ) %>% head() 
## [1] "3 tips for successful UX Research"                                     
## [2] "6 Ways to Increase Your Engagement Rate on Instagram"                  
## [3] "2024 is Mr. Orwell’s 1984"                                             
## [4] "3 Keys to “Drive” Away Daily Stress"                                   
## [5] "3 Tips for Sticking to New Habits When Travelling"                     
## [6] "7 Concepts From Network Science To Make and Strengthen Key Connections"

The title "2024 is Mr. Orwell’s 1984" is not a list, even though it starts with a number, so it is a false positive. Here 5 of 6 are correct, giving a false positive rate of 1/6, or about 17%.
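To make the tally explicit, here is a quick sketch (the is.list vector is coded by hand from reading the six titles above):

titles <- grep( "^[0-9]", d$title, value=TRUE ) %>% head()
is.list <- c( TRUE, TRUE, FALSE, TRUE, TRUE, TRUE )   # FALSE = "2024 is Mr. Orwell’s 1984"
sum( ! is.list ) / length( is.list )                  # false positive rate
## [1] 0.1666667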

Improving our regular expression improves our precision:

grep( "^[0-9]{1,2} ", d$title, value=TRUE ) %>% head()
## [1] "3 tips for successful UX Research"                                     
## [2] "6 Ways to Increase Your Engagement Rate on Instagram"                  
## [3] "3 Keys to “Drive” Away Daily Stress"                                   
## [4] "3 Tips for Sticking to New Habits When Travelling"                     
## [5] "7 Concepts From Network Science To Make and Strengthen Key Connections"
## [6] "4 Lessons on Motivation from the Greek Hero Odysseus"

Note that the trade-off is that you are more likely to EXCLUDE cases that do belong, which is a FALSE NEGATIVE.
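For example, a list title that starts with three digits (a hypothetical title, not from the dataset) would now be missed:

grepl( "^[0-9]{1,2} ", "100 Ways to Improve Your Writing" )   # FALSE - missed by the stricter pattern
grepl( "^[0-9]", "100 Ways to Improve Your Writing" )         # TRUE  - caught by the looser pattern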

It's much harder to measure rates of false negatives because they might occur in only 1 out of 500 of the out-group cases. So I don't have you measure that explicitly. But the metric that I pay attention to is the total number of true positives:

TOTAL CASES x RATE OF TRUE POSITIVES = TRUE MISSIONS
( 1000 ) x ( 0.80 ) = 800 TRUE MISSIONS

If you are interested more in the formal definitions of error rates in classifiers:

https://nonprofit-open-data-collective.github.io/machine_learning_mission_codes/accuracy/

lecy commented 2 years ago

@AhmedRashwanASU

I'm not sure why you are recoding sample$mission here:

criteria.06 <- grepl( "black", sample$mission )
count( criteria.06 )

You should be developing a set of missions that match your expressions, then manually confirming they are correct (like the example of titles I gave above).

It's more like:

criteria.01 <- grepl( "some expression", dat$mission ) 
criteria.02 <- grepl( "other expression", dat$mission ) 
group.logical.vec <- criteria.01 | criteria.02  # new logical vector

# only missions belonging to the group
group.missions <- dat$mission[ group.logical.vec ]  # returns a character vector 

set.seed( 123 )  # so you get the same sample each time
sample.missions <- sample( group.missions, size=20 )  # code these

Then manually inspect each of the 20 and decide how many are correct and how many are false positives. For example, the Blackrock Nature Preserve would describe an environmental nonprofit, not one that serves the Black community. So it is a false positive.
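One way to screen out names like "Blackrock" automatically is a word-boundary pattern; this is just a sketch, and the exact keywords are up to you:

grepl( "\\bblack\\b", "Blackrock Nature Preserve", ignore.case=TRUE )   # FALSE - whole word only
grepl( "\\bblack\\b", "serving Black communities", ignore.case=TRUE )   # TRUE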

You should report your false positive rate as ( # false positives ) / 20.

You can show your work by creating a T/F vector where TRUE stands for true positive and FALSE stands for false positive, then include the sample of 20 in your document as:

library( dplyr )    # for %>%
library( pander )   # for pander()

x <- c( "mission 1", "mission 2", "mission 3" )
y <- c( T, F, T )
data.frame( x, y ) %>% pander()   # print table of results
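And to turn that same T/F vector into the reported rate (assuming TRUE marks a true positive):

sum( ! y ) / length( y )   # ( # false positives ) / ( sample size )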