Lab 4 - Part III - GitHub Issues

imanprs commented 4 years ago

Hello Prof @lecy

I have a couple of questions regarding part III of lab 4.

1) Do we have to exclude organizations whose activity code is humanities, or was that only the case for the immigrants example? The description says "membership organizations for Black communities or provide services to Black communities", but I am not sure which activity codes would cover membership organizations and service organizations.

2) Without excluding humanities, I used a set of criteria which leads to choosing 193 organizations (and they all seem to be really about the black community). This seems rather small given the size of the data. But, interestingly, the number of organizations with the word "black" in their missions is pretty close (186). Am I missing something?

test <- grepl("black", dat$mission)
sum(test)
[1] 186

lecy commented 4 years ago

(1) The terms "membership" and "services for" are meant broadly, to include all of the nonprofits that serve the Black community. That covers organizations that offer a community people can join (it doesn't have to be a formal membership organization; it could be a club, sports team, or artistic group) and organizations that provide services targeting minorities, whether inclusive of or exclusively for Black communities.

(2) The question is asking you to apply some of the concepts related to disambiguation and compound words. For example, you will find "African American" and "Trans-African" organizations. One of the most famous organizations is the National Association for the Advancement of Colored People (NAACP). So the exercise is about finding ways to code textual data that produces consistent but comprehensive results.

If you get too flexible and inclusive with your search terms you will identify organizations that are outside of your sample, but too narrow and you miss the majority of orgs that should be in your sample.
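The compound terms mentioned above can be folded into one regular expression. The following is a minimal sketch, assuming a toy `missions` vector invented for illustration (not the lab data) and a hypothetical term list that is only a starting point, not the expected answer:

```r
# Toy mission statements, invented for illustration only
missions <- c(
  "Serving the African American community of Phoenix",
  "National Association for the Advancement of Colored People chapter",
  "Preserving blacksmith traditions in rural towns",
  "Youth soccer league for all children"
)

# Compound terms joined with | (alternation);
# [ -] accepts either a space or a hyphen between the two words
pattern <- "african[ -]american|colored people|naacp"

these <- grepl( pattern, missions, ignore.case = TRUE )
these          # one TRUE/FALSE per mission statement
sum( these )   # number of matching organizations
```

Because the pattern uses specific compound terms, the blacksmith statement is not matched, whereas a bare search for "black" would have flagged it.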

gzbib commented 4 years ago

Hello everyone,

I still have some concerns for part 3:

1- How do I know what to exclude from my criteria, like "humanities" in the immigrants example?

2- When I used grep(), I got correct, matching results. However, when I started constructing the groups and printed my logical statements, I noticed that all of them were FALSE, yet I got a sum of 205, which doesn't make sense.

3- After taking a sample of 20, what if the randomly chosen organizations don't meet any of the criteria I assigned earlier? I re-ran grepl() with the same criteria on the sample I chose, but only one result matched.

4- What do you mean by the rate of false positives?

Thanks a million

lecy commented 4 years ago

@gzbib Can you please provide some code? It's hard to tell what's not working without it.

When I used grep (), I was getting correct and matching results. However, when I started constructing the groups, and then I printed my logical statements, I noticed that all of them are FALSE. Though, I got a sum of 205 which doesn't make sense.

What is your code?

After taking a sample of 20, what if the organizations that are chosen randomly, don't meet any of the criteria I assigned earlier?

Again, code on how you generate your sample would be helpful.

If the sampling within results is correct and none of the examples meet your requirements, it means your code to identify groups is deficient. It is probably too inclusive, so it identifies LOTS of orgs that are outside the scope of your intended sample.

What do you mean by the rate of false positives?

It is the share of results in the group of 20 that are not orgs you wanted in your sample. A result is a positive because it was identified by your search functions (grep), and a false positive because it does not actually belong to the group you are trying to define.

Text is messy because it is not precise. There is a trade-off: making your regular expressions more inclusive helps you identify all of the orgs you want in your study (true positives), but it also raises the rate of false positives. The trick is to find expressions that balance these two things.
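The rate described above can be computed once the 20 sampled orgs have been read and labeled by hand. This is a minimal sketch in which the `reviewed` labels are invented for illustration; in practice they would come from reading each sampled mission statement:

```r
# Hypothetical manual review of a sample of 20 matched orgs:
# TRUE  = the org really belongs in the study (true positive)
# FALSE = grepl() matched it, but it is out of scope (false positive)
reviewed <- c( rep( TRUE, 14 ), rep( FALSE, 6 ) )

# Share of the sample that was matched but does not belong
false.positive.rate <- mean( ! reviewed )
false.positive.rate
```

Here 6 of the 20 made-up labels are out of scope, so the rate is 6/20.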

gzbib commented 4 years ago

Hello Sir, yes sure let me elaborate with some code:

1- In the immigrants example, we did this:

criteria.05 <- ! grepl( "humanities", dat$mission )  # exclude humanities

My question is: on what basis did we decide to exclude humanities? Should we search the dataset for terms that would help narrow down the criteria?
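One possible way to choose exclusion terms is to read the matches from a broad search and look for recurring out-of-scope phrases. This is a sketch only, using a toy `missions` vector and exclusion terms that are both invented for illustration:

```r
# Toy mission statements, invented for illustration only
missions <- c(
  "supporting black women entrepreneurs",
  "black hills trail conservation",
  "blacksmith guild and forge museum",
  "scholarships for black students"
)

hits <- grepl( "black", missions )
missions[ hits ]   # read the matches to spot out-of-scope terms

# Hypothetical exclusions chosen after reading the matches above
exclude <- grepl( "black hills|blacksmith", missions )

keep <- hits & ! exclude   # matched AND not excluded
missions[ keep ]
```

The exclusion list grows out of what actually appears in the matches, which is why it differs from one search topic to another.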

2- I think I figured this one out. I thought I was getting only FALSE when running this code:

criteria.07 <- grepl( "african american", dat$mission, ignore.case = TRUE )
criteria.07

However, I was able to find some TRUE values in the output after all.
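Printing a long logical vector can look like all FALSE when matches are rare, even though the sum is positive. A small sketch of ways to inspect such a vector, using a stand-in `x` whose length of 10,000 is an assumption (the 205 comes from the earlier comment):

```r
# Stand-in for a long grepl() result: 205 TRUE values among 10,000
x <- rep( FALSE, 10000 )
x[ seq( 40, by = 48, length.out = 205 ) ] <- TRUE

head( x, 20 )        # the first 20 values are all FALSE here
sum( x )             # TRUE counts as 1, so this is the number of matches
table( x )           # counts of FALSE and TRUE side by side
head( which( x ) )   # row positions of the first few matches
```

`sum()` and `table()` summarize the whole vector, so they are more reliable than scanning the printed output by eye.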

I am not sure if I should be this specific in my criteria:

criteria.01 <- grepl( "women of color", dat$mission, ignore.case = TRUE )
criteria.02 <- grepl( "national black", dat$mission, ignore.case = TRUE )
criteria.03 <- grepl( "black boys", dat$mission, ignore.case = TRUE )
criteria.04 <- grepl( "black women", dat$mission, ignore.case = TRUE )
criteria.05 <- grepl( "black community", dat$mission, ignore.case = TRUE )
criteria.06 <- grepl( "black", dat$mission, ignore.case = TRUE )
criteria.07 <- grepl( "african american", dat$mission, ignore.case = TRUE )

What if none of the terms above appear in the sample that I draw?

3- The sample I chose:

dat$mission <- tolower( dat$mission )
dat.sample <- dat[ sample( 1:1000, size=20) , ]
dat.sample

I realized that the organizations that were chosen in the sample include only 2 that serve black communities.
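Note that `sample( 1:1000, size=20 )` draws from the first 1,000 rows of the whole dataset regardless of the criteria, which may explain why so few sampled orgs fit. A minimal sketch of sampling from the matched rows instead, assuming the intent is to review orgs flagged by any criterion; `dat.toy` and the two criteria are invented for illustration:

```r
# Toy data frame invented for illustration; the lab's dat is much larger
dat.toy <- data.frame(
  name = paste0( "org", 1:200 ),
  mission = rep( c( "services for the african american community",
                    "youth soccer league",
                    "animal shelter",
                    "food bank" ), times = 50 ),
  stringsAsFactors = FALSE
)

criteria.a <- grepl( "african american", dat.toy$mission, ignore.case = TRUE )
criteria.b <- grepl( "naacp", dat.toy$mission, ignore.case = TRUE )
these <- criteria.a | criteria.b     # TRUE if any criterion matches

dat.matched <- dat.toy[ these, ]     # keep only the identified orgs
set.seed( 1 )                        # makes the sample reproducible
dat.sample <- dat.matched[ sample( nrow( dat.matched ), size = 20 ), ]
dat.sample
```

Every row of `dat.sample` now comes from the matched group, so reading those 20 missions shows how many matches are false positives.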

4- Noted

DS4PS / cpp-527-fall-2020

Lab 4 - Part III #25