DS4PS / cpp-527-spr-2020

Course shell for CPP 527 Foundations of Data Science II for Spring 2020.
http://ds4ps.org/cpp-527-spr-2020/

LAB 04 #8

Open sunaynagoel opened 4 years ago

sunaynagoel commented 4 years ago

Part 1

Question 3

I think I have identified how many strings have trailing white spaces. I tried to remove them using trimws(). My question is: once I have removed the white spaces, if I run my code to find white spaces in the mission field again, it should return zero or no matches. But that is not the case.

grep(" $", x=dat$mission, value = TRUE, perl = T) %>% head() %>% pander()
grepl( " $", x=dat$mission) %>% sum()

The output is a set of blank-looking strings ( , , , , and ) followed by the count: [1] 3464

trimws(dat$mission, "r")

Even after running this code, the commands

grep(" $", x=dat$mission, value = TRUE, perl = T) %>% head() %>% pander()
grepl( " $", x=dat$mission) %>% sum()

return the same result: the blank-looking strings and [1] 3464.

Not sure what is going wrong.

sunaynagoel commented 4 years ago

@lecy not sure what happened there, but accidentally I opened two LAB 04 issues. I closed one but thought you may want to delete it.

lecy commented 4 years ago

You are throwing but not catching! You need to assign the trimmed missions back to a new variable. Try:

# assign the trimmed missions back to the data frame
dat$mission <- trimws( dat$mission, "r" )

# recount: how many missions still end with a white space?
grepl( " $", x=dat$mission ) %>% sum()

Note that you are counting mission statements with a single white space and no text. That is different than "how many strings have trailing white spaces".

You would need to specify [ any text ] [ white space ] [ end of line ].

sunaynagoel commented 4 years ago

You are throwing but not catching! You need to assign the trimmed missions back to a new variable. Try:

dat$mission <- trimws( dat$mission, "r")
grepl( " $", x=dat$mission) %>% sum()

Note that you are counting mission statements with a single white space and no text. That is different than "how many strings have trailing white spaces".

You would need to specify [ any text ] [ white space ] [ end of line ].

Thank you, this helps. As far as "strings with trailing spaces" goes, does this work? "^.+\t\n\r\f$" I was trying to replace \t\n\r\f with \s but my R is not recognizing it.

lecy commented 4 years ago

It could be a wildcard, or a selector set. I have not tried this code so this is more pseudocode.

"* $"
"[alphanumeric] ^"

But basically something that says "any letter number or punctuation, then a space, then end of line."
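For example (a rough sketch I have not run; note that a regex shorthand like \s has to be written as \\s inside an R string):

# count mission statements with text followed by a trailing space
grepl( ".+ $", x=dat$mission ) %>% sum()

# same idea with the escaped white-space class
grepl( ".+\\s$", x=dat$mission, perl=T ) %>% sum()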

sunaynagoel commented 4 years ago

Part II

I have a conceptual question about creating the dictionary. What happens if we don't create a dictionary for our corpus? Also, just to be clear, the create-dictionary code (which is provided) is trying to find compound words and put them under one header. For example, non_profit=c("non-profit", "non profit") is asking R to look for "non-profit" or "non profit" and put both under the single category non_profit?

my_dictionary <- dictionary( list( five01_c_3= c("501 c 3","section 501 c 3") ,
                             united_states = c("united states"),
                             high_school=c("high school"),
                             non_profit=c("non-profit", "non profit"),
                             stem=c("science technology engineering math", 
                                    "science technology engineering mathematics" ),
                             los_angeles=c("los angeles"),
                             ny_state=c("new york state"),
                             ny=c("new york")
                           ))

# apply the dictionary to the text 
tokens <- tokens_compound( tokens, pattern=my_dictionary )
head( tokens )
lecy commented 4 years ago

The dictionary simplifies the data by turning these compound words into a single word. It's part of disambiguation.

If you don't apply it, your data is just a little noisier. It depends on the application - if you are very interested in a specific concept in your corpus ("President Bush") you might spend a lot of time making sure you capture all of the variants ("GW", "George W Bush", "Bush Jr", NOT "George HW Bush", etc.).

And correct - the dictionary is mapping all of the phrases on the right to the single term on the left. It is a find-and-replace operation.
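A minimal sketch of the find-and-replace behavior, assuming quanteda is loaded and my_dictionary is the object defined above (matched phrases are joined with an underscore):

library( quanteda )

# one example sentence
toks <- tokens( "we are a non profit in new york state" )

# "non profit" becomes non_profit, "new york state" becomes new_york_state
tokens_compound( toks, pattern=my_dictionary )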

jmacost5 commented 4 years ago

I am not understanding how to solve the second part; I am getting confused about how to start it.

sunaynagoel commented 4 years ago

I am not understanding how to solve the second part; I am getting confused about how to start it.

Hello @jmacost5, for Part II I started with the code provided in the instructions and skipped the sampling part. Hope this helps. ~Nina

lecy commented 4 years ago

@jmacost5 I'm going to need more information to answer your question. The instructions are:

Replicate the steps above with the following criteria:

- Use the full mission dataset, not the small sample used in the demo.
- Add at least ten concepts to your dictionary to convert compound words into single words.
- Report the ten most frequently-used words in the mission statements after applying stemming.

Which part is unclear?

sunaynagoel commented 4 years ago

Challenge Question

@lecy When I try to look inside code01 to get an idea of how to divide it into three better sub-sectors, I find only one value, "A", in all the entries. My questions are: (a) How do I divide into sub-sectors if all the values are identical? (b) Am I reading the question wrong?

lecy commented 4 years ago

@sunaynagoel

URL <- "https://github.com/DS4PS/cpp-527-spr-2020/blob/master/labs/data/IRS-1023-EZ-MISSIONS.rds?raw=true"
 dat <- readRDS(gzcon(url( URL )))

table( dat$code01 )
   A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T    U    V    W    X    Y    Z 
5325 7603  922 2359 1571 1378  699  252  633  417  943  700  828 6488 3683 7782  530  607 2483 2199  295   78 2261 3778  345  614 
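One way to collapse the letter codes into a few sub-sectors is a simple lookup. A sketch (the three groupings here are illustrative, not official NTEE categories):

# illustrative grouping of the code01 letters into three sub-sectors
dat$sub.sector <- "other"
dat$sub.sector[ dat$code01 %in% c("A","B") ] <- "arts_education"
dat$sub.sector[ dat$code01 %in% c("E","F","G","H") ] <- "health"

table( dat$sub.sector )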
sunaynagoel commented 4 years ago

@sunaynagoel

URL <- "https://github.com/DS4PS/cpp-527-spr-2020/blob/master/labs/data/IRS-1023-EZ-MISSIONS.rds?raw=true"
 dat <- readRDS(gzcon(url( URL )))

table( dat$code01 )
   A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T    U    V    W    X    Y    Z 
5325 7603  922 2359 1571 1378  699  252  633  417  943  700  828 6488 3683 7782  530  607 2483 2199  295   78 2261 3778  345  614 

Thanks. I had to reload the dataset, and now it is showing all the values.

jmacost5 commented 4 years ago

@jmacost5 I'm going to need more information to answer your question. The instructions are:

Replicate the steps above with the following criteria: Use the full mission dataset, not the small sample used in the demo. Add at least ten concepts to your dictionary to convert compound words into single words. Report the ten most frequently-used words in the mission statements after applying stemming.

Which part is unclear?

I guess the part where we make the compound words into single words. I do not understand how to do that. Do I make a function that removes all of them from the dictionary or just the few that are listed?

jmacost5 commented 4 years ago

I am not understanding how to solve the second part; I am getting confused about how to start it.

Hello @jmacost5, for Part II I started with the code provided in the instructions and skipped the sampling part. Hope this helps. ~Nina

I am not sure if I am missing something when it comes to removing the compound words; in the examples, the removal code is completely different:

# remove mission statements that are less than 3 sentences long
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )

# remove punctuation 
tokens <- tokens( corp, what="word", remove_punct=TRUE )
head( tokens )
sunaynagoel commented 4 years ago

@lecy The packages igraph and networkD3 are not available for the R version I have (3.6.1). Is there any way around this?

lecy commented 4 years ago

@sunaynagoel Please try installing via their GitHub version:

https://github.com/igraph/rigraph

devtools::install_github("gaborcsardi/pkgconfig")
devtools::install_github("igraph/rigraph")

networkD3

That's weird about networkD3. Is it that package, or a required package, that is not available?

You might try to download the Windows binary and install locally (packages >> install from local files)?

https://cran.r-project.org/web/packages/networkD3/index.html

lecy commented 4 years ago

@jmacost5

I guess the part where we make the compound words into single words. I do not understand how to do that. Do I make a function that removes all of them from the dictionary or just the few that are listed?

Here is the step where you translate compound words into single words:

my_dictionary <- dictionary( list( five01_c_3= c("501 c 3","section 501 c 3") ,
                             united_states = c("united states"),
                             high_school=c("high school"),
                             non_profit=c("non-profit", "non profit"),
                             stem=c("science technology engineering math", 
                                    "science technology engineering mathematics" ),
                             los_angeles=c("los angeles"),
                             ny_state=c("new york state"),
                             ny=c("new york")
                           ))

# apply the dictionary to the text 
tokens <- tokens_compound( tokens, pattern=my_dictionary )
head( tokens )

Your job is to generate n-grams to find phrases that should be combined into single words. That step helps generate options for you to explore, then you would manually translate your selections to the dictionary list. You will add additional phrases or words to the dictionary similar to the examples:

non_profit=c("non-profit", "non profit")

When applied these multi-word phrases are replaced in the text.

These steps are other pre-processing steps:

# remove mission statements that are less than 3 sentences long
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )

# splits each sentence into a list of words
# remove punctuation first
tokens <- tokens( corp, what="word", remove_punct=TRUE )

# apply the dictionary to the text 
tokens <- tokens_compound( tokens, pattern=my_dictionary )

Try: help( tokens_compound ) when quanteda is loaded. It will take you to the documentation files.
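One way to generate candidates is to count the most frequent bigrams (a sketch, untested on this dataset, assuming the tokens object from above):

# count the most common two-word sequences to find candidate compound words
ngrams <- tokens_ngrams( tokens, n=2 )
topfeatures( dfm( ngrams ), 25 )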

jmacost5 commented 4 years ago

I wanted to know if there is a way to see the terms that are in the document, or am I missing it from what we previously did? I am trying to identify the terms used for the third part.

lecy commented 4 years ago

@jmacost5 I'm not sure what you mean by "see the terms that are on the document" ?

We are working with mission statements. After loading the data you can view the mission statements as:

dat$mission

If you want to browse in a spreadsheet view you can type:

View( dat )

Or you could write the data as a CSV file and open it in Excel:

getwd()  # where file will write to
write.csv( dat, "missions.csv" )
castower commented 4 years ago

Hello all,

I've run into a dilemma with solving problem one. It's my understanding that adding "^" to a pattern ensures that the pattern begins the sentence. However, the instructions also say to ignore capitalization, and I can't figure out how to get the code to ignore capitalization; it only finds an exact match when ^ is added. Is it alright if I run searches for the different capitalization styles separately and then just sum them?

Thanks! Courtney

jrcook15 commented 4 years ago

Hello all,

I've run into a dilemma with solving problem one. It's my understanding that adding "^" to a pattern ensures that the pattern begins the sentence. However, the instructions also say to ignore capitalization, and I can't figure out how to get the code to ignore capitalization; it only finds an exact match when ^ is added. Is it alright if I run searches for the different capitalization styles separately and then just sum them?

Thanks! Courtney

Hi Courtney,

I tried "^[Tt]+[Oo] " and I believe it worked.

castower commented 4 years ago

@jrcook15 Thank you! That does seem to work, but I can't tell if it excludes values like 'Tooele' that start with 'to'. I currently have spaces following my patterns, such as "^to ", to try to avoid this. I'm probably overthinking it!

jrcook15 commented 4 years ago

@jrcook15 Thank you! That does seem to work, but I can't tell if it excludes values like 'Tooele' that start with 'to'. I currently have spaces following my patterns, such as "^to ", to try to avoid this. I'm probably overthinking it!

There is a space after the [Oo], before the closing quote; that should eliminate 'Tooele'.

castower commented 4 years ago

@jrcook15 ah, okay!! Thank you :)

jmacost5 commented 4 years ago

@jmacost5 I'm not sure what you mean by "see the terms that are on the document" ?

We are working with mission statements. After loading the data you can view the mission statements as:

dat$mission

If you want to browse in a spreadsheet view you can type:

View( dat )

Or you could write the data as a CSV file and open it in Excel:

getwd()  # where file will write to
write.csv( dat, "missions.csv" )

I am confused about how to look for other terms instead of the word "black". I can honestly only think of "African American".

castower commented 4 years ago

Hello all, I'm currently working on trying to remove my trailing whitespaces. I currently have the following code:

dat$mission <- trimws(dat$mission, which = c("right"), whitespace = "* $" )

But I keep getting this error message:

Error in sub(re, "", x, perl = TRUE) : invalid regular expression '* $+$'

I'm not sure how to fix this.

sunaynagoel commented 4 years ago

@jrcook15 Thank you! That does seem to work, but I can't tell if it excludes values like 'Tooele' that start with 'to'. I currently have spaces following my patterns, such as "^to ", to try to avoid this. I'm probably overthinking it!

There is a space after the [Oo], before the closing quote; that should eliminate 'Tooele'.

@jrcook15 @castower Using ignore.case=T also works, to make sure "to", "TO", "To", and "tO" are all considered. Also, instead of using a space after the "o", I used \b, and it seemed to work for me: grep("^to\\b", x=dat$mission, value = TRUE, ignore.case = T)

sunaynagoel commented 4 years ago

Hello all, I'm currently working on trying to remove my trailing whitespaces. I currently have the following code:

dat$mission <- trimws(dat$mission, which = c("right"), whitespace = "* $" )

But I keep getting this error message:

Error in sub(re, "", x, perl = TRUE) : invalid regular expression '* $+$'

I'm not sure how to fix this.

@castower I was getting the same error as well, but removing the criteria worked for me: dat$mission <- trimws( dat$mission, "r" )
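For what it's worth, I think the whitespace argument expects a character class rather than a full pattern: trimws() appends its own quantifier and anchor, so "* $" becomes the invalid regex "* $+$". Something like this should work if you want to be explicit (a sketch):

# whitespace must be a character class; trimws() adds the "+$" itself
dat$mission <- trimws( dat$mission, which="right", whitespace="[ \t]" )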

castower commented 4 years ago

@sunaynagoel that fixed it! I've been working on this for hours now, lol. Thank you so much!

lecy commented 4 years ago

@castower @jmacost5 Note that the grep() family of functions contains an ignore.case argument:

grep( pattern, x, ignore.case = FALSE, ... )

This is very clever though!

"^[Tt]+[Oo] "
lecy commented 4 years ago

@jmacost5

I am confused on how to look for the terms instead for the word "black". I can honestly only think of African American

That is the hard and interesting part of the assignment. One thing you learn quickly when working with text is the usefulness of iteration. We know that "black" is ambiguous (it could be used for a lot of things in mission statements), but "African American" is probably not. So search for missions that contain that term, then look for other key words or phrases.

You just keep adding phrases until the process is not improving outcomes much at all.

You can also google some topics to try and find some words or phrases. If you try "nonprofit + african american" you get:

https://www.huffpost.com/entry/28-organizations-that-are-empowering-black-communities_n_58a730fde4b045cd34c13d9a

This gives you ideas like "black heritage", "black lives", and "women of color". It will just be trial and error: making sure you don't add words that introduce non-matches, and trying not to miss words that would add a lot of matches.
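A pattern of alternatives makes that iteration easy. A sketch using the phrases mentioned above:

# grow the pattern one phrase at a time and watch the match count
pattern <- "african american|black heritage|black lives|women of color"
sum( grepl( pattern, dat$mission, ignore.case=TRUE ) )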

sunaynagoel commented 4 years ago

@sunaynagoel Please try installing via their GitHub version:

https://github.com/igraph/rigraph

devtools::install_github("gaborcsardi/pkgconfig")
devtools::install_github("igraph/rigraph")

networkD3

That's weird about networkD3. Is it that package, or a required package, that is not available?

You might try to download the Windows binary and install locally (packages >> install from local files)?

https://cran.r-project.org/web/packages/networkD3/index.html

These are the packages required for the challenge question, to make the word networks.

castower commented 4 years ago

Hello all, I'm currently working on summarizing my corpus data and I got the following error message:

nsentence() does not correctly count sentences in all lower-cased text

Is this okay? I still have a table produced, but I don't know if this will cause problems.

Thanks!

lecy commented 4 years ago

@castower That warning occurs at this step?

# remove mission statements that are less than 3 sentences long
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )

You can omit this step, and later convert tokens to all lower case:

# convert missions to all lower-case 
dat$mission <- tolower( dat$mission )

# after tokenization before counting terms: 
tokens <- tokens_tolower( tokens, keep_acronyms=TRUE )

But substantively it would not impact much for this lab to leave it in the original order. I suspect your results would not change either way.

lecy commented 4 years ago

@sunaynagoel Were you able to install either package? You don't need both - they will create similar network diagrams (the D3 version is interactive in an RMD HTML document, that's the only difference).

These are both popular packages, so I would be surprised if neither is working.

castower commented 4 years ago

Yes, thank you!


castower commented 4 years ago

@lecy for some reason,

dat$mission <- tolower( dat$mission )

did not work; however, I changed it to

corp <- tolower( corp )

and it worked fine. It did change my final numbers of frequency for the top 10 keywords slightly, but not very much.

Thanks!

sunaynagoel commented 4 years ago

@sunaynagoel Were you able to install either package? You don't need both - they will create similar network diagrams (the D3 version is interactive in an RMD HTML document, that's the only difference).

These are both popular packages, so I would be surprised if neither is working.

I was able to download igraph. Thanks

lecy commented 4 years ago

@castower Ok, great. I'll make a note of that error.

It makes sense why removing capitalization would hinder efforts to automatically identify sentences. Humans are pretty good at knowing when a sentence has ended, but for computers you might have periods representing abbreviations in the middle of a sentence.

Acme Inc. has good toys online.

So a period followed by lower-case suggests it is mid-sentence. If you remove case, those instances would be hard to identify, so you would end up with different splits.

The joys of text analysis!
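You can see the ambiguity directly (a quick sketch; the exact counts may vary by quanteda version):

# sentence detection with and without case information
nsentence( "Acme Inc. has good toys online. We ship fast." )
nsentence( "acme inc. has good toys online. we ship fast." )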

castower commented 4 years ago

@jmacost5 a key term you can use to search for a lot of organizations that serve Black/African American populations is diaspora, or more specifically African/Black diaspora. If you search diaspora generally, you can filter out the organizations that refer to other diasporas. Hope that helps!

castower commented 4 years ago

@lecy to clarify, for the challenge question are we examining the subset of the data related to the organizations serving Black communities, or the entire database? Thanks!

jmacost5 commented 4 years ago

I am trying to knit my lab and I keep getting this error even though I have all my packages updated and installed. [Screen Shot 2020-02-24 at 9 07 54 AM]

lecy commented 4 years ago

@castower The challenge questions would use the entire database.

lecy commented 4 years ago

@jmacost5 Did you include dplyr in your load libraries chunk?

jmacost5 commented 4 years ago

@jmacost5 Did you include dplyr in your load libraries chunk?

Yes I did, and I included it in my code. I am getting an error about 'corp' now; is there a package that I am missing? I put dpylr, pander, and quantda.

lecy commented 4 years ago

"quanteda" or "quantda" ?

jmacost5 commented 4 years ago

"quanteda" or "quantda" ?

quanteda

lecy commented 4 years ago

I would need more to go on to diagnose the problem (you haven't provided a lot of information or your code, so it is a bit of a guessing game). Do you want to send me the RMD file?

castower commented 4 years ago

@lecy, I thought so, but wanted to check! Thank you so much!


castower commented 4 years ago

Hello, I've run into an error with part two of the assignment. I'm currently working on trying to create a network for arts and I have the following code:

#tokens.cat2 are the tokens I got from Part 1 of the assignment and I'm reusing them here
arts.token.list <- as.list(tokens.cat2)
arts.token.list <- lapply( arts.token.list, function(x){ x[ ! grepl( "^$", x ) ] } )
arts.token.list[[1]]
listToNet <- function( x )
{

   word.pairs <- list()

   for( i in 1:length(x) )
   {
      x.i <- x[[i]]
      word.pairs[[i]] <- NULL
      if( length( x.i ) > 1 ) { word.pairs[[i]] <-  data.frame( t( combn( x.i, 2) ) ) }
      if( length( x.i ) > 1 ) { names( word.pairs[[i]] ) <-  c("from","to") }
   }

   return( word.pairs )

}

g.list1 <- listToNet( arts.token.list )
head( g.list1[[1]] )
# I created this variable because there was not an option for whether the
# organizations were art-related, so I assigned 1 to organizations that have
# 'Arts' at the start of their activity code (and 0 otherwise), as we did for
# the activity code variable in the assignment.

 dat$art <- ifelse( grepl( "^art", dat$activity.code, ignore.case = T ), 1, 0) 
table( dat$art, useNA="ifany" )
g.list.1 <- g.list1[ dat$art == 1 ]
m1 <- bind_rows( g.list.1 )
length( g.list.1 )
g.list.2 <- g.list1[ dat$art == 0 ]
m2 <- bind_rows( g.list.2 )
length( g.list.2 )

All the previous code works, but then when I reach this code:

g.art.yes <- graph.edgelist( as.matrix(m1), directed=FALSE )
g.art.no <- graph.edgelist( as.matrix(m2), directed=FALSE )

summary( g.art.yes )
summary( g.art.no )

I get the following error:

Error in graph.edgelist(as.matrix(m2), directed = FALSE) : graph_from_edgelist expects a matrix with two columns

@lecy