DS4PS / cpp-527-spr-2020

Course shell for CPP 527 Foundations of Data Science II for Spring 2020.
http://ds4ps.org/cpp-527-spr-2020/

LAB 04 #8

Open sunaynagoel opened 4 years ago

sunaynagoel commented 4 years ago

Part 1

3

I think I have identified how many strings have trailing white spaces. I tried to remove them using trimws(). My question is: once I have removed the white spaces, if I run my code to find trailing white spaces again in the mission field, it should return zero or no matches. But that is not the case.

grep(" $", x=dat$mission, value = TRUE, perl = T) %>% head() %>% pander()
grepl( " $", x=dat$mission) %>% sum()

, , , , and [1] 3464

trimws(dat$mission, "r")

Even after running this code the command

grep(" $", x=dat$mission, value = TRUE, perl = T) %>% head() %>% pander()
grepl( " $", x=dat$mission) %>% sum()

it returns the same result: , , , , and [1] 3464

I'm not sure what is going wrong.
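
One likely explanation, offered as a sketch rather than a confirmed diagnosis: trimws() returns a trimmed copy and leaves dat$mission unchanged unless the result is assigned back. The toy vector `mission` below is a hypothetical stand-in for dat$mission.

```r
# hypothetical stand-in for dat$mission (not the real data)
mission <- c("theater performances ", "arts education", " ")

# count strings that end in a space
sum(grepl(" $", mission))

# this only prints a trimmed copy; mission itself is not modified
trimws(mission, which = "right")
sum(grepl(" $", mission))   # same count as before

# assign the result back to actually change the vector
mission <- trimws(mission, which = "right")
sum(grepl(" $", mission))   # no trailing spaces remain
```

With the real data, the equivalent step would be dat$mission <- trimws( dat$mission, "r" ).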

lecy commented 4 years ago

@castower Can you send me the file by email please?

Note that in the dataset code01 and codedef01 tell you the subsectors if you want to identify them that way:

> head( dat )
        ein                           orgname
1 311767271              NIA PERFORMING ARTS 
2 463091113       THE YOUNG ACTORS GUILD INC 
3 824331000                   RUTH STAGE INC 
4 823821811 STRIPLIGHT COMMUNITY THEATRE INC 
5 911738135       NU BLACK ARTS WEST THEATRE 
6 824668235     OLIVE BRANCH THEATRICALS INC 
                                                                                                                                                                                                                                       mission
1                                                                                                                                         a community based art organization that inspires, nutures,educates and empower artist and community.
2         we engage and educate children in the various aspect of theatrical productions, through acting, directing, and stage crew. we produce community theater productions for children as well as educational theater camps and workshops.
3                                                                                                                                                                                                     theater performances and performing arts
4                                                                                                                                                                                                                                             
5                                                                                                                                                                                                                                             
6 to produce high-quality theater productions for our local community, guiding performers and audience members to a greater appreciation of creativity through the theatrical arts - while leading with respect, organization, accountability.
  code01                     codedef01 code02 codedef02 orgpurposecharitable
1      A Arts, Culture, and Humanities    A65   Theater                    1
2      A Arts, Culture, and Humanities    A65   Theater                    0
3      A Arts, Culture, and Humanities    A65   Theater                    1
4      A Arts, Culture, and Humanities    A65   Theater                    1
5      A Arts, Culture, and Humanities    A65   Theater                    1
6      A Arts, Culture, and Humanities    A65   Theater                    0
castower commented 4 years ago

@lecy thanks! I just sent over my RMD file. I will look into using the codes. -Courtney

lecy commented 4 years ago

@castower Just sent it back. A preview of one of the semantic networks:

[image: preview of a semantic network]

castower commented 4 years ago

Hello all,

So I have been working with the stringr functions a little more and I'm a bit confused about what I'm doing wrong.

I have created the following test data set:

test <- c("hello my name is Courtney")

and I am trying to extract everything after 'hello' so that I can get an output of

my name is Courtney

However, when I run the following:

str_extract_all(test,"(?<=hello )\\S{0,}") 

All that I'm getting is:

[[1]]
[1] "my"

Any tips on what I am doing wrong?

castower commented 4 years ago

Also, as a note, I tried str_split instead and for some reason, it deletes 'my':

code:

test <- c("hello my name is Courtney")
str_split(test,"(?<=hello )\\S{0,}") 

Output:

[[1]]
[1] "hello "            " name is Courtney"
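
A possible explanation for both results, as a sketch: \\S matches only non-whitespace characters, so \\S{0,} stops at the first space after "my". The pattern therefore matches just "my", which str_extract_all returns and which str_split treats as the delimiter to remove. Swapping in .* (any characters up to the end of the string) is one way to capture the rest. The example uses base R regmatches() with perl = TRUE so it runs without extra packages; the same pattern should also work in str_extract().

```r
test <- c("hello my name is Courtney")

# \\S{0,} = zero or more non-space characters, so the match ends at the first space
m1 <- regmatches(test, regexpr("(?<=hello )\\S{0,}", test, perl = TRUE))
m1

# .* = everything that follows "hello " up to the end of the string
m2 <- regmatches(test, regexpr("(?<=hello ).*", test, perl = TRUE))
m2
```
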
sunaynagoel commented 4 years ago

str_extract_all(test,"(?<=hello )\\S{0,}")

@castower I am not sure what is wrong, but your question was interesting enough for me to try it out on my own. When I run the code below, it eliminates every 'h', 'e', 'l', and 'o' from the entire string. It is not treating 'hello' as a word. I tried \b as well, but it did not work.

str_extract_all(test,"([^hello])") 

[[1]]
[1] " " "m" "y" " " "n" "a" "m" " " "i" "s" " " "C" "u" "r" "t" "n" "y"

But when I try word(), it works:

word(test, 2,-1)

[1] "my name is Courtney"

It would be nice to write a function that removes the first word of every sentence.

lecy commented 4 years ago

I am going to be honest that when people immediately default to tidyverse packages I feel a little like an old man. Get off my lawn, Hadley Wickham!

I find that tidy packages are really great at scaling operations. Once you know how to do them, then they make it easier and faster to accomplish. They are not always great when you are just learning a new skill in R because they try to be clever and protect you from some of the complicated parts of the code, and they are written in a way that tries to generalize each step at scale. As a result, you lose some of the intuition about what is happening.

For example, group_by( f1, f2 ) %>% mutate( n=n() ) %>% ungroup() is super easy and efficient to write, but behind the scenes the data is being split into many smaller datasets, variables are summarized on subsets, and then everything is recombined in a way that reconciles all of the dimensions correctly. The actual process is not obvious to the neophyte. I used to have to do all of the steps individually, so now I see how great that code is and how much time it saves.
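
To make the split, summarize, and recombine steps concrete, here is a minimal base R sketch. The data frame `dat` and its columns f1 and f2 are made up for illustration, and this is only one way to reproduce the idea, not what dplyr does internally.

```r
# toy data, invented for illustration
dat <- data.frame(
  f1 = c("a", "a", "b", "b", "b"),
  f2 = c("x", "x", "x", "y", "y"),
  stringsAsFactors = FALSE
)

# 1. split: one small data frame per combination of f1 and f2
key    <- paste(dat$f1, dat$f2, sep = "_")
pieces <- split(dat, key)

# 2. apply: compute the group size within each piece
pieces <- lapply(pieces, function(d) {
  d$n <- nrow(d)
  d
})

# 3. combine: stack the pieces back into one data frame
#    (rows come back ordered by group key, not necessarily in the original order)
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```
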

So let me conclude this soap box by saying it is sometimes helpful to start with core R functions because they tend to operate at the most basic level, and can be helpful for understanding problems.

Your issue here: you want a process to remove the first word from each sentence. My question would be, what is your pseudocode? What do you mean by first word? Does it have to be "hello", or can it be any word? How do you operationalize the first word?

Try something like this:

> test <- c("hello my name is Courtney")
> 
> # non-generalizable version - just remove hello
> gsub( "^hello ", "", test )
[1] "my name is Courtney"
> 
> # > args( strsplit )
> # function (x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
> # split everything separated by a space into distinct words
> # this is what "tokenization" does
> #
> x.split.list <- strsplit( test, " " )
> x.split.list
[[1]]
[1] "hello"    "my"       "name"     "is"       "Courtney"
>
>
> # extract the vector from the list 
> x.split <- x.split.list[[ 1 ]]
> new.x <- x.split[ -1 ]  # drop first word
> new.x
[1] "my"       "name"     "is"       "Courtney"
> 
> # combine vector elements back into a single string:
> # when you add collapse as an argument to paste it 
> # mashes all elements of a vector into a single string 
> 
> paste0( new.x, collapse=" " )  
[1] "my name is Courtney"
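
The steps above handle one string at a time. As a rough generalization, assuming "first word" means the first run of non-space characters (whatever it is), a single regular expression can do the same for a whole vector. The helper name drop_first_word is hypothetical, not part of base R or stringr.

```r
# hypothetical helper: remove the first word (and the spaces around it)
# from every element of a character vector
drop_first_word <- function(x) {
  sub("^\\s*\\S+\\s*", "", x, perl = TRUE)
}

sentences <- c("hello my name is Courtney", "goodbye for now", "single")
out <- drop_first_word(sentences)
out
```

A string that contains only one word comes back empty, which may or may not be what you want for the mission statements.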

Note that square brackets in regular expressions are not like putting things in quotes. A bracket expression is a character class: it atomizes the word inside the brackets into individual letters rather than isolating the specific word. Parentheses only group, so "([^hello])" behaves like "[^hello]". So whereas this expression matches the literal word "hello" at the start of the string:

gsub( "^hello ", "", test )

this one:

x.split.list <- strsplit( test, "[hello]" )

would split the text at every h, e, l, or o and return all of the new atomized strings.
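
A small side-by-side comparison may help here. This is only an illustrative sketch using gsub() on the same test string; the variable names literal and char_set are made up.

```r
test <- c("hello my name is Courtney")

# no brackets: the pattern is the literal sequence h-e-l-l-o
literal <- gsub("hello", "", test)
literal

# brackets: a character class, so every h, e, l, or o is removed
char_set <- gsub("[hello]", "", test)
char_set
```
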

castower commented 4 years ago

@lecy thanks so much for the detailed response! It really helped me understand what's going on "behind the scenes". I agree, the tidyverse "masks" a lot of the details when I try to follow exactly what is going on. Thanks again!

@sunaynagoel thanks for the word() tip. I had not tried that function yet, but it's very useful!