Open sunaynagoel opened 4 years ago
@castower Can you send me the file by email please?
Note that in the dataset code01 and codedef01 tell you the subsectors if you want to identify them that way:
> head( dat )
ein orgname
1 311767271 NIA PERFORMING ARTS
2 463091113 THE YOUNG ACTORS GUILD INC
3 824331000 RUTH STAGE INC
4 823821811 STRIPLIGHT COMMUNITY THEATRE INC
5 911738135 NU BLACK ARTS WEST THEATRE
6 824668235 OLIVE BRANCH THEATRICALS INC
mission
1 a community based art organization that inspires, nutures,educates and empower artist and community.
2 we engage and educate children in the various aspect of theatrical productions, through acting, directing, and stage crew. we produce community theater productions for children as well as educational theater camps and workshops.
3 theater performances and performing arts
4
5
6 to produce high-quality theater productions for our local community, guiding performers and audience members to a greater appreciation of creativity through the theatrical arts - while leading with respect, organization, accountability.
code01 codedef01 code02 codedef02 orgpurposecharitable
1 A Arts, Culture, and Humanities A65 Theater 1
2 A Arts, Culture, and Humanities A65 Theater 0
3 A Arts, Culture, and Humanities A65 Theater 1
4 A Arts, Culture, and Humanities A65 Theater 1
5 A Arts, Culture, and Humanities A65 Theater 1
6 A Arts, Culture, and Humanities A65 Theater 0
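A minimal sketch of using those codes to pick out a subsector in base R. The small data frame below only re-types three rows from the preview above as a stand-in; the real dat has more rows and columns, and the object names subsector.counts and theater are just illustrative.

```r
# stand-in for the real dataset: three rows copied from the head( dat ) preview
dat <- data.frame(
  orgname   = c("NIA PERFORMING ARTS", "THE YOUNG ACTORS GUILD INC", "RUTH STAGE INC"),
  code01    = c("A", "A", "A"),
  codedef01 = rep("Arts, Culture, and Humanities", 3),
  code02    = c("A65", "A65", "A65"),
  codedef02 = rep("Theater", 3),
  stringsAsFactors = FALSE
)

# count organizations in each top-level subsector
subsector.counts <- table(dat$codedef01)

# keep only the organizations coded as theater
theater <- dat[dat$code02 == "A65", ]
```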
@lecy thanks! I just sent over my RMD file. I will look into using the codes. -Courtney
@castower Just sent it back. A preview of one of the semantic networks:
Hello all,
So I have been working with the stringr functions a little more and I'm a bit confused about what I'm doing wrong.
I have created the following test data set:
test <- c("hello my name is Courtney")
and I am trying to extract everything after 'hello' so that I can get an output of
my name is Courtney
However, when I run the following:
str_extract_all(test,"(?<=hello )\\S{0,}")
All that I'm getting is:
[[1]]
[1] "my"
Any tips on what I am doing wrong?
Also, as a note, I tried str_split instead and for some reason, it deletes 'my':
code:
test <- c("hello my name is Courtney")
str_split(test,"(?<=hello )\\S{0,}")
Output:
[[1]]
[1] "hello " " name is Courtney"
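One likely explanation, offered as a sketch rather than a definitive diagnosis: \\S matches only non-whitespace characters, so the match stops at the first space after "my". Replacing it with .* lets the match run to the end of the string. The same reasoning would explain the str_split result: the pattern matches "my", and the matched text is treated as the delimiter and dropped. The example uses base R's regmatches() with perl = TRUE so that it runs without extra packages; the same patterns should behave the same way in str_extract(). The names first.word and rest are just illustrative.

```r
test <- c("hello my name is Courtney")

# \\S* stops at the first whitespace, so only one word is captured
first.word <- regmatches(test, regexpr("(?<=hello )\\S*", test, perl = TRUE))
first.word
# [1] "my"

# .* matches any character, so everything after "hello " is captured
rest <- regmatches(test, regexpr("(?<=hello ).*", test, perl = TRUE))
rest
# [1] "my name is Courtney"
```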
@castower I am not sure what is wrong, but your question was interesting enough for me to try it out on my own. When I run this code, it eliminates every 'h', 'e', 'l', and 'o' from the entire string; it is not treating 'hello' as a word. I tried \b as well, but that did not work either.
str_extract_all(test,"([^hello])")
[[1]]
[1] " " "m" "y" " " "n" "a" "m" " " "i" "s" " " "C" "u" "r" "t" "n" "y"
But when I try word(), it works:
word(test, 2,-1)
[1] "my name is Courtney"
It would be nice to write a function that can eliminate the first word in every sentence.
I am going to be honest that when people immediately default to tidyverse packages I feel a little like an old man. Get off my lawn, Hadley Wickham!
I find that tidy packages are really great at scaling operations. Once you know how to do them, then they make it easier and faster to accomplish. They are not always great when you are just learning a new skill in R because they try to be clever and protect you from some of the complicated parts of the code, and they are written in a way that tries to generalize each step at scale. As a result, you lose some of the intuition about what is happening.
For example, group_by( f1, f2 ) %>% mutate( n=n() ) %>% ungroup()
is super easy and efficient to write, but behind the scenes the data is being split into many smaller datasets, variables are summarized on subsets, and then everything is recombined in a way that reconciles all of the dimensions correctly. The actual process is not obvious to the neophyte. I used to have to do all of the steps individually, so now I see how great that code is and how much time it saves.
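To make the hidden steps concrete, here is a rough base R sketch of the same split, summarize, and recombine process. The data frame, the row_id helper column, and the object names are made up for illustration, and this is not what dplyr literally does internally.

```r
# toy data with two grouping factors (hypothetical values)
dat <- data.frame(
  f1 = c("a", "a", "b", "b", "b"),
  f2 = c("x", "x", "x", "y", "y"),
  stringsAsFactors = FALSE
)

# remember the original row order so it can be restored at the end
dat$row_id <- seq_len(nrow(dat))

# 1. split the data into one small data frame per f1-f2 combination
pieces <- split(dat, list(dat$f1, dat$f2), drop = TRUE)

# 2. summarize each piece: attach its row count as a new column n
pieces <- lapply(pieces, function(g) {
  g$n <- nrow(g)
  g
})

# 3. recombine the pieces and put rows back in their original order
combined <- do.call(rbind, pieces)
combined <- combined[order(combined$row_id), ]
rownames(combined) <- NULL
combined$n
# [1] 2 2 1 2 2
```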
So let me conclude this soap box by saying it is sometimes helpful to start with core R functions because they tend to operate at the most basic level, and can be helpful for understanding problems.
Your issue here: you want a process to remove the first word from each sentence. My question would be, what is your pseudocode? What do you mean by first word? Does it have to be "hello", or can it be any word? How do you operationalize the first word?
Try something like this:
> test <- c("hello my name is Courtney")
>
> # non-generalizable version - just remove hello
> gsub( "^hello ", "", test )
[1] "my name is Courtney"
>
> # > args( strsplit )
> # function (x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
> # split everything separated by a space into distinct words
> # this is what "tokenization" does
> #
> x.split.list <- strsplit( test, " " )
> x.split.list
[[1]]
[1] "hello" "my" "name" "is" "Courtney"
>
>
> # extract the vector from the list
> x.split <- x.split.list[[ 1 ]]
> new.x <- x.split[ -1 ] # drop first word
> new.x
[1] "my" "name" "is" "Courtney"
>
> # combine vector elements back into a single string:
> # when you add collapse as an argument to paste it
> # mashes all elements of a vector into a single string
>
> paste0( new.x, collapse=" " )
[1] "my name is Courtney"
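The steps above can be wrapped into a small function that works on a whole vector of sentences. This is only a sketch: the name drop_first_word is made up, and it assumes "first word" means everything before the first single space.

```r
# drop the first space-delimited word from each element of a character vector
drop_first_word <- function(x) {
  # tokenize: one vector of words per sentence
  words <- strsplit(x, " ")
  # drop the first word of each sentence and glue the rest back together
  vapply(words, function(w) paste(w[-1], collapse = " "), character(1))
}

sentences <- c("hello my name is Courtney", "good morning everyone")
drop_first_word(sentences)
# [1] "my name is Courtney" "morning everyone"
```

A sentence with only one word comes back as an empty string, which may or may not be what you want.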
Note that square brackets in regular expressions are not like putting things in quotes. They define a character class, which atomizes the word inside the brackets into individual letters rather than isolating the specific word. So this expression:
x.split.list <- strsplit( test, "[hello]" )
would split all of the text at every h, e, l, or o and return all of the new atomized strings.
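A quick way to see the difference, sketched with gsub() so the result is easy to read: a pattern in square brackets matches any one of the listed letters, while the bare word (with or without parentheses around it) matches only the whole word. The object names are just for illustration.

```r
test <- c("hello my name is Courtney")

# character class: removes every h, e, l, and o anywhere in the string
no.letters <- gsub("[hello]", "", test)
no.letters
# [1] " my nam is Curtny"

# literal word: removes only the exact sequence "hello"
no.word <- gsub("hello", "", test)
no.word
# [1] " my name is Courtney"

# parentheses only group the pattern; they still match the whole word
no.group <- gsub("(hello)", "", test)
no.group
# [1] " my name is Courtney"
```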
@lecy thanks so much for the detailed response! It really helped me understand what's going on "behind the scenes". I agree, the tidyverse "masks" a lot of the details when I try to follow exactly what is going on. Thanks again!
@sunaynagoel thanks for the word() tip. I had not tried that function yet, but it's very useful!
Part 1
3
I think I have identified how many strings have trailing white spaces. I tried to remove them using trimws(). My question is: once I have removed the white spaces, if I run my code to find white spaces again in the Mission field, it should return zero or no matches. But that is not the case.
[1] 3464
Even after running this code, the command returns the same result:
[1] 3464
I am not sure what is going wrong.
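The original code is not shown, so this is only a guess at one common cause: trimws() returns a cleaned copy and does not change the original vector unless the result is assigned back. The toy mission vector and the count names below are made up for illustration. Another thing worth checking is whether the search pattern is anchored with $; an unanchored whitespace pattern would also count the ordinary spaces between words.

```r
# toy stand-in for the mission field (hypothetical values)
mission <- c("theater performances ", "arts education", " community theater  ")

# count strings that end in whitespace
n.before <- sum(grepl("\\s$", mission))
n.before
# [1] 2

# calling trimws() without saving the result leaves mission unchanged
trimws(mission)
n.unsaved <- sum(grepl("\\s$", mission))
n.unsaved
# [1] 2

# assign the result back, then count again
mission <- trimws(mission)
n.after <- sum(grepl("\\s$", mission))
n.after
# [1] 0
```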