DS4PS / cpp-527-spr-2020

Course shell for CPP 527 Foundations of Data Science II for Spring 2020.
http://ds4ps.org/cpp-527-spr-2020/

Loading local data in R #9

Open sunaynagoel opened 4 years ago

sunaynagoel commented 4 years ago

What is a good way to load a local .txt file into R for text analysis when using a working directory?

file.choose() forces me to choose a file every time I run the code.

I also tried read.table() with the full path, but it changes the data in the file, and I am not sure why.

sunaynagoel commented 4 years ago

Another question

How can I capture the output of this code in a new table with two columns (with the headers name and frequency)?

# Top 10 tokens without stemming
tokens %>% dfm( stem=F ) %>% topfeatures( )

The output is

    natural     organic    products      brands ingredients    skincare   fragrance
         22          17           8           6           6           5           5
      might       using        want
          5           4           4

lecy commented 4 years ago

Regarding loading data, you would save the file to a specific directory, then set that as your working directory and use readLines() to read in text files.

setwd( "C:/Users/Documents/TextAnalysis" )  # wherever your file is located
x <- readLines( "dear_john_letter_1.txt", warn=FALSE )
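As a side note on the read.table() question: a minimal sketch, assuming the file is plain prose rather than tabular data. One likely reason the text looks changed is that read.table() parses each line into whitespace-separated fields, while readLines() keeps each line as a single string. readLines() also accepts a full path, so setting the working directory is optional. The file name and contents below are made up for illustration.

```r
# Hypothetical example file, written to a temp folder so the sketch is self-contained
path <- file.path( tempdir(), "example_letter.txt" )
writeLines( c( "Dear John,", "I am writing to say goodbye." ), path )

# readLines() takes a full path and returns one string per line, text untouched
x <- readLines( path, warn=FALSE )
x

# read.table() treats the file as a table and splits each line on whitespace;
# fill=TRUE pads the shorter first line so the call does not stop with an error
d <- read.table( path, fill=TRUE, stringsAsFactors=FALSE )
d
```

Here x is a character vector with one element per line, while d is a data frame with one word per cell, which is usually not what you want for text analysis.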

lecy commented 4 years ago

The output of topfeatures() is a named numeric vector. You can convert it to a data frame in a few ways.

You can create a new data frame and assign the vector names as one column and the vector values as another:

tf <- tokens %>% dfm( stem=F ) %>% topfeatures( ) 
data.frame( name=names(tf), freq=tf, row.names=NULL )
           name freq
1       provide  251
2     community  209
3       support  156
4       mission  144
5     education  142
6         youth  125
7  organization  117
8   educational  114
9      children  104
10       school  100

Since tables convert nicely to data frames you can also double-cast the numeric vector:

as.data.frame( as.table( tf ) )
           Var1 Freq
1       provide  251
2     community  209
3       support  156
4       mission  144
5     education  142
6         youth  125
7  organization  117
8   educational  114
9      children  104
10       school  100
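To try both conversions without loading a corpus, here is a small sketch that rebuilds the named vector by hand from the topfeatures() output posted in the question, then applies the headers that were asked for (name, frequency). The object names tf, d1, and d2 are only placeholders, and setNames() is one of several ways to rename the default Var1 and Freq columns.

```r
# Values copied from the topfeatures() output posted earlier in this thread
tf <- c( natural=22, organic=17, products=8, brands=6, ingredients=6,
         skincare=5, fragrance=5, might=5, using=4, want=4 )

# Option 1: build the data frame directly with the requested headers
d1 <- data.frame( name=names(tf), frequency=unname(tf), row.names=NULL )

# Option 2: cast to a table first, then rename the default Var1 / Freq columns
d2 <- setNames( as.data.frame( as.table( tf ) ), c( "name", "frequency" ) )

d1
d2
```

Both versions hold the same ten rows. One difference to keep in mind is that as.data.frame() on a table stores the names as a factor, so wrap the column in as.character() if you need plain strings.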

Note that topfeatures() returns the top ten features by default. You can ask for as many as you would like:

tf <- tokens %>% dfm( stem=F ) %>% topfeatures( n=100 )

sunaynagoel commented 4 years ago

@lecy thank you so much. These really help a lot.