WFU-TLC / flc_discussion_board

A repository for discussing questions and issues in the Data Analysis with R (FLC)
https://wfu-tlc.github.io/

Best way to open and read a folder of .txt files #12

smartjw opened this issue 5 years ago

smartjw commented 5 years ago

Hello everyone,

I'm wondering if you have a preferred way to open and read multiple text files (without combining them into a single document). I've tried a few things with the readtext package and the koRpus package, but I can't get either to work correctly. Any suggestions for tried and true methods?

medewitt commented 5 years ago

Yep, @smartjw !

First, put all of the documents into a common folder. This method works when you have a bunch of data that is roughly structured and you want to read it all in and then do a row combine (basically, each file that is read in gets tacked onto the bottom of the previous one).

library(tidyverse)

# Look for all of the files in the "texts_folder" folder that have "txt" in the name
files_to_get <- list.files(path = "texts_folder", pattern = "txt", full.names = TRUE)

# Read each file and row-bind the results into a single data frame
combined_files <- map_dfr(files_to_get, read_delim, delim = ",")

The other way, if you are trying to combine a bunch of raw text (where everything truly ends up in a single object), is what I have done below:

library(readr)

files <- list.files("my_folder",
                    full.names = TRUE, pattern = ".txt")

# Start with an empty character vector and append the lines of each file
a <- character()
for (i in seq_along(files)) {
  a <- c(a, read_lines(file = files[[i]]))
}

Let me know which direction you are going...

francojc commented 5 years ago

My go-to is actually readtext. The readtext() function needs some unpacking, but it is great for reading multiple files (of various types, including plain/running text) and organizing them by file into a data.frame, where each document is identified in the doc_id column of the resulting dataset.
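A minimal sketch of what that can look like, assuming the files live in a folder called texts_folder (the folder name and the file-name pattern "author_year.txt" are placeholders, not from the original discussion):

```r
library(readtext)

# Read every .txt file in the folder; each file becomes one row,
# with the file name in doc_id and the full contents in text
txts <- readtext("texts_folder/*.txt")

# If the file names encode metadata (e.g. "author_year.txt"),
# readtext can split them into document variables
txts_meta <- readtext("texts_folder/*.txt",
                      docvarsfrom = "filenames",
                      dvsep = "_",
                      docvarnames = c("author", "year"))
```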

The readtext() approach works well when the appropriate metadata is in the file names, and only the file names. If the metadata is in a file header, or the document is an .xml file (or another hierarchically structured format), you may need to take another avenue. In these cases I usually work out an approach for reading and processing a single file, wrap that approach in a function, and then use the purrr package's map() function to apply it to each of the files I want to read and process into R. See the sketch below.
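A rough sketch of that single-file-then-map workflow, under the assumption that each file starts with a one-line header followed by the text (read_one_file, the folder name, and the header format are hypothetical; the real parsing depends on how your files are structured):

```r
library(tidyverse)

# Hypothetical reader for a single file whose first line is a metadata
# header and whose remaining lines are the text itself
read_one_file <- function(path) {
  lines <- read_lines(path)
  tibble(
    doc_id = basename(path),
    header = lines[1],
    text   = paste(lines[-1], collapse = "\n")
  )
}

files <- list.files("texts_folder", pattern = "\\.txt$", full.names = TRUE)

# Apply the single-file function to every file and row-bind the results
corpus <- map_dfr(files, read_one_file)
```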

I provide descriptions and examples of both cases in a blog post: Curate language data (1/2): organizing meta-data.