USGS-R / hypeRusgs

Creating some R hype by showing off advanced techniques applied to USGS science.
3 stars 5 forks source link

Writing data files #8

Closed mlperickson closed 6 years ago

mlperickson commented 6 years ago

Hello -- I'd appreciate your recommendations for writing files from R to maintain the structure of the columns. I've been using 'write.csv' -- but the SITEIDs get messed up. Thanks.

ldecicco-USGS commented 6 years ago

There are a few things to think about. Here's some sample data:

x <- data.frame(site = c("09499000","09499001"),
                date = as.Date(c("2018-01-01","2018-01-02")),
                value = c(1,3),
                stringsAsFactors = FALSE)

write.csv(x, "base.csv", row.names = FALSE)

If you open "base.csv" in a text editor, the leading 0's are in the file. If you open "base.csv" in Excel....they're gone! If you don't "save changes?" when you close, you don't lose them, but if you do...they're just gone.

The first is a non-R issue...

  1. Don't open and edit your csv in Excel. Using write.csv (and a few other options that I'll mention), if you open the file in a text editor (Notepad++ is my favorite), the leading 0's are in the file. If you open the csv in Excel, Excel will strip those 0's. If you save the csv then from Excel..they are gone.
    1. Maybe Excel can't be avoided...then you need to get a single ' to lead the site number. It's a total quirk of Excel, and as far as I know, there's not a nice easy argument in any of the file writing functions to do that. But...you can do it like this:
      x$site <- paste0("'",x$site)
      write.csv(x, "base.csv", row.names = FALSE)

      Now when you open the csv in Excel, the column will be a string, and the leading zero's will be there.

Other options... If you've jumped into R, you could save files with an R binary format. This will have a much smaller memory footprint, and maintains the column types (so, if you've done work to make dates right, maybe fancy factors or something...those will be the same when you open the "RDS" file:

saveRDS(x, "test.rds")
y_open <- readRDS("test.rds")

There are other file formats that might be worth exploring if your files are really big (fst, feather)

If your files are pretty big....and you want to stay in the "csv" world, there are 2 exceptionally faster and more robust options:

library(readr)
write_csv(y, "testreadr.csv")
y_readr <- read_csv("testreadr.csv")

(notice it's write_csv not write.csv). The readr package does speed up the read and write time considerably. It's also pretty good at picking out what the columns should be (dates, numeric, etc). By default, strings are imported as strings, not factors.

The other option is:

library(data.table)
fwrite(y, "test.csv")
y_dt <- setDF(fread("test.csv"))

The fwrite and fread functions from the data.table package are easily the FASTEST functions in this discussion. So, if you find yourself waiting more than 10 seconds to write the file, you could switch over to these functions and have that down to milliseconds. fread is also quite good at getting the correct column types. By default, this function imports the data into a data.table object which is different than a data.frame. That is why you would need to wrap the fread in the setDT...and also why we don't explicitly teach it in the Intro to R classes.

ldecicco-USGS commented 6 years ago

My goodness, I forgot another obvious answer. If you have to share files that are going to be opened and maybe edited in Excel, you can also save the file as an Excel file with the openxlsx package:

library(openxlsx)
x <- data.frame(site = c("09499000","09499001"),
                date = as.Date(c("2018-01-01","2018-01-02")),
                value = c(1,3),
                stringsAsFactors = FALSE)
write.xlsx(x, file = "test.xlsx")

This will automatically set string columns as strings.

ldecicco-USGS commented 6 years ago

So, to open an Excel file:

library(readxl)
y <- read_xlsx(path = "test.xlsx")
ldecicco-USGS commented 6 years ago

Covered in the 8/29 lesson