edquant / edh7916

Course materials and website for EDH7916: Contemporary Research in Higher Education
https://edquant.github.io/edh7916/
3 stars 1 forks source link

Only reading some columns when reading in data file #59

Open btskinner opened 1 year ago

btskinner commented 1 year ago

Some of you are using very large data files for your project (I love this!), but they can be very difficult, slow, or — if your computer memory is too small — impossible to read in. Luckily, you can select columns while reading the file, meaning you don't have to read in the entire file. Here are three methods.

CSV file

Let's say that from hsls_small.csv, we only want the unique ID and the three round 4 (x4) variables. Here are three versions that give the same output. Each uses the argument, col_select(), within read_csv() to select those columns. The three versions:

  1. Selects each column directly by name using c()
  2. Stores the vector of column names in a separate object before read_csv() and then uses that object name with the any_of() helper. This is nice if you want to (re)use a vector of column names.
  3. Uses a tidy-select function to shorten the call, in this case starts_with() since all round 4 vars start with x4

You don't need to use them all — this is just to show a variety of ways of accomplishing the same goal.

## library
library(tidyverse)

## path to data (assumes were in the scripts directory like in class)
dat_dir <- file.path("..", "data")

## (1) select names directly and individually
df_1 <- read_csv(file.path(dat_dir, "hsls_small.csv"),
               col_select = c("stu_id",
                              "x4hscompstat",
                              "x4evratndclg",
                              "x4hs2psmos"))
#> Rows: 23503 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (4): stu_id, x4hscompstat, x4evratndclg, x4hs2psmos
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## (2) select names using object and all_of() helper
selected_cols <- c("stu_id", "x4hscompstat", "x4evratndclg", "x4hs2psmos")
df_2 <- read_csv(file.path(dat_dir, "hsls_small.csv"),
                 col_select = all_of(selected_cols))
#> Rows: 23503 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (4): stu_id, x4hscompstat, x4evratndclg, x4hs2psmos
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## (3) select names using tidyr helpers
df_3 <- read_csv(file.path(dat_dir, "hsls_small.csv"),
                 col_select = c("stu_id", starts_with("x4")))
#> Rows: 23503 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (4): stu_id, x4hscompstat, x4evratndclg, x4hs2psmos
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## show they are the same
identical(df_1, df_2)
#> [1] TRUE
identical(df_2, df_3)
#> [1] TRUE

## show
df_1
#> # A tibble: 23,503 × 4
#>    stu_id x4hscompstat x4evratndclg x4hs2psmos
#>     <dbl>        <dbl>        <dbl>      <dbl>
#>  1  10001            1            1          3
#>  2  10002            1            1          3
#>  3  10003            1            1          4
#>  4  10004            1            0         -7
#>  5  10005            1            0         -7
#>  6  10006           -8           -8         -8
#>  7  10007            1            1          2
#>  8  10008            1            1          3
#>  9  10009            1            1          8
#> 10  10010           -8           -8         -8
#> # … with 23,493 more rows

Created on 2023-03-08 with reprex v2.0.2

Stata file

You can use the same setup to read in selected columns from a Stata (dta) file (similarly with SPSS files). The only difference from above is that we use the haven library and read_dta() to read in hsls_small.dta.

## library
library(haven)

## path to data (assumes were in the scripts directory like in class)
dat_dir <- file.path("..", "data")

## (1) select names directly and individually
df_1 <- read_dta(file.path(dat_dir, "hsls_small.dta"),
                 col_select = c("stu_id",
                                "x4hscompstat",
                                "x4evratndclg",
                                "x4hs2psmos"))

## (2) select names using object and all_of() helper
selected_cols <- c("stu_id", "x4hscompstat", "x4evratndclg", "x4hs2psmos")
df_2 <- read_dta(file.path(dat_dir, "hsls_small.dta"),
                 col_select = all_of(selected_cols))

## (3) select names using tidyr helpers
df_3 <- read_dta(file.path(dat_dir, "hsls_small.dta"),
                 col_select = c("stu_id", starts_with("x4")))

## show they are the same
identical(df_1, df_2)
#> [1] TRUE
identical(df_2, df_3)
#> [1] TRUE

## show
df_1
#> # A tibble: 23,503 × 4
#>    stu_id x4hscompstat x4evratndclg x4hs2psmos
#>    <chr>  <dbl+lbl>    <dbl+lbl>    <dbl+lbl> 
#>  1 10001   1 [Diploma]  1 [Yes]      3        
#>  2 10002   1 [Diploma]  1 [Yes]      3        
#>  3 10003   1 [Diploma]  1 [Yes]      4        
#>  4 10004   1 [Diploma]  0 [No]      NA        
#>  5 10005   1 [Diploma]  0 [No]      NA        
#>  6 10006  NA           NA           NA        
#>  7 10007   1 [Diploma]  1 [Yes]      2        
#>  8 10008   1 [Diploma]  1 [Yes]      3        
#>  9 10009   1 [Diploma]  1 [Yes]      8        
#> 10 10010  NA           NA           NA        
#> # … with 23,493 more rows

Created on 2023-03-08 with reprex v2.0.2

Very large CSV file

If you have a really big CSV file, you might benefit from using the data.table library. It has its own system of data manipulation like tidyverse, but can be much faster in some ways. Since we're focused on tidyverse, we'll only use the fread() function and then covert the data.table object to a tidyverse tibble at the end. Notice that while the arguments are different, the basic set up for selecting only some columns is the same.

## library
library(tidyverse)
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:lubridate':
#> 
#>     hour, isoweek, mday, minute, month, quarter, second, wday, week,
#>     yday, year
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose

## path to data (assumes were in the scripts directory like in class)
dat_dir <- file.path("..", "data")

## select names directly and individually
df <- fread(file.path(dat_dir, "hsls_small.csv"),
            select = c("stu_id",
                       "x4hscompstat",
                       "x4evratndclg",
                       "x4hs2psmos"))

df
#>        stu_id x4hscompstat x4evratndclg x4hs2psmos
#>     1:  10001            1            1          3
#>     2:  10002            1            1          3
#>     3:  10003            1            1          4
#>     4:  10004            1            0         -7
#>     5:  10005            1            0         -7
#>    ---                                            
#> 23499:  35202           -8           -8         -8
#> 23500:  35203            1            0         -7
#> 23501:  35204            1            1          3
#> 23502:  35205            1            1         10
#> 23503:  35206            1            1          2

## convert to tibble (so we can use tidyverse)
df <- df %>% as_tibble()

df
#> # A tibble: 23,503 × 4
#>    stu_id x4hscompstat x4evratndclg x4hs2psmos
#>     <int>        <int>        <int>      <int>
#>  1  10001            1            1          3
#>  2  10002            1            1          3
#>  3  10003            1            1          4
#>  4  10004            1            0         -7
#>  5  10005            1            0         -7
#>  6  10006           -8           -8         -8
#>  7  10007            1            1          2
#>  8  10008            1            1          3
#>  9  10009            1            1          8
#> 10  10010           -8           -8         -8
#> # … with 23,493 more rows

Created on 2023-03-08 with reprex v2.0.2

btskinner commented 1 year ago

FYI: @edquant/team_everyone