NEONScience / NEON-utilities

Utilities and scripts for working with NEON data. Currently: an R package with functions to join (stack) the month-by-site files in downloaded NEON data, to convert data to geoCSV format, and to download data from the API.
GNU Affero General Public License v3.0

Reduce disk space requirement for eddy covariance download #131

Open · s-kganz opened this issue 5 months ago

s-kganz commented 5 months ago

Is your feature request related to a problem? Please describe.

Requesting several seasons of eddy covariance data requires a large amount of storage because all levels of the data product are bundled together. In my case, I only work with the level 4 products: downloading the raw data takes about 60 GB of disk space before stacking, but after running stackEddy I am left with a 33 MB table containing the net surface-atmosphere exchange (NSAE) data and QC flags I care about.

This discourages reproducibility because 1) downloading takes a long time, 2) it is antisocial to download tens of GB to a collaborator's machine, and 3) it encourages hosting a processed data table outside of NEON to get around (1) and (2).
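For scale, the size is visible before committing to the download: with check.size = TRUE (the default), zipsByProduct() reports the total download size and asks for confirmation. A minimal illustration (the date range here is just an example):

library(neonUtilities)

# Prompts with the estimated download size before fetching anything
zipsByProduct(
  "DP4.00200.001",
  site = "WREF",
  startdate = "2018-10",
  enddate = "2019-09",
  savepath = tempdir(),
  check.size = TRUE
)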

Describe the solution you'd like

The ideal solution would be for users to download eddy covariance data directly in FLUXNET format. This partially exists already on the Ameriflux data portal, but many sites don't have any FLUXNET-formatted data there. That happens to include my main study site (WREF), so here I am (it also means I have to run REddyProc myself, potentially with different settings than the site managers would prefer).
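To sketch what that would buy: a FLUXNET-formatted product is a single half-hourly CSV, so the entire workflow below would collapse to one read. The filename and columns here are illustrative of the FLUXNET2015 convention (assuming US-xWR as WREF's Ameriflux ID), not an actual published file:

library(dplyr)

# Hypothetical FLUXNET-style file; TIMESTAMP_* and the flux columns
# follow the FLUXNET2015 naming convention
flux <- read.csv("AMF_US-xWR_FLUXNET_SUBSET_HH_2019-2019_1-1.csv") %>%
  select(TIMESTAMP_START, TIMESTAMP_END, NEE_VUT_REF, LE_F_MDS)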

Another option is to download only the desired data level, but I imagine this would require backend changes to the API that are not feasible.

A third option is to modify the zipsByProduct -> stackEddy workflow to operate one site-month at a time instead of processing all site-months together as done in this tutorial. This works, but deleting files is error-prone (unlink doesn't even raise a warning if it fails; see the sketch below) and you still have to wait for the full 60 GB to download.
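Because unlink() fails silently, a defensive delete has to verify removal itself. The reprex below does this with stopifnot(); the same idea as a small (hypothetical) helper:

# unlink() returns quietly even when deletion fails, so check afterwards
safe_unlink <- function(path) {
  unlink(path, recursive = TRUE)
  if (dir.exists(path)) stop("failed to delete ", path)
  invisible(TRUE)
}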

Describe alternatives you've considered

Right now I'm running zipsByProduct and stackEddy one site-month at a time, deleting any intermediate products along the way so that only ~250 MB of disk space is needed at any one time. A brief reprex:

library(neonUtilities)
library(foreach)
library(dplyr)

tdir <- tempdir()
fpath <- file.path(tdir, "filesToStack00200")  # created by zipsByProduct()

# Five site-months of H2O/CO2 NSAE: 2019-05 through 2019-09
site_mos <- paste0("2019-0", seq(5, 9))

vars <- c(
  "timeBgn", "timeEnd",
  "data.fluxCo2.nsae.flux",
  "qfqm.fluxCo2.nsae.qfFinl",
  "data.fluxH2o.nsae.flux",
  "qfqm.fluxH2o.nsae.qfFinl"
)

wref_nsae <- foreach(sm = site_mos, .combine = rbind) %do% {
  # Download one site-month of the flux bundle (~250 MB)
  zipsByProduct(
    "DP4.00200.001",
    site = "WREF",
    startdate = sm,
    enddate = sm,
    savepath = tdir,
    check.size = FALSE
  )

  # Stack and keep only the columns of interest
  myeddy <- stackEddy(fpath)[["WREF"]] %>%
    select(all_of(vars))

  # Delete the raw files; unlink() fails silently, so verify the delete
  unlink(fpath, recursive = TRUE)
  stopifnot(!dir.exists(fpath))

  myeddy  # the last expression is this iteration's result
}

With my machine/internet this takes about 2 hours to download all the flux data I work with.

Additional context

I think this package is filling a really important role in the research community. I'd love to be able to write a paper and link a script that runs the entire analysis, all the way through to generating the figures that appear in the manuscript. Having more flexibility in how flux data are downloaded would make this goal much more achievable.

cklunch commented 4 months ago

@s-kganz Thanks for your suggestions! As you noted, this is a challenge rooted in the way the eddy covariance files are stored, and there are limited options within neonUtilities itself. For your use case, I think your script for iterating over the files to be downloaded and deleting as you go is the best option available. Also, keep an eye on Ameriflux for reformatted files to appear there.

And I do expect that eventually we'll have more options for working with the H5 files, or alternative file formats, but at this point I can't give an estimated timeline; we're still in the exploration phase. We've been experimenting with cloud-based methods for working with H5 files, which would avoid the download entirely, and we've talked about possible file format alternatives. I'll post updates here, but it may be a while.
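In the meantime, if you ever want to pull just a level-4 table out of an individual downloaded H5 file without stackEddy, rhdf5 can read a single group directly. A minimal sketch, assuming the /SITE/dp04/... group layout of the bundled files (the filename is hypothetical, and exact group paths can vary by release):

library(rhdf5)

# Read only the level-4 CO2 NSAE table from one site-month file
f <- "NEON.D16.WREF.DP4.00200.001.nsae.2019-05.basic.h5"
co2_nsae <- h5read(f, "/WREF/dp04/data/fluxCo2/nsae")
head(co2_nsae)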

s-kganz commented 4 months ago

Thanks for your comments @cklunch! I'm glad this is on your radar, and I look forward to hearing any updates.