cboettig / neonstore

:package: A local content-based storage system for NEON data
https://cboettig.github.io/neonstore

Excluding (or selecting multiple) tables in neon_download that do not fit a single regex expression #56

Closed rfiorella closed 2 years ago

rfiorella commented 2 years ago

Is it currently possible to exclude a table in neon_download? (Or select multiple tables in a single neon_download call?) A quick parsing of download_filters suggests this will not work without modifying and rebuilding neonstore, at least not in a single call to neon_download.

A potential use case would be excluding the 1 or 2 minute data tables for TIS data (temperature, RH, etc.) but retaining the 30 minute, EML, readme, sensor_position, and variables tables. Selecting only the 30 minute table using the table argument leads to issue stacking the data later on.

cboettig commented 2 years ago

Thanks for getting in touch. It should be possible to select only the tables (and metadata files) you want with the table regex. Consider this reprex:

``` r
library(neonstore)
neondir <- tempdir()
Sys.setenv("NEONSTORE_HOME"=neondir)
neon_download("DP1.00098.001", table = "sensor_position", site = "BART", start_date = "2022-01-01")
#>   comparing hashes against local file index...
#>   updating release manifest...
neon_download("DP1.00098.001", table = "readme", site = "BART", start_date = "2022-01-01")
#>   comparing hashes against local file index...
#>   updating release manifest...
neon_download("DP1.00098.001", table = "variable", site = "BART", start_date = "2022-01-01")
#>   comparing hashes against local file index...
#>   updating release manifest...
neon_download("DP1.00098.001", table = "*.xml", site = "BART", start_date = "2022-01-01")
#>   comparing hashes against local file index...
#>   updating release manifest...
neon_download("DP1.00098.001", table = "30min", site = "BART", start_date = "2022-01-01")
#>   comparing hashes against local file index...
#>   updating release manifest...

neon_index()
#> # A tibble: 6 × 15
#>   product     site  table type  ext   month timestamp           horizontalPosit…
#>   <chr>       <chr> <chr> <chr> <chr> <chr> <dttm>                         <dbl>
#> 1 DP1.00098.… BART  RH_3… basic csv   2022… 2022-02-02 20:41:09                0
#> 2 DP1.00098.… BART  RH_3… basic csv   2022… 2022-02-02 20:41:09                3
#> 3 DP1.00098.… BART  EML   <NA>  xml   <NA>  2022-02-02 20:41:09               NA
#> 4 DP1.00098.… BART  read… <NA>  txt   <NA>  2022-02-02 20:41:09               NA
#> 5 DP1.00098.… BART  sens… <NA>  csv   <NA>  2022-02-02 20:41:09               NA
#> 6 DP1.00098.… BART  vari… <NA>  csv   <NA>  2022-02-02 20:41:09               NA
#> # … with 7 more variables: verticalPosition <dbl>, samplingInterval <chr>,
#> #   date_range <chr>, path <chr>, md5 <chr>, crc32 <chr>, release <chr>
```

Created on 2022-02-10 by the reprex package (v2.0.1)

I think you already knew that, so apologies if you were asking something else. Perhaps you want to negate the regex to avoid those multiple calls to neon_download()? I could see that might be helpful in this case, though generally it may be better for the user to be explicit?

I'm not entirely sure I follow what "Selecting only the 30 minute table using the table argument leads to issue stacking the data later on." means. Can you provide a reprex showing what you see vs what you expect to see? You may need to be more explicit in your neon_read() or neon_store() commands to specify the table regex there...

e.g. in the above example, I might do something like neon_read("RH_30min-basic", site="BART"). That should not create trouble even if you have downloaded the 1 minute tables or have not downloaded the sensor positions or other metadata tables, I think? Am I overlooking something?
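To make the regex question concrete, here is a minimal base-R sketch of the two options discussed above: one alternation pattern that selects several tables at once, and a negated match that drops the unwanted table. The file names are hypothetical stand-ins, not real NEON file names, and whether the `table` argument of neon_download() accepts an alternation pattern like this is an assumption that this thread does not confirm.

``` r
# Hypothetical file names, loosely modelled on the tables named in this
# thread; these are NOT real NEON file names.
files <- c(
  "RH_1min.basic.csv",
  "RH_30min.basic.csv",
  "EML.xml",
  "readme.txt",
  "sensor_positions.csv",
  "variables.csv"
)

# Option 1: a single pattern that selects several tables via alternation.
keep_pattern <- "30min|sensor_position|readme|variables|\\.xml$"
selected <- files[grepl(keep_pattern, files)]

# Option 2: negate a match to drop the unwanted table instead.
drop_pattern <- "1min"
not_dropped <- files[!grepl(drop_pattern, files)]

print(selected)
print(not_dropped)
```

For this toy list both options keep the same five files. The explicit pattern requires knowing every table name up front, while the negated form silently keeps anything new that does not match `drop_pattern`, which is the trade-off around explicitness raised above.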

rfiorella commented 2 years ago

Hi Carl -

Here's the 'stacking' issue I mentioned:

``` r
library(neonstore)
library(neonUtilities)

neondir <- tempdir()
Sys.setenv("NEONSTORE_HOME"=neondir)

neon_download("DP1.00098.001", table = "30min", site = "WREF", start_date = "2021-01-01")
#>   comparing hashes against local file index...
#>   updating release manifest...

rh_data <- stackFromStore(neondir, dpID = "DP1.00098.001", site = "WREF", startdate = "2021-01")
#> Error in stackFromStore(neondir, dpID = "DP1.00098.001", site = "WREF", : Variables file not found; required for stacking. Re-download data, or download additional data, to get variables file.
```

Created on 2022-02-11 by the reprex package (v2.0.1)

As you noted, this issue can be resolved through multiple calls to neon_download() instead of trying to do it all in a single call as I had been.

The reason I had been looking for a way to negate the regex is that retrieving the correct tables in this workflow requires the user to know what all the tables are beforehand (unless I'm missing some other function here?). To your point regarding explicitness though, negation might allow for more unexpected behavior than requiring a user to use multiple neon_download() calls.

cboettig commented 2 years ago

@rfiorella thanks for the follow up! I see you are using stackFromStore from neonUtilities here instead of neonstore::neon_read(). As you see, stackFromStore attempts to stack all of the tables in the data product (including the 1min tables; it will warn but not error if it cannot find them).

``` r
library(dplyr) # provides %>% and as_tibble()
rh_data <- stackFromStore(neondir, dpID = "DP1.00098.001", site = "WREF", startdate = "2021-01")
rh_30min_from_utilities <- rh_data$RH_30min %>% as_tibble()
```

By contrast neonstore does not involve first reading in all available tables and then subsetting them. Instead, the user reads in the desired table:

``` r
rh_30min <- neon_read(product = "DP1.00098.001", table = "30min", site = "WREF", start_date = "2021-01-01")
```

This is nearly identical to the table from neonUtilities. It lacks the column named "release" (which is "undetermined" in neonUtilities) and has the additional column "file", which can be helpful in preserving a provenance record noting precisely which raw data object any given row of data was read from. Also, this approach means that if you happen to have downloaded the 1min data as well, it is not automatically read in, which can add considerably to runtime and memory use (not trivial issues given how large NEON data can get when working across the full time & space range!).
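One way to check a column difference like this yourself is to compare the column names of the two results with `setdiff()`. The sketch below uses toy data frames as stand-ins for the two tables. Apart from `release` and `file`, which are named above, the column names and all values are hypothetical.

``` r
# Toy stand-ins for the two results. Column names other than `release`
# and `file`, and all values, are hypothetical.
from_utilities <- data.frame(
  startDateTime = "2021-01-01",
  RHMean = 80,
  release = "undetermined"
)
from_neonstore <- data.frame(
  startDateTime = "2021-01-01",
  RHMean = 80,
  file = "hypothetical-file.csv"
)

# Columns present in one result but not the other.
only_in_utilities <- setdiff(names(from_utilities), names(from_neonstore))
only_in_neonstore <- setdiff(names(from_neonstore), names(from_utilities))

print(only_in_utilities)
print(only_in_neonstore)
```

With real data, the same two `setdiff()` calls could be run on `rh_30min_from_utilities` and `rh_30min`.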

Your point about knowing what table to read is a good one. If you stick with neonstore functions, there's no need to download those additional tables unless you are analyzing the data in them. If your code isn't ever calling rh_data$variables_00098 or rh_data$readme_00098, then your code is not touching those tables and it did not need to download or read them in. From a purely computational perspective, that was just making things slower and using more RAM.

However, that does not mean these tables are unimportant from a scientific standpoint. neonUtilities reads these in automatically as a way of making these tables a bit more visible to the user (I believe). Still, it is always the scientist's job to figure out what, if anything, to do with that information. Even when you've read in all these tables, it can be pretty hard to figure out what you're actually looking at anyway.

My take with neonstore is that the long-form documentation (html & pdf pages on the NEON website) is really the best entrypoint for learning your way around a data product. Otherwise, the README is a good place to start. For some reason, neonUtilities treats READMEs like tables, but they aren't tabular data at all; they are just plain text for humans.

In neonstore, you can use neon_index() to work with individual files you have downloaded. Let's take a peek at the first README for this data product by merely opening it up in our local editor (e.g. opening the file in RStudio):

``` r
index <- neon_index(product = "DP1.00098.001", table = "readme", site = "WREF")
file.edit(index$path[[1]]) # open the first README in the local editor
```

Now we can read the readme as a plain text file :-), which tells us a bit more about the different tables.
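Because the README is plain text, it can also be read programmatically with base R instead of being opened in an editor. The sketch below is self-contained: it writes a made-up README to a temporary file as a stand-in for `index$path[[1]]`. The `Table:` line format is purely hypothetical and says nothing about how real NEON READMEs are laid out.

``` r
# Stand-in for index$path[[1]]: a temporary file holding made-up text.
# The "Table:" line format is hypothetical, not the real README layout.
readme_path <- tempfile(fileext = ".txt")
writeLines(
  c("Hypothetical README", "Table: RH_30min", "Table: RH_1min"),
  readme_path
)

# Read the file as plain text and keep only the lines of interest.
readme_lines <- readLines(readme_path)
table_lines <- grep("^Table:", readme_lines, value = TRUE)

print(table_lines)
```

With a real store, `readme_path` would be replaced by `index$path[[1]]` and the `grep()` pattern adjusted to whatever the actual README contains.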

rfiorella commented 2 years ago

@cboettig Ah, makes sense - thanks for the fantastic explanation of the differences between neon_read() and neonUtilities::stackFromStore(). I hadn't fully appreciated how they interact differently with the local store.