hcorson-dosch-usgs opened this issue 2 years ago
Issue submitted here asking if it would be within scope of `write_timeseries_dsg()` to add an option for a 3rd netCDF dimension (in our case, depth beneath the lake surface).

Starting to think about this again.
Okay, I regrouped w/ @jread-usgs on this, as it is again a priority. Here are some notes:

- Working locally w/ a subset of 1-5 lakes... `write_timeseries_dsg()` (2D)
- Then, if all is working, test w/ all MN predictions
Ok - @lindsayplatt , @jread-usgs. I wanted to provide an update here of my progress before I'm out for two weeks. This has been a back-burner item for some months now, but I did make some significant progress when other sprint tasks were completed or blocked.
For both the GCM and NLDAS predictions, I've completed steps 1 - 4, 6, and 7 (see previous comment). I skipped step 5 b/c it was immediately apparent that the file size was going to be too large without some reduction in the resolution of predictions at depth.
All of my code is in this branch on my fork.
Here's a summary:
The code generates 3D netCDF files. The ice flags are stored as a 2D `TimeSeries` variable (with dimensions of `site_id` and `time`), while the temperature predictions are stored as a 3D `TimeSeriesProfile` variable (with dimensions of `site_id`, `time`, and `depth`).
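For anyone picking this up later, here's a minimal, standalone sketch of that dimension layout written directly with the `ncdf4` package (not the `ncdfgeom` code in my branch); all names, sizes, and values below are illustrative:

```r
library(ncdf4)

# Dimensions: site (index only), time, and depth
site_dim  <- ncdim_def("site_id", units = "", vals = 1:3, create_dimvar = FALSE)
time_dim  <- ncdim_def("time", units = "days since 1980-01-01", vals = 0:9)
depth_dim <- ncdim_def("depth", units = "m", vals = c(0, 0.5, 1, 2, 5))

# 2D TimeSeries-style variable: ice flag per site and time
ice_var  <- ncvar_def("ice", units = "", dim = list(site_dim, time_dim),
                      missval = -999, prec = "integer")

# 3D TimeSeriesProfile-style variable: temperature per site, time, and depth
temp_var <- ncvar_def("temp", units = "degrees C",
                      dim = list(site_dim, time_dim, depth_dim),
                      missval = 1e30, prec = "float")

nc <- nc_create("example_3d.nc", vars = list(ice_var, temp_var))
ncvar_put(nc, ice_var, matrix(0L, nrow = 3, ncol = 10))
ncvar_put(nc, temp_var, array(rnorm(3 * 10 * 5, mean = 15), dim = c(3, 10, 5)))
nc_close(nc)
```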
The GLM output predictions are at 0.5m intervals. If we store all predictions at all depths, the netCDF depth dimension becomes very long, and we store many many NA values for shallow lakes. Currently I am reducing the resolution of predictions at depth prior to packaging the predictions in the netCDF. For example, for the NLDAS netCDF, the depths are defined (in a somewhat hacky way for now) here, based on Andy's depths. The predictions are then subset to those depths here.
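As a rough illustration of that subsetting step (this is not the pipeline code; the object names, column names, and depth set below are made up):

```r
library(dplyr)

# Fake GLM-style output at 0.5 m intervals (illustrative only)
pred_df <- expand.grid(
  site_id = c("nhdhr_1", "nhdhr_2"),
  date    = as.Date("2021-06-01") + 0:2,
  depth   = seq(0, 10, by = 0.5)
)
pred_df$temp <- runif(nrow(pred_df), min = 4, max = 25)

# Restricted set of export depths (m); the depths actually used in the pipeline differ
export_depths <- c(0, 0.5, 1, 2, 3, 5, 7, 10)

pred_subset <- filter(pred_df, depth %in% export_depths)
```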
I went so far as to test the generation of the GCM netCDFs and NLDAS netCDF on Tallgrass with a subset of 1000 sites (e.g., for NLDAS). Uncompressed, the NLDAS netCDF (with ice flag and temperature predictions for 1000 sites, at a restricted set of depths) is 3.8 GB. The GCM netCDFs are each 5.4 GB.
Jordan noted that the `nco` and `netcdf` packages might be available as modules on Tallgrass, and they are. I loaded those modules alongside the `singularity` and `slurm` modules, and then tried building the NLDAS netCDF with compression (switched this arg to `TRUE`, commented out these lines), but I got this error, which suggests that the system commands couldn't be called by R within the container. I then tried running the system commands (`ncks -h --fl_fmt=netcdf4 --cnk_plc=g3d --cnk_dmn time,10 --ppc temp=.2#ice=1 GLM_NLDAS_uncompressed.nc GLM_NLDAS.nc`) directly on an allocated node (but NOT in the singularity container) and was able to compress the netCDF. So it seems like it should be possible; it may just take some troubleshooting to ensure that `nco` and `netcdf` are accessible by R in the container. The compressed NLDAS netCDF file (w/ ice flags and preds for 1000 sites) is 297 MB. One of the GCM netCDFs (w/ ice flags and preds for 1000 sites) is 422 MB when compressed.
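For what it's worth, here is a hedged sketch of wrapping that same `ncks` call in R with `system2()`; the wrapper function and file names are illustrative, not the pipeline's code:

```r
# Illustrative wrapper around the ncks command shown above (not pipeline code)
compress_nc <- function(nc_in, nc_out) {
  status <- system2(
    "ncks",
    args = c("-h", "--fl_fmt=netcdf4", "--cnk_plc=g3d",
             "--cnk_dmn", "time,10", "--ppc", "temp=.2#ice=1",
             nc_in, nc_out)
  )
  if (status != 0) stop("ncks failed; check that the nco module is loaded and on the PATH")
  invisible(nc_out)
}

# compress_nc("GLM_NLDAS_uncompressed.nc", "GLM_NLDAS.nc")
```

Wrapping it this way would at least surface a clear error when `nco` isn't visible to R inside the container.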
I did modify Dave's `read_timeseries_dsg()` code so that I could extract results from the 3D netCDF files. That code is in a script here - detached from the pipeline for now. The code runs for the netCDF files I generated locally w/ a small # of sites, but I just tested it for the NLDAS netCDF I generated for 1000 sites on Tallgrass and the `nc_meta()` function (from the `ncmeta` package) returned an error 😕, so I'll have to troubleshoot that when I return.
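For reference, a bare-bones way to pull the 3D variable back out with `ncdf4` (reading the toy file from the sketch above, not the modified `read_timeseries_dsg()` code) would look roughly like this:

```r
library(ncdf4)

nc <- nc_open("example_3d.nc")
temp   <- ncvar_get(nc, "temp")   # array with dims site x time x depth
depths <- ncvar_get(nc, "depth")  # depth coordinate values
dim(temp)
nc_close(nc)
```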
When I'm back I'd be happy to test building and compressing a full NLDAS netCDF with predictions for all of the sites (at restricted depths).
Great summary for capturing the current state of this work. Looking forward to chatting when you get back 🌴
Quick update - Anthony was interested in this netCDF code briefly a couple of months ago, and in re-running my test scripts locally to refresh my memory, it turned out that the `nc_meta()` error I was getting when testing my extraction code was just an issue with the package, and it is fixed if you install the development version via `devtools`.
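For reference, installing the development version would look something like this (the GitHub repo path is my assumption and worth double-checking):

```r
# Install the development build of ncmeta (repo path assumed; verify before use)
devtools::install_github("hypertidy/ncmeta")
```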
I am testing the scaling of this code and approach beyond 1000 lakes on Tallgrass: `tar_make(p3_nldas_glm_uncalibrated_nc)` took 8.8 min to build 1000 lakes, and the nc file was 3.56 GB (pre-compression, which is a manual step). ~When I scaled up to all 12,688 lakes, it took XX min to build and the resulting nc file was XX GB.~
The job failed after 12.6 hrs with these messages, BUT when I try to see the ones that failed, I get nothing:
```
Error:
! problem with the pipeline.
Execution halted
srun: error: dl-0001: task 2: Exited with exit code 1
• built target p3_nldas_glm_uncalibrated_nc
• end pipeline: 12.6 hours
Warning message:
In data.table::fread(file = database$path, sep = database_sep_outer, :
Stopped early on line 54797. Expected 18 fields but found 35. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<p2_nldas_glm_uncalibrated_runs_5ec408cf|branch|d19943e89d4ddbbf|9eb9eb029eb5dc2d|abe35dc056b2b213|347759575||t19230.0415046296s|710c613aa5de3636|608|rds|local|vector|p2_nldas_glm_uncalibrated_runs||7.95|Custom path to GLM executable set via GLM_PATH environment variable as usrlocalbinGLMglm. Custom path to GLM executable set via GLM_PATH environment variable as usrlocalbinGLMglm|p2_nldas_glm_uncalibrated_runs_5ec408cf|branch|d19943e89d4ddbbf|9eb9eb029eb5dc2d|abe35dc056b2b213|347759575||t19230.04>>
```
Check the ones that errored:
```r
> tar_meta(fields = error, complete_only = TRUE)
# A tibble: 0 × 2
# … with 2 variables: name <chr>, error <lgl>
```
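Side note: the `fread` warning above points at a malformed row in the targets metadata store itself. Assuming the default `_targets/meta/meta` path and pipe separator, a quick way to eyeball it would be something like:

```r
library(data.table)

# Read the targets metadata store with the fill = TRUE workaround suggested
# in the warning, then look at the rows around where fread reported stopping
meta_raw <- fread("_targets/meta/meta", sep = "|", fill = TRUE, quote = "")
meta_raw[54790:54800]
```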
Ok the NLDAS netCDF for 5k lakes built in 1.5 hours and is 17.4 GB uncompressed. After compression it is 1.7 GB.

Ok the NLDAS netCDF for 10k lakes built in 5.3 hours and is 37.8 GB uncompressed. After compression it is 3.4 GB.
That's a lot of hours but 3.4 GB is great! Will likely need to talk with Andy about splitting his 63k up, though.
Currently the GLM output is being stored in feather files, with one feather file per lake-GCM combo (6 files per lake). For sharing on ScienceBase, we are currently (per #20) zipping these feather files together by tile number (4 zip files in total).
Per Jordan's comments, we'd like to move to storing the output in netCDF DSG format. As GLM generates temperature profiles, this would mean adding another dimension for depth. That is not currently supported by the `write_timeseries_dsg()` function of `ncdfgeom`, but I will submit an issue there to see if it would be within scope of that function to add that functionality.