DOI-USGS / lake-temperature-process-models


Get GLM output into netCDF DSG format #31

Open · hcorson-dosch-usgs opened this issue 2 years ago

hcorson-dosch-usgs commented 2 years ago

Currently the GLM output is being stored in feather files, with one feather file per lake-GCM combo (6 files per lake). For sharing on ScienceBase, we are currently (per #20) zipping these feather files together by tile number (4 zip files in total).

Per Jordan's comments,

> I'd like to propose a new data release format that uses netcdf discrete sampling geom to put all of the lakes in a single file, like we did with Jared's EA-LSTM data release. To do so, we'd need to address one challenge with depth but there are options for that.

we'd like to move to storing the output in netCDF DSG format. Because GLM generates temperature profiles, this would mean adding another dimension for depth. That is not currently supported by the write_timeseries_dsg() function in ncdfgeom, but I will submit an issue there to see whether adding that functionality would be within scope.
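
For context, the existing 2D usage looks roughly like this - a minimal sketch with made-up lake IDs and values; argument names follow the ncdfgeom documentation and may differ slightly by version:

```r
library(ncdfgeom)

# Toy 2D case: one temperature series per lake (site x time), no depth dimension.
site_ids <- c("nhdhr_1", "nhdhr_2")   # hypothetical lake IDs
times <- seq(as.POSIXct("1980-01-01", tz = "UTC"), by = "day", length.out = 10)
temps <- data.frame(nhdhr_1 = runif(10, 0, 4),   # one column per lake, one row per timestep
                    nhdhr_2 = runif(10, 0, 4))

write_timeseries_dsg(
  nc_file        = "glm_preds_2d.nc",
  instance_names = site_ids,
  lats           = c(45.1, 46.2),
  lons           = c(-93.2, -94.5),
  times          = times,
  data           = temps,
  data_unit      = "degrees Celsius",
  data_prec      = "float",
  data_metadata  = list(name = "temp", long_name = "Water temperature")
)
```

The missing piece is a third (depth) dimension on the data variable, which is what the issue linked in the next comment asks about.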

hcorson-dosch-usgs commented 2 years ago

Issue submitted here asking whether it would be within scope of write_timeseries_dsg() to add an option for a third netCDF dimension (in our case, depth beneath the lake surface).

hcorson-dosch-usgs commented 2 years ago

Starting to think about this again.

Variables:

Coordinates:

Dims:

hcorson-dosch-usgs commented 2 years ago

Okay I regrouped w/ @jread-usgs on this, as it is again a priority. Here are some notes:

Re: how to group the output into netCDFs

Re: resolution of data storage

Re: inclusion of ice flags alongside temperature predictions

Plan for development

Working locally w/ a subset of 1-5 lakes:

  1. Write ~depth = 0 temperature predictions~ ice flags for a single GCM to netCDF using write_timeseries_dsg() (2D)
  2. Modify created netCDF to add a new variable, temperature, with ~two~ one additional dimension~s~: depth (based on max depth of all lakes)~, and GCM name~ (see the ncdf4 sketch after this list)
  3. Write ~remaining~ temperature predictions (for all depths ~and other GCMs~) to netCDF
  4. ~Write ice flags (for each GCM) to netCDF~ Assess likely size if scaled
  5. Send draft netCDF to Jordan for compression
  6. Based on compressed netCDF size, re-assess approach. If it seems too big:
  7. Explore reducing resolution of predictions at depth
  8. Send draft netCDF to Jordan for compression
  9. Based on size, determine if it's worth pursuing adding a GCM dimension:
  10. Explore adding GCM dimension
  11. Evaluate approach based on compressed netCDF size
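
A rough sketch of how steps 2-3 could work with ncdf4 (dimension/variable names and the 60 m max depth are assumptions, not necessarily what the pipeline will use):

```r
library(ncdf4)

nc <- nc_open("glm_preds_2d.nc", write = TRUE)   # file written in step 1

# New depth dimension out to the max depth across all lakes, at GLM's 0.5 m interval
depths <- seq(0, 60, by = 0.5)                   # 60 m max depth is a placeholder
depth_dim <- ncdim_def("depth", units = "m", vals = depths)

# Dimensions written in step 1 (names depend on write_timeseries_dsg() settings)
site_dim <- nc$dim[["instance"]]
time_dim <- nc$dim[["time"]]

# 3D temperature variable: site x time x depth
temp_var <- ncvar_def("temp", units = "degrees Celsius",
                      dim = list(site_dim, time_dim, depth_dim),
                      missval = -999, prec = "float")
nc <- ncvar_add(nc, temp_var)

# Placeholder fill; the real step 3 would write the GLM temperature profiles here
temp_array <- array(-999, dim = c(site_dim$len, time_dim$len, length(depths)))
ncvar_put(nc, temp_var, temp_array)

nc_close(nc)
```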

Then if all is working, test w/ all MN predictions

hcorson-dosch-usgs commented 2 years ago

Ok - @lindsayplatt, @jread-usgs. I wanted to provide an update here on my progress before I'm out for two weeks. This has been a back-burner item for some months now, but I did make some significant progress when other sprint tasks were completed or blocked.

For both the GCM and NLDAS predictions, I've completed steps 1-4, 6, and 7 (see previous comment). I skipped step 5 b/c it was immediately apparent that the file size was going to be too large without some reduction in the resolution of predictions at depth.

All of my code is in this branch on my fork.

Here's a summary:

NetCDF dimensions

The code generates 3D netCDF files. The ice flags are stored as a 2D TimeSeries variable (with dimensions of site_id and time), while the temperature predictions are stored as a 3D TimeSeriesProfile variable (with dimensions of site_id, time, and depth).
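
To double-check the layout of a generated file, a quick header dump is enough (file name is illustrative):

```r
library(ncdf4)

nc <- nc_open("GLM_NLDAS_uncompressed.nc")
print(nc)     # lists the dims (site, time, depth) and the ice (2D) and temp (3D) variables
nc_close(nc)
```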

Reduction of resolution of predictions at depth

The GLM output predictions are at 0.5 m intervals. If we store all predictions at all depths, the netCDF depth dimension becomes very long, and we store many, many NA values for shallow lakes. Currently I am reducing the resolution of predictions at depth prior to packaging the predictions in the netCDF. For example, for the NLDAS netCDF, the depths are defined (in a somewhat hacky way for now) here, based on Andy's depths. The predictions are then subset to those depths here.
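
Conceptually the thinning step is just a filter against a fixed depth vector - a sketch with illustrative object/column names and an example depth set (not the one actually defined in the pipeline):

```r
library(dplyr)

# Toy long-format GLM output: one row per site x date x depth
glm_preds <- expand.grid(site_id = c("nhdhr_1", "nhdhr_2"),
                         date = as.Date("1980-01-01") + 0:1,
                         depth = seq(0, 60, by = 0.5))
glm_preds$temp <- runif(nrow(glm_preds), 0, 4)

# Coarsened depth set: dense near the surface, sparser at depth (example only)
output_depths <- c(seq(0, 10, by = 0.5), seq(11, 20, by = 1), seq(22, 60, by = 2))

glm_preds_thinned <- glm_preds %>%
  filter(depth %in% output_depths)
```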

Testing netCDF build on Tallgrass

I went so far as to test the generation of the GCM netCDFs and the NLDAS netCDF on Tallgrass with a subset of 1000 sites (e.g., for NLDAS). Uncompressed, the NLDAS netCDF (with ice flag and temperature predictions for 1000 sites, at a restricted set of depths) is 3.8 GB. The GCM netCDFs are each 5.4 GB.

Testing netCDF compression on Tallgrass

Jordan noted that the nco and netcdf packages might be available as modules on Tallgrass, and they are. I loaded those modules alongside the singularity and slurm modules, and then tried building the NLDAS netCDF with compression (switched this arg to TRUE, commented out these lines), but I got this error, which suggests that the system commands could not be called by R from within the container. I then tried running the system commands (ncks -h --fl_fmt=netcdf4 --cnk_plc=g3d --cnk_dmn time,10 --ppc temp=.2#ice=1 GLM_NLDAS_uncompressed.nc GLM_NLDAS.nc) directly on an allocated node (but NOT in the singularity container) and was able to compress the netCDF. So it seems like it should be possible; it just may take some troubleshooting to ensure that nco and netCDF are accessible to R in the container. The compressed NLDAS netCDF file (w/ ice flags and preds for 1000 sites) is 297 MB. One of the GCM netCDFs (w/ ice flags and preds for 1000 sites) is 422 MB when compressed.
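
For reference, the same ncks call invoked from R would look something like this (this is roughly the pattern that failed inside the container):

```r
# Same ncks flags as above, invoked from R; requires the nco module to be
# loaded so that ncks is visible to this R session.
status <- system2(
  "ncks",
  args = c("-h", "--fl_fmt=netcdf4", "--cnk_plc=g3d", "--cnk_dmn", "time,10",
           "--ppc", "temp=.2#ice=1",
           "GLM_NLDAS_uncompressed.nc", "GLM_NLDAS.nc")
)
stopifnot(status == 0)   # a non-zero status would indicate the call failed
```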

Testing extracting the predictions from the netCDF file

I did modify Dave's read_timeseries_dsg() code so that I could extract results from the 3D netCDF files. That code is in a script here - detached from the pipeline for now. The code runs for the netCDF files I generated locally w/ a small # of sites, but I just tested it on the NLDAS netCDF I generated for 1000 sites on Tallgrass and the nc_meta::nc_meta() function returned an error 😕, so I'll have to troubleshoot that when I return.
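
In the meantime, the arrays can also be pulled straight out with ncdf4 as a sanity check (the variable names temp and ice are inferred from the ncks --ppc flags above; the dimension order may differ):

```r
library(ncdf4)

nc <- nc_open("GLM_NLDAS_uncompressed.nc")
depths <- nc$dim$depth$vals      # coordinate values of the depth dimension
temp <- ncvar_get(nc, "temp")    # 3D array, e.g. site x time x depth
ice <- ncvar_get(nc, "ice")      # 2D array: site x time
nc_close(nc)
```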

When I'm back I'd be happy to test building and compressing a full NLDAS netCDF with predictions for all of the sites (at restricted depths).

lindsayplatt commented 2 years ago

Great summary for capturing the current state of this work. Looking forward to chatting when you get back 🌴

hcorson-dosch-usgs commented 1 year ago

Quick update - Anthony was interested in this netCDF code briefly a couple of months ago, and in re-running my test scripts locally to refresh my memory, it turned out that the nc_meta error I was getting when testing my extraction code was just an issue with the package, and it is fixed if you install the development version (e.g., with devtools).

lindsayplatt commented 1 year ago

I am testing the scaling of this code and approach beyond 1000 lakes on Tallgrass:

tar_make(p3_nldas_glm_uncalibrated_nc) took 8.8 min to build 1000 lakes and the nc file was 3.56 GB (pre-compression, which is a manual step). ~When I scaled up to all 12,688 lakes, it took XX min to build and the resulting nc file was XX GB.~
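
(For reference, building and checking a single target like this looks roughly as follows - a sketch; the file.size() call assumes the target's return value is the path to the nc file:)

```r
library(targets)

tar_make(p3_nldas_glm_uncalibrated_nc)                              # build just this target
tar_meta(names = p3_nldas_glm_uncalibrated_nc, fields = seconds)    # build time in seconds
file.size(tar_read(p3_nldas_glm_uncalibrated_nc)) / 1e9             # file size in GB
```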

The job failed after 12.6 hrs with these messages, BUT when I try to see the ones that failed, I get nothing:

Error:
! problem with the pipeline.
Execution halted
srun: error: dl-0001: task 2: Exited with exit code 1
• built target p3_nldas_glm_uncalibrated_nc
• end pipeline: 12.6 hours
Warning message:
In data.table::fread(file = database$path, sep = database_sep_outer,  :
  Stopped early on line 54797. Expected 18 fields but found 35. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<p2_nldas_glm_uncalibrated_runs_5ec408cf|branch|d19943e89d4ddbbf|9eb9eb029eb5dc2d|abe35dc056b2b213|347759575||t19230.0415046296s|710c613aa5de3636|608|rds|local|vector|p2_nldas_glm_uncalibrated_runs||7.95|Custom path to GLM executable set via GLM_PATH environment variable as usrlocalbinGLMglm. Custom path to GLM executable set via GLM_PATH environment variable as usrlocalbinGLMglm|p2_nldas_glm_uncalibrated_runs_5ec408cf|branch|d19943e89d4ddbbf|9eb9eb029eb5dc2d|abe35dc056b2b213|347759575||t19230.04>>

Check the ones that errored:

> tar_meta(fields = error, complete_only = TRUE)
# A tibble: 0 × 2
# … with 2 variables: name <chr>, error <lgl>

hcorson-dosch-usgs commented 1 year ago

Ok, the NLDAS netCDF for 5k lakes built in 1.5 hours and is 17.4 GB uncompressed. After compression it is 1.7 GB.

hcorson-dosch-usgs commented 1 year ago

Ok, the NLDAS netCDF for 10k lakes built in 5.3 hours and is 37.8 GB uncompressed. After compression it is 3.4 GB.

lindsayplatt commented 1 year ago

That's a lot of hours, but 3.4 GB is great! We will likely need to talk with Andy about splitting his 63k lakes up, though.