DOI-USGS / lake-temperature-model-prep


Figure out a solution for cells with missing data #296

Closed lindsayplatt closed 2 years ago

lindsayplatt commented 2 years ago

@hcorson-dosch plotted the raw NetCDF files and found that these cells have missing data (which matches the missing data in our update here). We need a solution because otherwise we won't have driver data for Lake of the Woods, Upper Red Lake, or Lower Red Lake.

image

hcorson-dosch-usgs commented 2 years ago

Just wanted to note here Jordan's comment from 2/4 in our chat:

Hopefully it is just those MN ones that are empty though. We could do something special for those lakes that we'd document and explain carefully. Like use an average of the cells around the lake or use the cell closest to the centroid.

lindsayplatt commented 2 years ago

Quoting Jordan again from an email explaining why they are missing. Woah!

They are missing from the drivers because they are big enough to be part of the water mask used for the climate models

hcorson-dosch-usgs commented 2 years ago

Okay I've done some work to wrap my head around the issue and possible approaches.

All queried cells are in dark grey on this map. The queried cells that are missing data are in red. There are 10 total, across 2 tiles. There are a total of 123 lakes within these cells. 21 of those lakes are modelable by GLM. The cell with the thick red border (cell_no = 17050) has NaN values for AirTemp, RelHum, Rain, and Snow, but not for Shortwave, Longwave, or WindSpeed. The other 9 cells have NaN for all 7 variables:

image
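As a hypothetical sketch of that per-cell, per-variable check (cell numbers and values are made up, and it assumes the munged drivers sit in a wide table with one column per variable), the missingness pattern described above could be summarized like this:

```r
# Hypothetical sketch -- cell numbers and values are made up; assumes the
# munged driver data is wide, with one column per driver variable.
library(dplyr)

drivers <- tibble::tibble(
  cell_no   = rep(c(17050, 17051), each = 3),
  AirTemp   = c(NaN, NaN, NaN, 1.2, 0.8, 0.5),
  Shortwave = c(150, 160, 155, NaN, NaN, NaN)
)

# For each cell, is each variable entirely NA/NaN?
all_missing <- drivers %>%
  group_by(cell_no) %>%
  summarize(across(everything(), ~ all(is.na(.x))), .groups = "drop")

all_missing
```

A table like `all_missing` would make it easy to distinguish cells like 17050 (some variables missing) from cells where all 7 variables are NaN.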

To replace the missing data, we can work with either A) the pool of all the available GCM data on GDP, or B) the pool of all of the queried data that we have downloaded. The replacement would probably happen (if needed) within the munging step.

We would pursue (A) if we wanted to replace the data for each missing cell with the data from as many surrounding cells as possible (in light tan; the # of surrounding cells varies from 6-8):

image

But some of those surrounding cells are themselves NA cells:

image

In addition, pulling data for those surrounding cells would mean a new call to GDP. That would require more time to re-download the data, plus hard-coding assumptions about which cells are missing data, and therefore which cells would need to be added to the query.

For these reasons, Lindsay and I are planning to pursue (B), which would generate data for the missing cells from the data for cells we have already queried.

For option (B) we could, as Jordan suggested, either 1) use an average of the queried cells that surround each cell with missing data (the # of surrounding cells varies from 1-5):

image
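A rough sketch of that averaging, option (B1): the neighbor lookup (`neighbor_cells`), cell numbers, and values below are all hypothetical, and it averages one variable per time step across the already-queried neighbors:

```r
# Hypothetical sketch of option (B1): fill a missing cell with the mean of
# its already-queried neighbor cells, per time step. All names/values made up.
library(dplyr)

drivers <- tibble::tibble(
  cell_no = rep(c(101, 102, 103), each = 2),
  time    = rep(1:2, times = 3),
  AirTemp = c(0, 2, 1, 3, 2, 4)
)

neighbor_cells <- c(101, 102, 103)  # queried cells surrounding the missing cell

filled <- drivers %>%
  filter(cell_no %in% neighbor_cells) %>%
  group_by(time) %>%
  summarize(AirTemp = mean(AirTemp), .groups = "drop") %>%
  mutate(cell_no = 999)  # hypothetical id of the missing cell

filled
```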

Or, per Jordan's other suggestion, we could 2) replace the data on a per-lake basis, using the data for the cell whose centroid is closest to each lake centroid. Cell centroids that meet this criterion are in green:

image
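The nearest-centroid lookup of option (B2) could look something like this: a plain Euclidean distance on projected coordinates, with entirely made-up lake and cell coordinates:

```r
# Hypothetical sketch of option (B2): for each lake, pick the queried cell
# whose centroid is nearest the lake centroid. Coordinates are made up and
# assumed to be projected (so Euclidean distance is reasonable).
lake_centroids <- data.frame(
  lake_id = c("lake_A", "lake_B"),
  x = c(0, 10), y = c(0, 10)
)
cell_centroids <- data.frame(
  cell_no = c(1, 2),
  x = c(1, 9), y = c(1, 9)
)

nearest_cell <- sapply(seq_len(nrow(lake_centroids)), function(i) {
  d <- sqrt((cell_centroids$x - lake_centroids$x[i])^2 +
            (cell_centroids$y - lake_centroids$y[i])^2)
  cell_centroids$cell_no[which.min(d)]
})

nearest_cell
```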

A few considerations:

lindsayplatt commented 2 years ago

Great explanation of the problem and the options at hand. I wonder if we can take a slightly different approach to where we insert the steps that manage missing data, making them easier by breaking them down into targets after the munge step. My proposal would be to have a way to identify the "missing cells" and then adjust the lake-to-cell xwalk so that a lake mapped to a missing cell points to a different cell's file instead. I don't have a good sense for what to do about 17050, except that it feels weird to replace some but not all of the driver data ...

Here's some more detail on this idea:

1. Edit `munge_notaro_to_glm()` to check at the very end whether the file has no non-missing values (or maybe > 75% missing, in order to catch cell 17050) and save an empty data.table if that is the case:

    ```r
    df <- tibble(col1 = rep(NA, 5), col2 = rep(NaN, 5), col3 = rep(NaN, 5))

    # Return a single T/F per column for whether all values are NA or NaN
    col_is_na <- apply(is.na(df), 2, all)

    if (sum(col_is_na) / length(col_is_na) > 0.75) { df <- tibble() }

    # Then write out the df
    ```


2. Add a target that identifies any of the empty `glm_ready_gcm_data_feather`, e.g. `file.size(glm_ready_gcm_data_feather) == 0`
3. Add a target that creates an adjusted `lake_to_cell_xwalk` editing any of the cells identified in the previous target as "empty" and mapping the lake to a different cell (I am leaning towards finding the closest cell centroid to the lake).

Then, when the modeling code goes to grab driver data for a lake, the lake is already mapped to a cell with non-missing data.
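The xwalk adjustment in step 3 might look roughly like this, where the lake IDs, cell numbers, and the `replacement` lookup are all hypothetical (in practice the replacement cell would come from the closest-centroid search):

```r
# Hypothetical sketch of step 3: remap lakes whose assigned cell is "empty"
# to a substitute cell. All names and numbers below are made up.
lake_to_cell_xwalk <- data.frame(
  lake_id = c("lake_A", "lake_B", "lake_C"),
  cell_no = c(17050, 20, 30)
)
empty_cells <- c(17050)             # cells whose munged files came back empty
replacement <- c(`17050` = 17051)   # empty cell -> nearest non-empty cell

adjusted_xwalk <- lake_to_cell_xwalk
needs_fix <- adjusted_xwalk$cell_no %in% empty_cells
adjusted_xwalk$cell_no[needs_fix] <-
  unname(replacement[as.character(adjusted_xwalk$cell_no[needs_fix])])

adjusted_xwalk
```

With an adjusted xwalk like this, the downstream modeling code needs no changes at all, since every lake already points at a cell with data.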

hcorson-dosch-usgs commented 2 years ago

I think that's a nice approach, except that I'm not sure steps 1 and 2 would work given the current mapping. Currently, the data read in by munge_notaro_to_glm() is per tile and GCM, so it contains data for many cells, and it is then written to a _munged.feather file that is also per tile and GCM. So we wouldn't want to write an empty file if only some of the cells for that tile are missing data.

For step 3, I'd also lean toward finding the closest cell centroid to the lake. And I agree with you about it feeling odd to replace some but not all of the driver data for 17050.

hcorson-dosch-usgs commented 2 years ago

Okay, we discussed this more in chat. Lindsay's approach and the one I proposed in the 2nd bullet under considerations are essentially the same, in that both focus on not changing any of the actual data, but instead adjust the xwalk based on which cells are missing data.