lindsayplatt closed this issue 2 years ago
Just wanted to note here Jordan's comment from 2/4 in our chat:
> Hopefully it is just those MN ones that are empty though. We could do something special for those lakes that we'd document and explain carefully. Like use an average of the cells around the lake or use the cell closest to the centroid.
Quoting Jordan again from an email explaining why they are missing. Woah!
> They are missing from the drivers because they are big enough to be part of the water mask used for the climate models
Okay I've done some work to wrap my head around the issue and possible approaches.
All queried cells are in dark grey on this map. The queried cells that are missing data are in red. There are 10 total, across 2 tiles. There are a total of 123 lakes within these cells, 21 of which are modelable by GLM. The cell with the thick red border (`cell_no = 17050`) has `NaN` values for AirTemp, RelHum, Rain, and Snow, but not for Shortwave, Longwave, or WindSpeed. The other 9 cells have `NaN` values for all 7 variables:
To replace the missing data, we can work with either A) the pool of all the available GCM data on GDP, or B) the pool of all of the queried data that we have downloaded. The replacement would probably happen (if needed) within the munging step.
We would pursue (A) if we wanted to replace the data for each missing cell with the data from as many surrounding cells as possible (in light tan, the # of surrounding cells varies from 6-8):
But some of those cells themselves are NA cells:
In addition, pulling data for those surrounding cells would mean a new call to GDP. That would require more time to re-download the data, as well as hard-coded assumptions about which cells are missing data, and therefore which cells need to be added to the query.
For these reasons, Lindsay and I are planning to pursue (B), which will generate data for the missing cells using the data from the already-queried cells.
For option (B) we could, as Jordan suggested, either 1) use an average of the queried cells that surround each cell with missing data (the # of surrounding cells varies from 1-5):
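A minimal sketch of what (B1) could look like, assuming a long-format driver table; the function name and the `cell_no`/`time` columns are hypothetical, not actual pipeline names:

```r
library(dplyr)

# Sketch of option (B1): fill a missing cell by averaging its queried
# neighbor cells at each timestep. Column names are hypothetical.
fill_from_neighbors <- function(driver_data, missing_cell, neighbor_cells) {
  driver_data %>%
    filter(cell_no %in% neighbor_cells) %>%
    select(-cell_no) %>%
    group_by(time) %>%
    summarise(across(everything(), mean), .groups = "drop") %>%
    mutate(cell_no = missing_cell)
}
```

The same function would be called once per missing cell, with that cell's neighbor list from the crosswalk.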
Or, per Jordan's other suggestion, we could 2) replace the data on a per-lake basis, using the data for the cell with the centroid that is closest to each lake centroid. Cell centroids that meet this criterion are in green:
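A rough sketch of the (B2) lookup, assuming projected centroid coordinates so squared Euclidean distance is a fine proxy; all names (`lake_id`, `cell_no`, `x`, `y`, the function itself) are hypothetical:

```r
library(dplyr)

# Sketch of option (B2): assign each lake the cell whose centroid is
# nearest to the lake centroid, considering only cells that have data.
nearest_cell <- function(lake_centroids, cell_centroids) {
  lake_centroids %>%
    rowwise() %>%
    mutate(cell_no = cell_centroids$cell_no[
      which.min((cell_centroids$x - x)^2 + (cell_centroids$y - y)^2)
    ]) %>%
    ungroup()
}
```

In practice this could also be done with `sf::st_nearest_feature()` if the centroids are kept as sf geometries.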
A few considerations:
- `lake_cell_tile_xwalk` is determined using a spatial join. If we pursue option (B2), we would be changing that crosswalk information for the 123 lakes that fall into cells with missing data, so we would need to update that information. I think to do so we'd need to add a step (before the munging step).
- How do we handle the cell (`cell_no = 17050`, above) where only some variables are missing? Do we treat that cell like the other cells that are missing data, or leave it as is? Or do we somehow replace the data for only the missing variables using data from surrounding cells?

Great explanation of the problem and options at hand. I wonder if we can attempt a slightly different approach to where we insert the steps to manage missing data, one that makes them easier by breaking them down into targets after the munge step. My proposal would be to have a way to identify the "missing cells" and then adjust the lake-to-cell xwalk to reflect which cell's file a lake from a missing cell should use instead. I don't have a good sense for what to do about `17050`, except that it feels weird to replace some but not all of the driver data ...
Here's some more detail on this idea:
1. Edit `munge_notaro_to_glm()` to check whether the file has any non-missing values at the very end (or maybe > 75% missing, in order to catch cell `17050`) and save out an empty table if that is the case:
```r
library(tibble)

# Toy example: three columns that are entirely NA/NaN
df <- tibble(col1 = rep(NA, 5), col2 = rep(NaN, 5), col3 = rep(NaN, 5))
col_is_na <- apply(is.na(df), 2, all)  # TRUE for each all-missing column
# If > 75% of columns are entirely missing, write out an empty table instead
if (sum(col_is_na) / length(col_is_na) > 0.75) { df <- tibble() }
```
2. Add a target that identifies any empty `glm_ready_gcm_data_feather` files, e.g. `file.size(glm_ready_gcm_data_feather) == 0`
3. Add a target that creates an adjusted `lake_to_cell_xwalk`, remapping lakes in any of the cells identified as "empty" in the previous target to a different cell (I am leaning towards finding the closest cell centroid to the lake).
Then, when the modeling code goes to grab driver data for a lake, the lake is already mapped to a cell with non-missing data.
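Steps 2 and 3 above could be sketched roughly like this, where `replacements` would come from something like a nearest-centroid lookup; all object and column names here are hypothetical, not actual pipeline targets:

```r
library(dplyr)

# Sketch: given the set of "empty" cells, remap affected lakes in the
# xwalk to a replacement cell that has data.
adjust_xwalk <- function(lake_to_cell_xwalk, empty_cells, replacements) {
  # replacements: table mapping each empty cell_no to a new_cell_no with data
  lake_to_cell_xwalk %>%
    left_join(replacements, by = "cell_no") %>%
    mutate(cell_no = if_else(cell_no %in% empty_cells, new_cell_no, cell_no)) %>%
    select(-new_cell_no)
}
```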
I think that's a nice approach, except that I'm not sure steps 1 and 2 would work given the current mapping. Currently, the data read in by `munge_notaro_to_glm()` is per tile and GCM, so it contains data for many cells, and is then written to a `_munged.feather` file that is also per tile and GCM. So we wouldn't want to write an empty file if only some of the cells for that tile are missing data.
For step 3, I'd also lean toward finding the closest cell centroid to the lake. And I agree with you about it feeling odd to replace some but not all of the driver data for 17050.
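One way around the per-tile issue might be to flag empty cells *within* each munged file rather than writing an empty file for the whole tile; a base-R sketch with a hypothetical function name and columns:

```r
# Identify cells within a multi-cell munged table whose driver variables
# are (nearly) all missing. Names are hypothetical, not from the pipeline.
find_empty_cells <- function(munged_data, driver_vars, threshold = 0.75) {
  frac_missing <- tapply(
    rowMeans(is.na(munged_data[driver_vars])),  # per-row fraction missing
    munged_data$cell_no,                        # grouped by cell
    mean
  )
  names(frac_missing)[frac_missing > threshold]
}
```

The resulting cell IDs could then feed directly into the xwalk-adjustment target in step 3.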
Okay, we discussed more in chat, and Lindsay's approach and the one I proposed in the 2nd bullet under considerations are essentially the same, in that they both focus on not changing any of the actual data, but instead adjusting the xwalk based on which cells are missing data.
@hcorson-dosch plotted the raw NetCDF files and found that these cells have missing data (which matches the missing data in our update here). We need a solution because otherwise we won't have driver data for Lake of the Woods, Upper Red Lake, or Lower Red Lake.