DOI-USGS / lake-temperature-model-prep


Figure out a solution for cells with missing data #296

Closed lindsayplatt closed 2 years ago

lindsayplatt commented 2 years ago

@hcorson-dosch plotted the raw NetCDF files and found that these cells have missing data (which matches the missing data in our update here). We need a solution because otherwise we won't have driver data for Lake of the Woods, Upper Red Lake, or Lower Red Lake.

image

hcorson-dosch-usgs commented 2 years ago

Just wanted to note here Jordan's comment from 2/4 in our chat:

Hopefully it is just those MN ones that are empty though. We could do something special for those lakes that we'd document and explain carefully. Like use an average of the cells around the lake or use the cell closest to the centroid.

lindsayplatt commented 2 years ago

Quoting Jordan again from an email explaining why they are missing. Woah!

They are missing from the drivers because they are big enough to be part of the water mask used for the climate models

hcorson-dosch-usgs commented 2 years ago

Okay I've done some work to wrap my head around the issue and possible approaches.

All queried cells are in dark grey on this map. The queried cells that are missing data are in red. There are 10 total, across 2 tiles. There are a total of 123 lakes within these cells. 21 of those lakes are modelable by GLM. The cell with the thick red border (cell_no = 17050) has NaN values for AirTemp, RelHum, Rain, and Snow, but not for Shortwave, Longwave, or WindSpeed. The other 9 cells have NaN for all 7 variables:

image
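As a hypothetical sketch of that per-cell, per-variable check (cell numbers and values are made up, and it assumes the munged drivers sit in a wide table with one column per variable), the missingness pattern described above could be summarized like this:

```r
# Hypothetical sketch -- cell numbers and values are made up; assumes the
# munged driver data is wide, with one column per driver variable.
library(dplyr)

drivers <- tibble::tibble(
  cell_no   = rep(c(17050, 17051), each = 3),
  AirTemp   = c(NaN, NaN, NaN, 1.2, 0.8, 0.5),
  Shortwave = c(150, 160, 155, NaN, NaN, NaN)
)

# For each cell, is each variable entirely NA/NaN?
all_missing <- drivers %>%
  group_by(cell_no) %>%
  summarize(across(everything(), ~ all(is.na(.x))), .groups = "drop")

all_missing
```

A table like `all_missing` would make it easy to distinguish cells like 17050 (some variables missing) from cells where all 7 variables are NaN.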

To replace the missing data, we can work with either A) the pool of all the available GCM data on GDP, or B) the pool of all of the queried data that we have downloaded. The replacement would probably happen (if needed) within the munging step.

We would pursue (A) if we wanted to replace the data for each missing cell with the data from as many surrounding cells as possible (in light tan; the # of surrounding cells varies from 6-8):

image

But some of those surrounding cells are themselves NA cells:

image

In addition, pulling data for those surrounding cells would mean a new call to GDP. That would require more time to re-download the data, plus hard-coding assumptions about which cells are missing data, and therefore which cells would need to be added to the query.

For these reasons, Lindsay and I are planning to pursue (B), which would generate data for the missing cells from the data for cells we have already queried.

For option (B) we could, as Jordan suggested, either 1) use an average of the queried cells that surround each cell with missing data (the # of surrounding cells varies from 1-5):

image
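A rough sketch of that averaging, option (B1): the neighbor lookup (`neighbor_cells`), cell numbers, and values below are all hypothetical, and it averages one variable per time step across the already-queried neighbors:

```r
# Hypothetical sketch of option (B1): fill a missing cell with the mean of
# its already-queried neighbor cells, per time step. All names/values made up.
library(dplyr)

drivers <- tibble::tibble(
  cell_no = rep(c(101, 102, 103), each = 2),
  time    = rep(1:2, times = 3),
  AirTemp = c(0, 2, 1, 3, 2, 4)
)

neighbor_cells <- c(101, 102, 103)  # queried cells surrounding the missing cell

filled <- drivers %>%
  filter(cell_no %in% neighbor_cells) %>%
  group_by(time) %>%
  summarize(AirTemp = mean(AirTemp), .groups = "drop") %>%
  mutate(cell_no = 999)  # hypothetical id of the missing cell

filled
```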

Or, per Jordan's other suggestion, we could 2) replace the data on a per-lake basis, using the data for the cell whose centroid is closest to each lake centroid. Cell centroids that meet this criterion are in green:

image
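The nearest-centroid lookup of option (B2) could look something like this: a plain Euclidean distance on projected coordinates, with entirely made-up lake and cell coordinates:

```r
# Hypothetical sketch of option (B2): for each lake, pick the queried cell
# whose centroid is nearest the lake centroid. Coordinates are made up and
# assumed to be projected (so Euclidean distance is reasonable).
lake_centroids <- data.frame(
  lake_id = c("lake_A", "lake_B"),
  x = c(0, 10), y = c(0, 10)
)
cell_centroids <- data.frame(
  cell_no = c(1, 2),
  x = c(1, 9), y = c(1, 9)
)

nearest_cell <- sapply(seq_len(nrow(lake_centroids)), function(i) {
  d <- sqrt((cell_centroids$x - lake_centroids$x[i])^2 +
            (cell_centroids$y - lake_centroids$y[i])^2)
  cell_centroids$cell_no[which.min(d)]
})

nearest_cell
```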

A few considerations:

lindsayplatt commented 2 years ago

Great explanation of the problem and the options at hand. I wonder if we can take a slightly different approach to where we insert the steps that manage missing data, making them easier by breaking them down into targets after the munge step. My proposal would be to have a way to identify the "missing cells" and then adjust the lake-to-cell xwalk so that a lake mapped to a missing cell points to a different cell's file instead. I don't have a good sense for what to do about 17050, except that it feels weird to replace some but not all of the driver data ...

Here's some more detail on this idea:

1. Edit `munge_notaro_to_glm()` to check at the very end whether the file has no non-missing values (or maybe > 75% missing, in order to catch cell 17050) and save an empty data.table if that is the case:

    ```r
    df <- tibble(col1 = rep(NA, 5), col2 = rep(NaN, 5), col3 = rep(NaN, 5))

    # Return a single T/F per column for whether all values are NA or NaN
    col_is_na <- apply(is.na(df), 2, all)

    if (sum(col_is_na) / length(col_is_na) > 0.75) { df <- tibble() }

    # Then write out the df
    ```


2. Add a target that identifies any of the empty `glm_ready_gcm_data_feather`, e.g. `file.size(glm_ready_gcm_data_feather) == 0`
3. Add a target that creates an adjusted `lake_to_cell_xwalk` editing any of the cells identified in the previous target as "empty" and mapping the lake to a different cell (I am leaning towards finding the closest cell centroid to the lake).

Then, when the modeling code goes to grab driver data for a lake, the lake is already mapped to a cell with non-missing data.
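The xwalk adjustment in step 3 might look roughly like this, where the lake IDs, cell numbers, and the `replacement` lookup are all hypothetical (in practice the replacement cell would come from the closest-centroid search):

```r
# Hypothetical sketch of step 3: remap lakes whose assigned cell is "empty"
# to a substitute cell. All names and numbers below are made up.
lake_to_cell_xwalk <- data.frame(
  lake_id = c("lake_A", "lake_B", "lake_C"),
  cell_no = c(17050, 20, 30)
)
empty_cells <- c(17050)             # cells whose munged files came back empty
replacement <- c(`17050` = 17051)   # empty cell -> nearest non-empty cell

adjusted_xwalk <- lake_to_cell_xwalk
needs_fix <- adjusted_xwalk$cell_no %in% empty_cells
adjusted_xwalk$cell_no[needs_fix] <-
  unname(replacement[as.character(adjusted_xwalk$cell_no[needs_fix])])

adjusted_xwalk
```

With an adjusted xwalk like this, the downstream modeling code needs no changes at all, since every lake already points at a cell with data.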

hcorson-dosch-usgs commented 2 years ago

I think that's a nice approach, except that I'm not sure steps 1 and 2 would work given the current mapping. Currently, the data read in by munge_notaro_to_glm() is per tile and GCM, so it contains data for many cells, and it is then written to a _munged.feather file that is also per tile and GCM. So we wouldn't want to write an empty file if only some of the cells for that tile are missing data.

For step 3, I'd also lean toward finding the closest cell centroid to the lake. And I agree with you about it feeling odd to replace some but not all of the driver data for 17050.

hcorson-dosch-usgs commented 2 years ago

Okay, we discussed this more in chat. Lindsay's approach and the one I proposed in the 2nd bullet under considerations are essentially the same, in that both focus on not changing any of the actual data, but instead adjust the xwalk based on which cells are missing data.