Weighting issue

andydawson commented 9 years ago

I think there is an issue with weighting / kernel normalization when we switch to the coarse grids, and maybe even when we just split the domain. For the coarse grids, I am thinking we may need to scale the weights of the grid cells as a function of resolution.

andydawson commented 9 years ago

I've been thinking more about this weighting issue, and I think we should be scaling our weights as a function of grid resolution to keep the scaling in the prediction and calibration models similar. To check how things currently stand, for each core I can compare the sum of the weights over the entire domain on each grid.

For the fine grid (8 km cells), medium grid (24 km cells) and coarse grid (40 km cells) I can summarize the sums of weights divided by the sum of the weights of the potential neighborhood, and get:

> summary(w_1by_umw / sum_w_pot)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1727  0.5879  0.7028  0.6728  0.8015  0.8731 
> summary(w_3by_umw / sum_w_pot)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01899 0.06432 0.07685 0.07398 0.08824 0.09601 
> summary(w_5by_umw / sum_w_pot)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.  
0.006804 0.024210 0.028860 0.027150 0.032280 0.034980

So for a core in the middle of the domain on the fine grid, we have veg data for up to about 87% of the contributing domain. For the coarser grids, the ratios are smaller by factors of roughly 9 and 25, which makes sense (the medium grid has 9 fine cells per medium cell, and the coarse grid has 25 fine cells per coarse cell). For comparison, after scaling the medium and coarse weights by 9 and 25 I get:

> summary(w_1by_umw / sum_w_pot)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1727  0.5879  0.7028  0.6728  0.8015  0.8731 
> summary(w_3by_umw*9 / sum_w_pot)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1709  0.5789  0.6917  0.6658  0.7942  0.8641 
> summary(w_5by_umw*25 / sum_w_pot)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1701  0.6052  0.7215  0.6786  0.8070  0.8744
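The factor-of-9 and factor-of-25 pattern can be reproduced on a toy grid. The sketch below is not the stepps-prediction code: it assumes an isotropic Gaussian kernel with a made-up scale `psi`, a single core at the origin, and a hypothetical 1200 km square domain. The names `kernel_w`, `sum_weights`, and `area_factor` are invented for illustration. Summing a kernel over cell centres acts like a Riemann sum, so a grid with k² times fewer cells gives a sum about k² times smaller unless each weight is multiplied by the number of fine cells it stands in for.

```r
# Toy illustration only: the Gaussian kernel, psi, and the domain size are
# assumptions, not values taken from the STEPPS model.
kernel_w <- function(d, psi = 200) exp(-(d / psi)^2)  # d and psi in km

# Sum of kernel weights from a core at the origin to all cell centres
# of a square grid with spacing `res` km.
sum_weights <- function(res, half_width = 600) {
  centres <- seq(-half_width + res / 2, half_width - res / 2, by = res)
  xy <- expand.grid(x = centres, y = centres)
  sum(kernel_w(sqrt(xy$x^2 + xy$y^2)))
}

res_km <- c(fine = 8, medium = 24, coarse = 40)
raw <- sapply(res_km, sum_weights)

# Number of fine cells per cell at each resolution: (res / 8)^2 = 1, 9, 25
area_factor <- (res_km / res_km[["fine"]])^2
scaled <- raw * area_factor

# Unscaled sums shrink by roughly 1/9 and 1/25 relative to the fine grid;
# scaled sums are close to the fine-grid sum.
print(round(raw / raw[["fine"]], 4))
print(round(scaled / raw[["fine"]], 4))
```

On real data the scaled sums will not match the fine grid exactly, as the summaries above show, because the coarse cells only approximate the kernel over the area they cover.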

andydawson commented 9 years ago

Also looked into splitting the domain. The concern is that since we are not considering the vegetation data for part of the domain, the total sum of the weights of contributing cells will be smaller than it would be if we considered the entire domain (on which the model is calibrated). If this happens, the contribution of the local vegetation ($\gamma \, r(s(i))$) is effectively upweighted. I expect this issue to be most apparent at the boundary where we make the split.

To see how significant this issue is, I can compare the sum of the contributing weights for each core in both the partial and full domains. The proportions of the sum of weights in the partial domain relative to the full domain for the 3by grid (24 km by 24 km cells) are:

> summary(metaW$prop_w)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.6215  0.8965  0.9399  0.9269  0.9898  0.9979 
> summary(metaE$prop_w)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4721  0.7090  0.9275  0.8245  0.9860  0.9998

Overall, this seems okay. As expected, we lose information near the split boundary (indicated by spatial plots of these relative proportions, not shown). This loss of information is also seen near the boundaries of the full domain, where the potential source area for pollen is partly outside of the domain.
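For reference, a proportion like `prop_w` can be sketched on a toy domain as follows. Everything here is assumed for illustration: the Gaussian kernel and its scale, the 1200 km square domain, the split at x = 0, and the core locations. It is not how `metaW` or `metaE` were actually built. It only shows the ratio of partial-domain to full-domain weight sums dropping as a core approaches the split.

```r
# Toy illustration only: kernel, domain, split location, and cores are
# hypothetical and chosen to show the boundary effect.
kernel_w <- function(d, psi = 200) exp(-(d / psi)^2)  # d and psi in km

# 24 km cell centres on a hypothetical 1200 km x 1200 km full domain
centres <- seq(-588, 588, by = 24)
cells <- expand.grid(x = centres, y = centres)

# Hypothetical cores in the western half, moving toward the split at x = 0
cores <- data.frame(x = c(-500, -300, -100, -12), y = 0)

in_west <- cells$x < 0  # cells kept in the western sub-domain

prop_w <- sapply(seq_len(nrow(cores)), function(i) {
  d <- sqrt((cells$x - cores$x[i])^2 + (cells$y - cores$y[i])^2)
  w <- kernel_w(d)
  sum(w[in_west]) / sum(w)  # partial-domain sum relative to full-domain sum
})

print(data.frame(cores, prop_w = round(prop_w, 3)))
```

In this toy setup a core sitting right at the split keeps only a little more than half of its full-domain weight, which is the situation where the local vegetation term would be upweighted the most.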

Ideally we can run the model for the full domain, but I still think results from a split domain will be sufficient if need be.

andydawson / stepps-prediction

Weighting issue #3