NCAR / GPEP

GNU General Public License v3.0

Allow missing values in input data #16

Open guoqiang-tang opened 5 months ago

guoqiang-tang commented 5 months ago

Currently, local regression does not allow missing values in the input data, whereas the global regression methods do. This problem can be addressed easily by searching for neighboring stations for each grid cell at each time step. This should be added later.

andywood commented 5 months ago

This functionality is basically gap-filling / imputation to achieve a degree of stationarity in the input network, which is not a completely trivial module to provide. The decision to require users to provide complete, QC'd/filled inputs was meant to encourage users to recognize that creating gridded datasets from patchy/incomplete input station data is not advisable (e.g., what is done in Maurer/Livneh, in contrast to operational surface analysis groups). An alternative is not to bulk up GPEP with a possibly extensive data imputation module, but rather to provide that functionality as a supporting processing tool that the user can apply if needed.

guoqiang-tang commented 5 months ago

Several people have asked me about the possibility of using incomplete input data, which is the main motivation for adding this ability in the future. My plan for this problem is much simpler than imputation. Here is the design:

(1) Add a parameter in the configuration file (e.g., incomplete_input_data = true or false)

(2) If incomplete_input_data is false (the default), GPEP will work as it does now: it finds the neighboring stations for each target point and saves that information to a NetCDF file during the data processing step, before spatial regression happens.

(3) If incomplete_input_data is true, GPEP will find the neighboring stations for each target point during its spatial regression step. The search will be performed for every time step. So if there are missing values in input station data, the neighboring stations for a target point could be different for different time steps.

This design will mainly use the functions in near_stn_search.py. From the coding perspective, it is not a major change.
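A minimal sketch of the per-time-step search in step (3), under stated assumptions: the function name `nearest_valid_stations`, the array layout (stations by time steps, NaN for missing), and the planar Euclidean distance are all hypothetical and are not taken from near_stn_search.py.

```python
import numpy as np

def nearest_valid_stations(target_xy, stn_xy, stn_values, n_neighbors):
    """For each time step, pick the nearest stations that have valid data.

    stn_xy:     (n_stations, 2) station coordinates (assumed planar)
    stn_values: (n_stations, n_times) observations, NaN marks a missing value
    Returns an int array (n_times, n_neighbors); unused slots are -1.
    """
    # Distance from the target point to every station (illustrative metric).
    dist = np.hypot(stn_xy[:, 0] - target_xy[0], stn_xy[:, 1] - target_xy[1])
    order = np.argsort(dist, kind="stable")  # nearest station first

    n_times = stn_values.shape[1]
    result = np.full((n_times, n_neighbors), -1, dtype=int)
    for t in range(n_times):
        # Keep only stations that report a value at this time step.
        valid = order[~np.isnan(stn_values[order, t])]
        k = min(n_neighbors, valid.size)
        result[t, :k] = valid[:k]
    return result

# Four stations on a line, three time steps, with scattered gaps.
stn_xy = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
stn_values = np.array([
    [1.0, np.nan, 3.0],
    [2.0, 2.5, np.nan],
    [np.nan, 4.0, 5.0],
    [6.0, 7.0, 8.0],
])
neighbors = nearest_valid_stations((0.2, 0.0), stn_xy, stn_values, n_neighbors=2)
print(neighbors)
```

In this toy case the selected neighbors change from one time step to the next, which is the behavior described in step (3). The loop over time steps also shows where the extra cost comes from.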

The potential downsides of this method include slower speed, lower accuracy than with complete input data, inhomogeneity, etc. These issues depend on many factors, such as station density. Users will have to judge for themselves whether to use incomplete station data.

andywood commented 5 months ago

I think it is debatable whether we want to put this in. The downside is code/options bloat and encouraging a usage of GPEP that is known to be a thoughtless/uncontrolled approach to handling gaps in station data. The user has no idea which stations & timesteps are affected and the impact, unless even more code is written to output detailed records of the gap filling with metadata. If gap filling is an external routine, the user will put more thought into it and appropriately complex methods can be used, versus simpler kluges. I would favor putting a proper routine in if any routine goes in. Also, this is not a priority at this time for us, given other project work.
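Purely as an illustration of the kind of record mentioned above, a small sketch that reports which stations and time steps contain gaps. The function name `summarize_gaps` and the input layout (stations by time steps, NaN for missing) are assumptions, not part of GPEP.

```python
import numpy as np

def summarize_gaps(stn_values, stn_ids):
    """Report the missing fraction per station and the time steps with any gap."""
    missing = np.isnan(stn_values)
    # Fraction of time steps missing at each station.
    per_station = dict(zip(stn_ids, missing.mean(axis=1).tolist()))
    # Indices of time steps where at least one station is missing.
    steps_with_gaps = np.flatnonzero(missing.any(axis=0)).tolist()
    return per_station, steps_with_gaps

stn_values = np.array([
    [1.0, np.nan, 3.0, 4.0],
    [2.0, 2.0, 2.0, 2.0],
])
per_station, steps_with_gaps = summarize_gaps(stn_values, ["A", "B"])
print(per_station, steps_with_gaps)
```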

guoqiang-tang commented 5 months ago

I agree this is not a priority. That's why I created an issue here instead of making a code change immediately. The benefits and downsides of allowing incomplete input data have become more of a scientific problem, going beyond the scope of software development. Actually, if we include this ability in GPEP, it will become a suitable tool for anyone who is interested and has the time to investigate this problem further. We can come back to this problem later.