CartoDB / crankshaft

CARTO Spatial Analysis extension for PostgreSQL
BSD 3-Clause "New" or "Revised" License

geographically weighted regression (gwr) #146

Closed TaylorOshan closed 6 years ago

TaylorOshan commented 7 years ago

This PR is an initial draft of a function that wraps a pysal GWR function for use within crankshaft. It includes the base generalized linear modeling (glm) code, the base geographically weighted regression (gwr) code, and an initial attempt to wrap the main GWR functionality for eventual use within crankshaft. One potential issue that @andy-esch and I spotted for building GWR within PL/Python is that the GWR outputs are dynamic: there are k sets of n x 1 coefficient estimates, standard errors, and t-values, where k is the number of variables that a user inputs into the model. This means it is not possible to declare a fixed output schema beforehand. So far we came up with the idea of stacking the k sets of results, which will then need to be unstacked via a join. The stacking has been implemented but not the unstacking; a rough sketch of the idea is below.
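To make the stacking idea concrete, here is a minimal sketch in plain Python/NumPy (not the code in this PR; the function name `stack_gwr_results` and the column ordering are made up for illustration) of turning the k sets of n x 1 results into one long table keyed by (rowid, variable), ready to be unstacked later via a join/pivot:

```python
# Minimal sketch of "stacking": concatenate the k sets of n x 1 GWR results
# into one long table keyed by (rowid, variable), so the output schema stays
# fixed regardless of how many variables the user supplies.
import numpy as np

def stack_gwr_results(rowids, variable_names, coeffs, std_errs, t_vals):
    """
    rowids:          length-n sequence of row identifiers
    variable_names:  length-k sequence of independent variable names
    coeffs, std_errs, t_vals: n x k arrays of local GWR estimates
    Returns n * k tuples of (rowid, variable, coeff, std_err, t_val).
    """
    stacked = []
    for j, name in enumerate(variable_names):
        for i, rid in enumerate(rowids):
            stacked.append((rid, name, coeffs[i, j], std_errs[i, j], t_vals[i, j]))
    return stacked

# Example: 3 observations, 2 variables -> 6 stacked rows
rowids = [101, 102, 103]
names = ['pctpov', 'pctrural']
coeffs = np.arange(6, dtype=float).reshape(3, 2)
rows = stack_gwr_results(rowids, names, coeffs, coeffs * 0.1, coeffs * 2.0)
```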

I have put together a quick test example in this notebook that creates a mock query result that would eventually be used as input into the crankshaft GWR function.

TaylorOshan commented 7 years ago

Hey, @andy-esch, I think I was able to get the output in the format we discussed: zipped lists of values, except that where there are values for each variable, the output is a list of dicts containing (variable: value) pairs. We might need to change the final results schema, which is currently numeric, to varchar to accommodate the dictionaries, unless it's going to change before it is output to a pg table.
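A rough sketch of the per-row shape being described, assuming the dicts end up JSON-encoded so they fit a text/varchar column (the column names here are illustrative, not the actual crankshaft schema):

```python
# Illustrative only: packing per-variable estimates for one row into
# (variable: value) dicts and JSON-encoding them so they fit a varchar column.
import json

variable_names = ['pctpov', 'pctrural', 'pctblack']
coeffs_row = [0.12, -0.03, 0.45]   # k coefficient estimates for one location
t_vals_row = [2.1, -0.8, 3.3]      # k t-values for the same location

row_output = {
    'coeffs': json.dumps(dict(zip(variable_names, coeffs_row))),
    't_vals': json.dumps(dict(zip(variable_names, t_vals_row))),
}
# row_output['coeffs'] -> '{"pctpov": 0.12, "pctrural": -0.03, "pctblack": 0.45}'
```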

TaylorOshan commented 7 years ago

This should complete the set of GWR features that I was working on. There is now a gwr_predict function, which takes the same input as GWR but carries out prediction of the dependent variable for unsampled locations, as @stuartlynn and I discussed. The way this works is that sampled observation points should be supplied in the usual manner (i.e., as a table with coordinates, dependent variable, and independent variables), and then extra rows should be appended to the table for the unsampled points for which we want to make predictions. Since we still need coordinates and independent variable values for those points, the only difference between the sampled and unsampled rows is that there is no dependent variable value available for the unsampled points; these should be Null in the postgres table. The gwr_predict function then splits the sampled and unsampled points (referred to as train and test sets in the code), passing the sampled points to the main GWR model object and the unsampled points to the predict method of the GWR model object. The result is a table with a row for each unsampled point that includes a set of coefficients, a set of standard errors, a set of t-values, a local R^2, a predicted value of the dependent variable, and a rowid. This functionality was tested by setting the PctBach values of the last 10 rows of the georgia dataset to Null and using the following postgres command: `SELECT * FROM cdb_crankshaft.CDB_GWR_PREDICT('select * from georgia_pred', 'pctbach', Array['pctpov', 'pctrural', 'pctblack']);`. In this case, I had made a copy of the `georgia` table called `georgia_pred` before setting values to Null.
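As an illustration of the split described above, here is a plain-Python sketch (not the actual crankshaft code; the GWR fit/predict steps are only indicated in comments, since the real calls come from the bundled pysal code):

```python
# Illustrative sketch: split query rows into a train set (dependent variable
# observed) and a test set (dependent variable is NULL/None) before fitting
# GWR on the train rows and predicting at the test locations.

def split_train_test(rows, dep_var):
    """rows: list of dicts from the query; returns (train, test) row lists."""
    train = [r for r in rows if r[dep_var] is not None]
    test = [r for r in rows if r[dep_var] is None]
    return train, test

rows = [
    {'x': 0.0, 'y': 0.0, 'pctbach': 12.3, 'pctpov': 20.1, 'rowid': 1},
    {'x': 1.0, 'y': 2.0, 'pctbach': None, 'pctpov': 18.7, 'rowid': 2},  # prediction point
]
train, test = split_train_test(rows, 'pctbach')
# train rows -> fit the GWR model object
# test rows  -> pass to the model's predict method to get predicted pctbach values
```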

TaylorOshan commented 7 years ago

Hey @andy-esch, I know I said this was finished, but I just made a small change so that the output now includes "corrected" or filtered t-values. That is, a more conservative threshold is set based on specialized GWR diagnostics, rather than the classic threshold (±1.96 for a 95% CI), and then all t-values within this new, more conservative (larger) threshold are set to zero. This extra output column is useful for quantifying uncertainty in the GWR estimates. I was using it in my carto maps by setting all spatial units with filtered_t_val (or ct_variable_name) == 0 to grey.
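A minimal sketch of the filtering step being described (illustrative only; the corrected critical value itself comes from the GWR diagnostics in the PR, not from this snippet):

```python
# Illustrative sketch: zero out local t-values whose absolute value falls inside
# the corrected (more conservative, i.e. larger) critical threshold, so only
# estimates that remain significant under the GWR-specific correction keep a
# non-zero t-value. The corrected threshold is assumed to come from the GWR
# diagnostics; 1.96 would be the classic 95% CI value.
import numpy as np

def filter_t_values(t_vals, critical_t):
    """t_vals: n x k array of local t-values; returns a copy with
    values of |t| below critical_t set to zero."""
    filtered = np.array(t_vals, dtype=float, copy=True)
    filtered[np.abs(filtered) < critical_t] = 0.0
    return filtered

t_vals = np.array([[2.5, -0.4], [1.2, -3.1]])
print(filter_t_values(t_vals, critical_t=2.8))  # only -3.1 survives
```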