Use more stable distribution for differentially uprating wages

When projecting future financial years, the project_to function does not apply the same wages growth figure to taxfilers at every point in the distribution. Salaries and wages are 'differentially uprated', with wages growing faster at the bottom and top of the distribution than in the middle. The shape of the curve, and therefore the uprate_factors that are used to project wages, are based on the historical distribution of wages growth. This distribution changes when new sample files are added, as there is an additional year of history over which to estimate the shape of the differential uprating curve.

The current procedure, executed when put-data.R is run, is:

1. Load all sample files up to and including the latest;
2. Drop observations in sample files that contain negative salaries or wages (Sw_amt);
3. For each financial year, calculate the average salary in each percentile of the salaries and wages distribution;
4. For each percentile of the salaries and wages distribution in each year, calculate the growth in average salaries and wages compared to the previous year;
5. For each percentile, calculate the average annual wages growth over the full period for which we have sample files;
6. For each percentile, calculate uprate_factor_raw, which is the percentile's average annual wage growth divided by the average annual wage growth for all percentiles; and
7. Smooth the distribution of uprate factors by running a LOESS regression (span = 0.4) on the uprate_factor_raw values, using the fitted values from this regression as the uprate_factor for each percentile.
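
To make the calculation of uprate_factor_raw concrete, here is a minimal sketch in Python (not the R code in put-data.R). It assumes the per-percentile average wages have already been computed for each year, that "average annual wages growth" is the arithmetic mean of the year-on-year growth rates, and that the denominator is the unweighted mean across percentiles. The function name raw_uprate_factors and the toy figures are hypothetical.

```python
from statistics import mean


def raw_uprate_factors(avg_wage_by_year):
    """Return uprate_factor_raw for each percentile.

    avg_wage_by_year maps a year to a list holding the average wage in
    each percentile for that year (assumed already computed).
    """
    years = sorted(avg_wage_by_year)
    n_pct = len(avg_wage_by_year[years[0]])

    # Growth in each percentile's average wage on the previous year.
    growth = [
        [avg_wage_by_year[y1][p] / avg_wage_by_year[y0][p] - 1
         for p in range(n_pct)]
        for y0, y1 in zip(years, years[1:])
    ]

    # Average annual growth per percentile over the full period
    # (assumption: simple arithmetic mean of the annual rates).
    avg_growth = [mean(g[p] for g in growth) for p in range(n_pct)]

    # Divide by the average across all percentiles
    # (assumption: unweighted mean).
    overall = mean(avg_growth)
    return [g / overall for g in avg_growth]


# Toy data with three "percentiles": the ends grow 10% a year,
# the middle grows 2% a year.
toy = {
    2014: [100.0, 200.0, 400.0],
    2015: [110.0, 204.0, 440.0],
    2016: [121.0, 208.08, 484.0],
}
toy_factors = raw_uprate_factors(toy)
```

Under these assumptions the raw factors average to 1 across percentiles by construction, so a factor above 1 marks a percentile whose wages have grown faster than average.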

Running this procedure using sample files up to and including 2016-17 produces uprate_factors that are quite different to those obtained by running this procedure without the 2016-17 file. This is arguably undesirable, as it means that the projected distribution of income (and therefore projected tax liabilities) alters considerably when the newer uprating factors are used.

I therefore propose that step 7 of the procedure above is amended to use linear regression with a quadratic term rather than LOESS regression. Every other step in the process will remain unchanged.
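
As a rough illustration of the proposed smoothing step, the sketch below fits uprate_factor_raw against percentile with an ordinary least squares regression that includes a squared term, then uses the fitted values as the smoothed factors. It is written in Python using only the standard library, not the package's R code, where the equivalent would typically be a call to lm with a squared percentile term. The function names and the rescaling of percentiles to (0, 1] are assumptions made for the example.

```python
def fit_quadratic(xs, ys):
    """Least squares fit of y = a + b*x + c*x**2; returns (a, b, c)."""
    # Normal equations (X'X) beta = X'y with design columns 1, x, x^2.
    rows = [[1.0, x, x * x] for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]

    # Solve the 3x3 system by Gaussian elimination with partial pivoting.
    m = [xtx[i] + [xty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            m[r] = [mr - f * mc for mr, mc in zip(m[r], m[col])]

    # Back substitution.
    beta = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        tail = sum(m[i][j] * beta[j] for j in range(i + 1, 3))
        beta[i] = (m[i][3] - tail) / m[i][i]
    return tuple(beta)


def smooth_uprate_factors(raw):
    """Replace raw factors with fitted values from the quadratic fit."""
    n = len(raw)
    # Percentiles rescaled to (0, 1] to keep the system well conditioned.
    pct = [(i + 1) / n for i in range(n)]
    a, b, c = fit_quadratic(pct, raw)
    return [a + b * x + c * x * x for x in pct]
```

Because the fitted curve is described by only three coefficients, a single unusual percentile in uprate_factor_raw shifts the whole curve slightly rather than bending it locally.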

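Using a parametric rather than non-parametric technique makes the uprate_factors more robust to idiosyncratic movements in the tails of the distribution. A quadratic is determined by three coefficients fitted to all percentiles at once, so it cannot follow local noise in the way a LOESS fit does, and adding one more year of data should shift the curve less.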

I will amend put-data.R to reflect this change.

HughParsonage / grattan, issue #186