IndEcol / ODYM

Open Dynamic Material Systems Model
MIT License

Correct application of lifetime model #1

Open stefanpauliuk opened 6 years ago

stefanpauliuk commented 6 years ago

For dynamic stock modelling we need to solve three problems:

1. Decide how exactly to move from continuous time to discrete time intervals: should the outflow of the current year's inflow already be included in that same year?
2. Choose which functions to use: in SciPy and Excel, pdf and cdf have different discretization intervals (pdf(t) is for t +/- 0.5, while cdf(t) is measured at t).
3. Ensure speedy computation.

stefanpauliuk commented 6 years ago

Two effects overlap here:

First: In the discrete case the inflow happens instantaneously in each time interval, and the question is whether one allocates the outflow that occurs during the first year of residence of an inflow quantity to the current year, to the next year, or somewhere in between (a possible reconciliation of the two approaches). A clear explanation and documentation of the chosen discrete approach is needed, and this is lacking in the current code.

Second: Both Excel and scipy.stats.norm give not the continuous but the discrete pdf with a time interval of 1. That means they return the integral of the continuous pdf over one year [which explains why it is so slow in Python!], and the question is then: over the past year, over the coming year, or over +/- 0.5 years? To make it more complicated, the answer differs depending on whether you take the discrete pdf or the cdf (t +/- 0.5 years for the pdf, -infinity to t for the cdf)! That difference is very likely also the reason why the current code uses the cdf for the Weibull distribution but the pdf for the normal distribution; it does not mean that the two are consistent, which needs to be checked.

So when one combines effects 1 and 2, one gets a bunch of different situations that under- or overestimate the actual inflow of the continuous case, much like the left and right sums of a Riemann integral: https://en.wikipedia.org/wiki/Riemann_integral
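
To make the Riemann analogy concrete, here is a minimal sketch with made-up lifetime parameters (mean 60, standard deviation 20): summing the continuous normal pdf at the start, end, or middle of each one-year bin over- or underestimates the exact per-year probability mass given by the cdf difference, just like left, right, and midpoint Riemann sums.

import numpy as np
import scipy.stats

mean, sd = 60, 20                      # hypothetical lifetime parameters
t = np.arange(0, 100)                  # one-year bins [t, t+1)

# exact probability mass of each one-year bin, from the cdf
exact = scipy.stats.norm.cdf(t + 1, loc=mean, scale=sd) - scipy.stats.norm.cdf(t, loc=mean, scale=sd)

# Riemann-style approximations using the continuous pdf
left  = scipy.stats.norm.pdf(t, loc=mean, scale=sd)        # pdf at the start of the year
right = scipy.stats.norm.pdf(t + 1, loc=mean, scale=sd)    # pdf at the end of the year
mid   = scipy.stats.norm.pdf(t + 0.5, loc=mean, scale=sd)  # pdf at mid-year

# the midpoint rule tracks the exact bin mass most closely; the left and right
# rules over- or underestimate wherever the pdf is rising or falling
print(np.abs(left - exact).max(), np.abs(right - exact).max(), np.abs(mid - exact).max())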

TomerFishman commented 6 years ago
  1. I ran some tests and it seems that the speeds of the distribution functions in scipy behave in very interesting ways:

     a. The speed is almost identical for the pdf, cdf, or sf (survival function, i.e. 1-CDF) of a given distribution, and the speed differences between Weibull and normal are negligible.

     b. Invoking scipy.stats.norm.cdf(year, loc=mean, scale=sd) is 10 times faster than invoking scipy.stats.norm(loc=mean, scale=sd).cdf(year), despite being the same function. This is true for Weibull as well as norm, and for pdf and sf as well as cdf. This may explain the vast differences in speed in the original DSM code. It's a strange phenomenon but good to know.

     c. A nice surprise is that the speed hardly changes regardless of whether the input is a single number (i.e. a single year) or a numpy array of years, which returns an array of pdf, cdf, or sf outputs. That means we can avoid loops to create survival curves and save a lot of time.

Please check the attached code to see what I mean (note that, at least in Spyder, %timeit can't be run as a full script and has to be run as a line or selection of lines). I added my speed-test results as comments; the tests were conducted on an i7 Windows 10 laptop with 16 GB of memory.

  2. Regarding the choice of discretization approach, there are several factors to consider. First, there's actually no discrepancy between our approaches! We both adhere to the standard convolution formulation and let the sum operator index run until the impulse response function f(x-y) reaches the value of f(0), not f(1) or f(-1). Obviously we should not change this, so we need to work around the fact that pdf(0) and/or cdf(0) will be called by the convolution (a short sketch of this convolution follows after the fourth point below).

Second, the “big picture” context: if empirically measured outflow statistics were available, they would be for the calendar year. Therefore the outflows we model should also be for the calendar year, meaning that the outflows labelled 2016 should cover the interval 2016-2017, not the interval 2015-2016 or a mid-point interval of +/- 0.5 years. This is also relevant for comparison with other statistics: we won't be able to get GDP values for +/- 0.5-year intervals.

Third, I ran some simple tests comparing Python's normal and Weibull pdf to various deltas of the cdf: (cdf)-(cdf-1), (cdf+1)-(cdf), and (cdf+0.5)-(cdf-0.5). The first two are offset backward or forward by half a year relative to the pdf. The +/-0.5 delta approximates the pdf closely but is not exactly the same as the pdf output. This calls into question whether any of them is actually useful. The code for this is also included in the attached .py file.

Fourth, this may actually be a non-issue, because we are modeling approximate values and their “accuracy” is virtually meaningless. Even if we have a handful of empirical statistics to compare and calibrate with, those empirical results suffer from uncertainties, averaging, simplifications, rounding, measurement errors... So what does accuracy even mean in this case? Which of the two is under- or overestimated? I have a feeling that the “error” of using the -1 or +1 delta instead of the +/-0.5 delta would still be smaller than the error of the -1 delta relative to the empirical measurements. I would claim that our priority should rather be internal consistency, and for that, as usual, we should revert to the mass balance principle.
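
As a minimal sketch of the discrete convolution mentioned in the first point (hypothetical stand-alone names, not the ODYM code): the outflow in year t sums the inflows of all earlier cohorts c weighted by the discretized lifetime distribution at age t - c, with the index running down to age 0 so that f(0) is indeed called.

import numpy as np

def outflow_by_convolution(inflow, f):
    # inflow: inflows per year; f: discretized lifetime distribution over ages 0..N-1
    n = len(inflow)
    outflow = np.zeros(n)
    for t in range(n):
        for c in range(t + 1):              # cohorts c = 0..t, so the age t - c reaches 0
            outflow[t] += inflow[c] * f[t - c]
    return outflow

# tiny made-up example: constant inflow and a three-year discretized lifetime
print(outflow_by_convolution(np.ones(5), np.array([0.1, 0.6, 0.3, 0.0, 0.0])))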

All these lead me to suggest the following recommendations:

  1. For speed, optimization, and simplicity, the best option for creating a survival curve is to use the .sf flavor of scipy's distribution functions and to feed it an array of years rather than using for loops.
  2. There's no reason to use delta-cdf formulations: they double the calculation time (by calling .cdf twice), and none of the three options really achieves what we want.
  3. Instead of using pdfs for outflow cohorts, outflows should be calculated using the mass balance equation, with the delta stock calculated from .sf (see the sketch right after this list).
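
A minimal sketch of what recommendation 3 could look like (hypothetical variable names and lifetime parameters; the actual ODYM implementation may differ): the stock of a cohort is the inflow scaled by the survival function, and the outflow follows from the mass balance as the negative change of that stock.

import numpy as np
import scipy.stats

years = np.arange(0, 100)                            # age of the cohort in years
inflow_0 = 1000.0                                    # hypothetical inflow of the year-0 cohort
sf = scipy.stats.weibull_min.sf(years, 2.5, 0, 50)   # survival curve over age

stock_cohort = inflow_0 * sf                         # stock of the cohort in each year
outflow_cohort = np.zeros_like(stock_cohort)
outflow_cohort[0] = inflow_0 - stock_cohort[0]       # non-zero only if sf[0] < 1
outflow_cohort[1:] = -np.diff(stock_cohort)          # mass balance: outflow = -delta stock

# mass balance check: everything that flowed in is either still in stock or has flowed out
assert np.isclose(inflow_0, stock_cohort[-1] + outflow_cohort.sum())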

TomerFishman commented 6 years ago

import scipy.stats
import numpy as np
import matplotlib.pyplot as plt

# a single year
year = 14
# an array of years
yeararray = np.arange(0,100,1)

# Normal distribution
mean = 60 #feel free to change the parameters
sd = 20

# test the processing time of scipy's normal distribution family:
# for a single year input:
%timeit scipy.stats.norm.pdf(year, loc=mean, scale=sd) #61.3 µs ± 697 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.norm.cdf(year, loc=mean, scale=sd) #53 µs ± 3.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.norm.sf(year, loc=mean, scale=sd) #51.9 µs ± 440 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# compare with:
%timeit scipy.stats.norm(loc=mean, scale=sd).pdf(year) #576 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit scipy.stats.norm(loc=mean, scale=sd).cdf(year) #558 µs ± 5.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit scipy.stats.norm(loc=mean, scale=sd).sf(year) #565 µs ± 5.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# clearly the first formulation is better

# for an array of years input:
%timeit scipy.stats.norm.pdf(yeararray, loc=mean, scale=sd) #66.8 µs ± 459 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.norm.cdf(yeararray, loc=mean, scale=sd) #56.6 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.norm.sf(yeararray, loc=mean, scale=sd) #57.5 µs ± 947 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# compare with a for loop:
def normal_cdf(yearseries):
    # build the normal cdf year by year with an explicit loop (for speed comparison only)
    cdfarray = np.zeros(len(yearseries))
    for i in range(len(yearseries)):
        cdfarray[i] = scipy.stats.norm.cdf(yearseries[i], loc=mean, scale=sd)
    return cdfarray

# ensure that the for loop and the built-in treatment of numpy arrays provide the same results:
np.all(scipy.stats.norm.cdf(yeararray, loc=mean, scale=sd) == normal_cdf(yeararray))

%timeit normal_cdf(yeararray) #5.2 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# this is ~100 times slower than the built-in array treatment, which makes sense because the for loop runs 100 times.

# Weibull distribution
k = 2.5 #feel free to change the parameters
l = 50

# test the processing time of scipy's WEIBULL distribution family:
# for a single year input:
%timeit scipy.stats.weibull_min.pdf(year, k, 0, l) #72.5 µs ± 597 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.weibull_min.cdf(year, k, 0, l) #60.5 µs ± 354 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.weibull_min.sf(year, k, 0, l) #61 µs ± 908 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# compare with:
%timeit scipy.stats.weibull_min(k, 0, l).pdf(year) #549 µs ± 6.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit scipy.stats.weibull_min(k, 0, l).cdf(year) #547 µs ± 7.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit scipy.stats.weibull_min(k, 0, l).sf(year) #542 µs ± 4.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# clearly the first formulation is better

# for an array of years input:
%timeit scipy.stats.weibull_min.pdf(yeararray, k, 0, l) #87.5 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.weibull_min.cdf(yeararray, k, 0, l) #72.7 µs ± 712 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.weibull_min.sf(yeararray, k, 0, l) #75.6 µs ± 4.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# compare with a for loop:
def weib_cdf(yearseries):
    # build the Weibull cdf year by year with an explicit loop (for speed comparison only)
    cdfarray = np.zeros(len(yearseries))
    for i in range(len(yearseries)):
        cdfarray[i] = scipy.stats.weibull_min.cdf(yearseries[i], k, 0, l)
    return cdfarray

# ensure that the for loop and the built-in treatment of numpy arrays provide the same results:
np.all(scipy.stats.weibull_min.cdf(yeararray, k, 0, l) == weib_cdf(yeararray))

%timeit weib_cdf(yeararray) #6.14 ms ± 56.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# the loop runs 100 times and is ~100 times slower than the array calculation.

# discretization comparisons
plt.plot(yeararray, scipy.stats.norm.pdf(yeararray, loc=mean, scale=sd), label="pdf")
plt.plot(yeararray, scipy.stats.norm.cdf(yeararray, loc=mean, scale=sd) - scipy.stats.norm.cdf(yeararray-1, loc=mean, scale=sd), label="(cdf)-(cdf-1)")
plt.plot(yeararray, scipy.stats.norm.cdf(yeararray+1, loc=mean, scale=sd) - scipy.stats.norm.cdf(yeararray, loc=mean, scale=sd), label="(cdf+1)-(cdf)")
plt.plot(yeararray, scipy.stats.norm.cdf(yeararray+0.5, loc=mean, scale=sd) - scipy.stats.norm.cdf(yeararray-0.5, loc=mean, scale=sd), label="(cdf+0.5)-(cdf-0.5)")
plt.legend(loc='best', numpoints=1)

norm_pdf = scipy.stats.norm.pdf(yeararray, loc=mean, scale=sd)
norm_cdf05 = scipy.stats.norm.cdf(yeararray+0.5, loc=mean, scale=sd) - scipy.stats.norm.cdf(yeararray-0.5, loc=mean, scale=sd)

plt.figure()  # start a new figure so the Weibull comparison does not overlap the normal one
plt.plot(yeararray, scipy.stats.weibull_min.pdf(yeararray, k, 0, l), label="pdf")
plt.plot(yeararray, scipy.stats.weibull_min.cdf(yeararray, k, 0, l) - scipy.stats.weibull_min.cdf(yeararray-1, k, 0, l), label="(cdf)-(cdf-1)")
plt.plot(yeararray, scipy.stats.weibull_min.cdf(yeararray+1, k, 0, l) - scipy.stats.weibull_min.cdf(yeararray, k, 0, l), label="(cdf+1)-(cdf)")
plt.plot(yeararray, scipy.stats.weibull_min.cdf(yeararray+0.5, k, 0, l) - scipy.stats.weibull_min.cdf(yeararray-0.5, k, 0, l), label="(cdf+0.5)-(cdf-0.5)")
plt.legend(loc='best', numpoints=1)

weib_pdf = scipy.stats.weibull_min.pdf(yeararray, k, 0, l)
weib_cdf05 = scipy.stats.weibull_min.cdf(yeararray+0.5, k, 0, l) - scipy.stats.weibull_min.cdf(yeararray-0.5, k, 0, l)

nheeren commented 6 years ago

> c. A nice surprise is that the speed hardly changes regardless of whether the input is a single number (i.e. a single year) or a numpy array of years, which returns an array of pdf, cdf, or sf outputs. That means we can avoid loops to create survival curves and save a lot of time.

This is what I ended up doing most of the time -- pre-sampling. We could put the distributions or random numbers into an array or dictionary that is created at the beginning.
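
As a minimal sketch of such pre-sampling (made-up product names and lifetime parameters, not the ODYM data structures), survival curves could be computed once up front and stored in a dictionary keyed by product type:

import numpy as np
import scipy.stats

ages = np.arange(0, 100)
# hypothetical Weibull (shape, scale) lifetime parameters per product type
lifetime_params = {'buildings': (2.5, 50), 'vehicles': (3.0, 15)}

# pre-sample all survival curves once, before the model loops start
sf_lookup = {product: scipy.stats.weibull_min.sf(ages, shape, 0, scale)
             for product, (shape, scale) in lifetime_params.items()}

# inside the model, only fast array lookups and multiplications remain
stock_cohort = 1000.0 * sf_lookup['vehicles']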

stefanpauliuk commented 6 years ago

Thanks a lot for checking the different stats methods so carefully!

I started revising the dynamic stock methods according to your suggestions. A first version is running on the master branch now, but some of the unit tests need to be rewritten as we use sf instead of pdf now. I asked Steffi, who is a PhD student in the Freiburg group, to have a look and work on it.

Using sf, the inflow-driven model can now be reformulated as

self.s_c = np.einsum('c,tc->tc', self.i, self.sf)

where s_c is the stock by cohort, i is the inflow, and sf is the survival function array. That brevity alone makes the change worthwhile, and np.einsum is my favourite np function!
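
For readers not familiar with np.einsum, a small stand-alone sketch of what the subscript string 'c,tc->tc' does (made-up arrays, not the class attributes above): every column c of the survival array is scaled by the inflow of cohort c, which is equivalent to an explicit loop over cohorts.

import numpy as np

n = 5
i = np.arange(1.0, n + 1)               # made-up inflow per cohort c
rng = np.random.default_rng(0)
sf = np.tril(rng.random((n, n)))        # made-up survival array sf[t, c], zero above the diagonal

s_c_einsum = np.einsum('c,tc->tc', i, sf)

s_c_loop = np.zeros((n, n))
for c in range(n):
    s_c_loop[:, c] = i[c] * sf[:, c]    # the same scaling written as an explicit loop

assert np.allclose(s_c_einsum, s_c_loop)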

The model is quite fast now. I agree with Niko that the pdf / sf arrays need to be pre-sampled, which is done in the current version of the script. Once the lifetime data are stable, we can even create the sf array outside the ODYM-RECC framework and then directly load it as a parameter instead of the lifetime values themselves. Right now it takes 30-60 seconds to compute the sf array, which is OK for the moment.

stefanpauliuk commented 6 years ago

In addition, the methods were now changed to allow for a non-zero outflow from an age-cohort already in year 0. In practice, that means the methods can handle situations where sf[m,m] != 1. This is important for products with short lifetimes, and for modellers who do not want to use the current calculation method for the sf, which is evaluated at the end of the year and thus assumes that the inflow, which in reality is distributed across the year, happens instantaneously at the end of the year.
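
A small numerical sketch of what sf[m,m] != 1 means in practice (made-up survival values, not the ODYM methods): if only 90 % of a cohort survives to the end of its arrival year, the remaining 10 % already appears as outflow in that year, and the mass balance still closes.

import numpy as np

inflow_m = 100.0
sf_cohort = np.array([0.9, 0.7, 0.4, 0.1, 0.0])   # made-up survival values from age 0 onwards

stock = inflow_m * sf_cohort
outflow = np.empty_like(stock)
outflow[0] = inflow_m - stock[0]                  # 10 units leave already in the arrival year
outflow[1:] = -np.diff(stock)

assert np.isclose(inflow_m, stock[-1] + outflow.sum())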

TomerFishman commented 6 years ago

Excellent, thank you for the quick implementations!

The Einstein sum implementation is very elegant. I'd suggest that, for clarity, this line of code would probably benefit from a small comment explaining the function's syntax. It's likely that not everyone in the IE and MFA community is familiar with this special syntax and with the Einstein summation convention in general.

stefanpauliuk commented 6 years ago

Comment added and pushed!