UC-MACSS / persp-research_Spr18

Course site for MACS 30200 (Spring 2018) - Perspectives on Computational Research
6 stars 33 forks source link

PS2, exercise 1 help #22

Open rickecon opened 6 years ago

rickecon commented 6 years ago

Exercise 1 of Problem Set 2 asks you to use some data on U.S. bequests (inheritances), plot the raw data, and fit a kernel density estimator. The following helps respond to @jgdenby's question in Issue #21.

Part (a). The first part of this requires you to plot a 2D histogram of the original data, BQmat_orig. If you use the code suggested in the text of the exercise, you will get a NumPy matrix of dimension 78 x 7. This matrix represents three dimensions of data. The rows represent ages, the columns represent lifetime income groups, and each element represents the bequests percentage for each age-income pair. Use the plot_surface() command to make a 2D histogram (3D plot) of this original data, as shown in Section 3.1 of the KDE.ipynb notebook.

The input arguments to the plot_surface(x_mat, y_mat, data3D) function are the following. Let data3D be an m x n matrix of data to be plotted in which the m rows represent the m values of the x variable, and the n columns represent the n values of the y variable. Let the m values of the x variable be listed in a vector x_vec of length (m,), and let the n values of the y variable be listed in a vector y_vec of length (n,). Then the input x_mat to the plot_surface() function is the column vector x_vec.reshape((m, 1)) copied across n columns such that the resulting matrix x_mat has shape (m, n). The input y_mat to the plot_surface() function is the row vector y_vec.reshape((1, n)) copied down m rows such that the resulting matrix y_mat has shape (m, n). This operation to create x_mat and y_mat from x_vec and y_vec can be easily done using the np.meshgrid() function.

y_mat, x_mat = np.meshgrid(y_vec, x_vec)

Part (b). The gaussian_kde() function used in Section 3.1 of the KDE.ipynb notebook takes two arguments and returns a KDE object.

from scipy.stats import gaussian_kde

kde_object = gaussian_kde(data, bw_method)

The bw_method argument is simply the bandwidth of the KDE estimator. This is referred to as lambda in exercise 1. The data input to the function should be 2 x N for bivariate data, where N is the number of data points. The data input is two stacked vectors of x values and y values. For example, in exercise 1, the age vector is

ages_vec = np.arange(18, 96)

and the midpoints of the lifetime ability group percentiles described in exercise 1 are

abils_midpt = np.array([0.125, 0.375, 0.60, 0.75, 0.85, 0.94, 0.995])

The gaussian_kde() function needs data on the age and income variables that reflect the percentages from the raw data bq_data from part(a). The matrix bq_data has the percent of bequests received by each age-income pair. You can generate data for N=70000 observations from this distribution in the following way.

prop_mat_inc = np.sum(bq_data, axis=0)
prop_mat_age = np.sum(bq_data, axis=1)
lrg_samp = 70000
age_probs = np.random.multinomial(lrg_samp, prop_mat_age)
income_probs = np.random.multinomial(lrg_samp, prop_mat_inc)
age_freq = np.array([])
inc_freq = np.array([])

# creating a distribution of age values
for age, num_s in zip(ages_vec, age_probs):
    vec_age_s = np.ones(num_s)
    vec_age_s *= age
    age_freq = np.append(age_freq, vec_age_s)

# creating a distribution of ability type values
for abil, num_j in zip(abils_vec, income_probs):
    vec_abil_j = np.ones(num_j)
    vec_abil_j *= abil
    inc_freq = np.append(inc_freq, vec_abil_j)

data = np.vstack((age_freq, inc_freq))
density = gaussian_kde(data, bw_method=bandwidth)

You can then estimate the value of the KDE smoothed bequests distribution by inputting data for the exact points of the ages_vec and abils_vec, expanded to matrices like x_mat and y_mat above, and then flattened out into pairwise data.

coords = np.vstack([item.ravel() for item in [ages_mat, abils_mat]])
BQkde = density(coords).reshape(ages_mat.shape)
BQkde_scaled = BQkde / BQkde.sum()

Now you can just print your result using plot_surface() for which the inputs are matrices, and you can experiment with different bandwidths to see which resulting KDE estimator looks most like the original data with the noise smoothed out.

AlexanderTyan commented 6 years ago

BQkde_scaled = BQkde / BQkde.sum() should be? BQkde_scaled = BQkde / np.sum(BQkde)

AlexanderTyan commented 6 years ago

Hmm, my KDE surface plot looked weird, didn't seem to match the part A, so I think: for abil, num_j in zip(abils_vec, income_probs): should be for abil, num_j in zip(abils_midpt, income_probs):

Courtesy of @Sun-Kev

jgdenby commented 6 years ago

When plotting my KDE surface plot, it appears to be flipped along the income group dimension relative to my plot in part A – did anyone else have this issue?