Sum02dean / STRINGSCORE


Predictive EDA #21

Open Sum02dean opened 2 years ago

Sum02dean commented 2 years ago

This will be done to assess the contribution of each channel to the varying probability of a positive classification, in the context of a key feature, aka the feature of interest (FOI). I propose we do this in the following way:

Since our data for each feature represent count-based evidence scores for each information channel, we can stratify the data into "evidence bins" using equal-frequency binning:

Evidence bins:

- zero-evidence: x == 0
- low-evidence: 0 < x < 25% of max value
- moderate-evidence: 25% of max value <= x < 75% of max value
- high-evidence: 75% of max value <= x <= max value
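A minimal sketch of this binning step, assuming x is a 1-D numpy array of evidence scores for a single channel. Note the edges below follow the fraction-of-max thresholds listed above; true equal-frequency bins would use quantile edges (e.g. np.quantile) instead:

import numpy as np

def build_evidence_bins(x):
    # Split a channel's evidence scores into the four bins defined above
    x = np.asarray(x, dtype=float)
    x_max = x.max()
    return {
        "zero": x[x == 0],
        "low": x[(x > 0) & (x < 0.25 * x_max)],
        "moderate": x[(0.25 * x_max <= x) & (x < 0.75 * x_max)],
        "high": x[x >= 0.75 * x_max],
    }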

These bins then act as reservoirs for sampling X data during inference. Once the Bayesian model is fitted, this will allow parameter distributions to be configured for each feature. We can then estimate and visualize how the interaction probability varies for a given channel while all other features are kept to a representative evidence range, i.e. all features apart from the feature of interest (FOI) are sampled from the desired evidence bin. Conversely, the FOI data are sampled from a continuous linear range which extends across all evidence levels.

To illustrate:

Training:
1) Train the model on all features regardless of their ranges.
2) Map each feature's values to an evidence bin:

evidence_pools = {feature_1: {"moderate": [1, 50), ..., "high": [150, max]}, ..., feature_n: ...}
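In code, step 2 could look like the sketch below, reusing the hypothetical build_evidence_bins helper from above and assuming df is a pandas DataFrame with one column of evidence scores per channel:

evidence_pools = {
    feature: build_evidence_bins(df[feature].to_numpy())
    for feature in df.columns
}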

Inference:
1) Select a FOI, e.g. "fusion", and use all values in a simulated linear range bounded by the range of its training data: simulated_fusion_data = np.linspace(0, max(x), N)
2) For all other (non-FOI) features, sample N datapoints from the respective evidence bins:

for feature in features_excluding_FOI:
    evidence_groups = evidence_pools[feature]
    for group_name, group_values in evidence_groups.items():
        # Draw an equal share of the N inference points from each evidence bin
        sampled_range_specific_data = np.random.choice(group_values, size=N // n_pools)
        ...

The above pseudo-code will sample data from all four evidence bins for each feature, e.g.:

experiments: [zero_evidence_set, low_evidence_set, moderate_evidence_set, high_evidence_set]
textmining: [zero_evidence_set, low_evidence_set, moderate_evidence_set, high_evidence_set]
database: [zero_evidence_set, low_evidence_set, moderate_evidence_set, high_evidence_set]
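As a concrete (hypothetical) version of this sampling, the helper below assembles one inference set for a chosen bin: non-FOI columns are resampled from that bin, while the FOI column sweeps a linear grid across its training range. All names here are placeholders:

import numpy as np

def assemble_X(evidence_pools, foi, bin_name, foi_max, N, rng=None):
    rng = rng or np.random.default_rng()
    features = list(evidence_pools)  # column order follows the pool's keys
    X = np.empty((N, len(features)))
    for j, feature in enumerate(features):
        if feature == foi:
            # FOI sweeps a continuous linear range across all evidence levels
            X[:, j] = np.linspace(0, foi_max, N)
        else:
            # Non-FOI features stay pinned to the chosen evidence bin
            X[:, j] = rng.choice(evidence_pools[feature][bin_name], size=N)
    return X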

Once we have the above datasets, we can plot how the probability of interaction changes with increasing evidence for a chosen FOI, in the context of different evidence intervals for the rest of the data, e.g.:

FOI = fusion

for sample in range(n_theta_sample_runs):
    # Sample thetas from their respective parameter distributions
    # Sample data from the respective evidence bins
    # Use the predefined linear sampling for the FOI

    prediction_zero = sigma(theta_0 + theta_1 * np.random.choice(experiments[zero_evidence_set], N)
                            + theta_2 * np.random.choice(textmining[zero_evidence_set], N) + ...
                            + theta_n * simulated_FOI_data)

    prediction_low = sigma(theta_0 + theta_1 * np.random.choice(experiments[low_evidence_set], N)
                           + theta_2 * np.random.choice(textmining[low_evidence_set], N) + ...
                           + theta_n * simulated_FOI_data)

    prediction_moderate = sigma(theta_0 + theta_1 * np.random.choice(experiments[moderate_evidence_set], N)
                                + theta_2 * np.random.choice(textmining[moderate_evidence_set], N) + ...
                                + theta_n * simulated_FOI_data)

    prediction_high = sigma(theta_0 + theta_1 * np.random.choice(experiments[high_evidence_set], N)
                            + theta_2 * np.random.choice(textmining[high_evidence_set], N) + ...
                            + theta_n * simulated_FOI_data)

Since the parameters are not point estimates (e.g. MLE) but full posterior distributions, we can generate credible intervals for each prediction (shown as the transparent colours) and investigate where the greatest degree of uncertainty lies. It also lets us determine which feature provides the greatest increase in predictive performance, given the level of evidence supporting the other features.
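As a sketch of how the bands could be computed and drawn, assuming theta_draws is an (n_runs, n_features + 1) array of posterior samples and X_bin comes from a helper like assemble_X above (all names hypothetical):

import numpy as np
import matplotlib.pyplot as plt

def sigma(z):
    # Logistic link: maps log-odds to interaction probability
    return 1.0 / (1.0 + np.exp(-z))

def plot_curve(theta_draws, X_bin, foi_grid, label):
    # One predicted probability curve per posterior draw: shape (n_runs, N)
    probs = sigma(theta_draws[:, :1] + theta_draws[:, 1:] @ X_bin.T)
    lower, upper = np.percentile(probs, [3, 97], axis=0)  # 94% credible band
    plt.plot(foi_grid, np.median(probs, axis=0), label=label)
    plt.fill_between(foi_grid, lower, upper, alpha=0.3)  # the transparent band

Calling plot_curve once per evidence bin, with the simulated FOI grid on the x-axis and a legend, would reproduce a figure like the one below.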

[image: predicted interaction probability vs. simulated FOI evidence, one curve per evidence bin, with transparent credible-interval bands]