Major Changes

Add bps_numerical.clustering.SamplingBasedClusterAnalyzer, which randomly replaces the original (selected) genes with other genes from the same cluster and then tests the model on the replaced features. (See the docstring for more detail.)
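
To illustrate the replacement idea only, here is a minimal sketch. It is not the actual SamplingBasedClusterAnalyzer implementation (the docstring is the reference). The helper name sample_replacements, the n_swaps parameter, and the representation of clusters as a dict mapping each selected gene to its cluster members are all assumptions made for this example.

```python
import random
from typing import Dict, List, Optional


def sample_replacements(
    selected: List[str],
    clusters: Dict[str, List[str]],
    n_swaps: int,
    seed: Optional[int] = None,
) -> List[str]:
    """Swap up to `n_swaps` selected genes for a random gene of the same cluster.

    Hypothetical helper: `clusters` maps each selected gene to the list of
    genes in its cluster (including the gene itself).
    """
    rng = random.Random(seed)
    # Only genes whose cluster has at least one other member can be swapped.
    swappable = [g for g in selected if len(clusters.get(g, [])) > 1]
    to_swap = set(rng.sample(swappable, min(n_swaps, len(swappable))))

    replaced = []
    for gene in selected:
        if gene in to_swap:
            candidates = [c for c in clusters[gene] if c != gene]
            replaced.append(rng.choice(candidates))
        else:
            # Genes that are not sampled keep their original column.
            replaced.append(gene)
    return replaced


clusters = {"g1": ["g1", "g2", "g3"], "g4": ["g4"], "g5": ["g5", "g6"]}
new_cols = sample_replacements(["g1", "g4", "g5"], clusters, n_swaps=2, seed=0)
```

If a model trained on the original columns scores similarly on such replaced columns, that suggests the genes within a cluster are interchangeable, which is the notion of cluster "goodness" this PR is after.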

Usage

import pandas as pd

from bps_numerical.clustering import CorrelationClusterer, SamplingBasedClusterAnalyzer
from bps_numerical.classification.classifiers import SinglePhenotypeClassifier
from bps_numerical.feature_selection import FirstFeatureSelector

# Feature selection using clustering
clusterer = CorrelationClusterer(
    list(df_genes.columns),
    cutoff_threshold=0.55,
    debug=False
)
fs = FirstFeatureSelector(clusterer=clusterer)

cols_genes = fs.select_features(df_genes)

# Generate low-dimensional data (selected genes merged with phenotype labels)
df_merged = merge_gene_phenotype(
    pd.concat([samples, df_genes[cols_genes]], axis=1),
    CSV_PHENOTYPE,
    "Sample",
)

# Final preparation for analysis:
# the analyzer trains a model and runs the tests accordingly
clf_condition = SinglePhenotypeClassifier(cols_genes, "condition", debug=False)
cluster_analyzer = SamplingBasedClusterAnalyzer(
    clusterer,
    __cols,
    clf_condition,
    n_replacement=500,
    max_sampling=4,
    debug=True,
)
analysis_results = cluster_analyzer.analyze(
    data_merged=df_merged,
    data_genes=df_genes,
)
SamplingBasedClusterAnalyzer.analyze_results(analysis_results["train_results"], analysis_results["eval_results"])

Minor Changes

bps_numerical.classification.classifiers.SinglePhenotypeClassifier.train(...) now includes the train/test indices of the split data.
- This might come in handy for downstream tasks (such as analyzing the clusters once a model is trained).
bps_numerical.misc.datatools.train_test_indexed_split is added; it returns the indices as well as the split data.
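
The signature of train_test_indexed_split is not shown in this description, so the following is only a rough sketch of the idea of returning indices alongside the split data. The function name indexed_split, its parameters, and its return order are assumptions, not the library's API.

```python
import random
from typing import List, Sequence, Tuple


def indexed_split(
    rows: Sequence, test_size: float = 0.2, seed: int = 42
) -> Tuple[List, List, List[int], List[int]]:
    """Return train rows, test rows, and the positional indices of each.

    Hypothetical stand-in for an index-aware train/test split.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    # Sort so that each subset keeps the original row order.
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test, train_idx, test_idx


rows = ["s0", "s1", "s2", "s3", "s4"]
train, test, train_idx, test_idx = indexed_split(rows, test_size=0.2, seed=0)
```

Keeping the indices makes it possible to go back to the full gene matrix later and pull out exactly the rows the model was trained or evaluated on.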

TODO

Document the overall pipeline in the README
SHAP-based feature analysis

cc: @xhagrg @muthukumaranR

NASA-IMPACT / bps-numerical

Analyze cluster "goodness" #5
