NASA-IMPACT / bps-numerical

Analyze cluster "goodness" #5

Closed NISH1001 closed 2 years ago

NISH1001 commented 2 years ago

Major Changes

Usage

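import pandas as pd

from bps_numerical.clustering import CorrelationClusterer, SamplingBasedClusterAnalyzer
from bps_numerical.classification.classifiers import SinglePhenotypeClassifier
from bps_numerical.feature_selection import FirstFeatureSelector

# Assumes `df_genes`, `samples`, `CSV_PHENOTYPE`, and `merge_gene_phenotype`
# are already defined or imported in the calling scope.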

# Feature selection using clustering
clusterer = CorrelationClusterer(
    list(df_genes.columns),
    cutoff_threshold=0.55,
    debug=False
)
fs = FirstFeatureSelector(clusterer=clusterer)

cols_genes = fs.select_features(df_genes)

# generate low-dimensional data
df_merged = merge_gene_phenotype(
    pd.concat([samples, df_genes[cols_genes]], axis=1),
    CSV_PHENOTYPE,
    "Sample",
)

# Final preparation for analysis.
# The analyzer trains a model and runs the tests accordingly.
clf_condition = SinglePhenotypeClassifier(cols_genes, "condition", debug=False)
cluster_analyzer = SamplingBasedClusterAnalyzer(
    clusterer,
    # `__cols` is not defined in this snippet; it must be set by the caller.
    __cols,
    clf_condition,
    n_replacement=500,
    max_sampling=4,
    debug=True,
)
analysis_results = cluster_analyzer.analyze(
    data_merged=df_merged,
    data_genes=df_genes,
)
SamplingBasedClusterAnalyzer.analyze_results(
    analysis_results["train_results"],
    analysis_results["eval_results"],
)
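
For readers unfamiliar with the feature-selection step in the usage above, here is a rough, self-contained sketch of the general idea: group columns by correlation against a cutoff, then keep the first column of each group. This is not the `bps_numerical` implementation. The function names `correlation_clusters` and `first_feature_per_cluster`, the greedy seeding strategy, and the toy data are all assumptions made for illustration; only the 0.55 cutoff is borrowed from the snippet.

```python
import pandas as pd


def correlation_clusters(df, cutoff=0.55):
    """Greedily group columns by absolute Pearson correlation (illustrative only).

    The first unassigned column seeds a cluster; every other unassigned column
    whose absolute correlation with the seed is >= cutoff joins that cluster.
    """
    corr = df.corr().abs()
    unassigned = list(df.columns)
    clusters = []
    while unassigned:
        seed = unassigned[0]
        # The seed is always a member, even if its self-correlation is NaN
        # (which happens for a constant column).
        members = [seed] + [
            col for col in unassigned[1:] if corr.loc[seed, col] >= cutoff
        ]
        clusters.append(members)
        unassigned = [col for col in unassigned if col not in members]
    return clusters


def first_feature_per_cluster(clusters):
    """Keep one representative per cluster: its first column."""
    return [cluster[0] for cluster in clusters]


# Toy data (made up): g1/g2 move together, as do g3/g4.
toy = pd.DataFrame(
    {
        "g1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "g2": [2.0, 4.0, 6.0, 8.0, 10.0, 12.5],
        "g3": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
        "g4": [2.0, -2.0, 2.0, -2.0, 2.0, -1.5],
    }
)
toy_clusters = correlation_clusters(toy, cutoff=0.55)
toy_selected = first_feature_per_cluster(toy_clusters)
print(toy_clusters)
print(toy_selected)
```

In this sketch the selected columns would then play the role of `cols_genes`, i.e. the reduced gene set passed to the classifier and the cluster analyzer.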

Minor Changes

TODO


cc: @xhagrg @muthukumaranR