Add bps_numerical.clustering.SamplingBasedClusterAnalyzer, which randomly replaces the original (selected) genes with other genes from the same cluster and then tests the model on the substituted data. (See the docstring for more detail.)
Usage
import pandas as pd
from bps_numerical.clustering import CorrelationClusterer, SamplingBasedClusterAnalyzer
from bps_numerical.classification.classifiers import SinglePhenotypeClassifier
from bps_numerical.feature_selection import FirstFeatureSelector
# Feature selection using clustering
clusterer = CorrelationClusterer(
    list(df_genes.columns),
    cutoff_threshold=0.55,
    debug=False
)
fs = FirstFeatureSelector(clusterer=clusterer)
cols_genes = fs.select_features(df_genes)
# Generate low-dimensional data restricted to the selected genes
df_merged = merge_gene_phenotype(
    pd.concat([samples, df_genes[cols_genes]], axis=1),
    CSV_PHENOTYPE,
    "Sample",
)
# Final preparation for the analysis:
# the analyzer trains a model and runs the replacement tests accordingly
clf_condition = SinglePhenotypeClassifier(cols_genes, "condition", debug=False)
cluster_analyzer = SamplingBasedClusterAnalyzer(
    clusterer,
    __cols,
    clf_condition,
    n_replacement=500,
    max_sampling=4,
    debug=True,
)
analysis_results = cluster_analyzer.analyze(
    data_merged=df_merged,
    data_genes=df_genes,
)
SamplingBasedClusterAnalyzer.analyze_results(
    analysis_results["train_results"],
    analysis_results["eval_results"],
)
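For reviewers who want the idea without reading the docstring, here is a minimal sketch of same-cluster replacement. It is not the code in this PR: the function name `sample_replacements`, the `clusters` mapping (selected gene to all members of its cluster), and the exact meaning given to `n_replacement` are assumptions made for illustration, and `max_sampling` is not modeled at all.

```python
import random
from typing import Dict, List, Optional


def sample_replacements(
    selected: List[str],
    clusters: Dict[str, List[str]],
    n_replacement: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Hypothetical helper: swap up to `n_replacement` selected genes for
    another gene drawn at random from the same cluster."""
    rng = rng or random.Random()

    def alternatives(gene: str) -> List[str]:
        # Other members of the gene's cluster (assumed mapping, see lead-in).
        return [g for g in clusters.get(gene, []) if g != gene]

    # Only genes that have at least one other cluster member can be replaced.
    replaceable = [i for i, g in enumerate(selected) if alternatives(g)]
    chosen = rng.sample(replaceable, min(n_replacement, len(replaceable)))

    result = list(selected)
    for i in chosen:
        result[i] = rng.choice(alternatives(selected[i]))
    return result


if __name__ == "__main__":
    clusters = {"A": ["A", "A1", "A2"], "B": ["B"], "C": ["C", "C1"]}
    print(sample_replacements(["A", "B", "C"], clusters, n_replacement=2))
```

The substituted column list would then be used to pull the corresponding expression columns and evaluate the already trained model. If the correlation clusters are meaningful, performance should hold up under such swaps.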
Minor Changes
The bps_numerical.classification.classifiers.SinglePhenotypeClassifier.train(...) method now also returns the train/test indices of the split data.
This may come in handy for downstream tasks (such as analyzing the clusters once a model has been trained).
bps_numerical.misc.datatools.train_test_indexed_split is added; it returns the indices along with the split data.
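To illustrate what "returning indices as well as split data" means, below is a small standard-library sketch. It is not the implementation of `train_test_indexed_split`: the name `indexed_split`, its parameters, and its return order are assumptions chosen for the example.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def indexed_split(
    data: Sequence[T], test_size: float = 0.2, seed: int = 42
) -> Tuple[List[T], List[T], List[int], List[int]]:
    """Hypothetical helper: split `data` and also return the row positions
    that ended up in the train and test parts."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle

    n_test = int(round(len(data) * test_size))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])

    train = [data[i] for i in train_idx]
    test = [data[i] for i in test_idx]
    return train, test, train_idx, test_idx


if __name__ == "__main__":
    train, test, train_idx, test_idx = indexed_split(list("abcdefghij"), 0.3)
    print(train_idx, test_idx)
```

Keeping the indices lets a later step, such as the cluster analysis, select the same test rows from a different table (for example the full gene matrix) without re-splitting.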
cc: @xhagrg @muthukumaranR