epigen / unsupervised_analysis

A general purpose Snakemake workflow and MrBiomics module to perform unsupervised analyses (dimensionality reduction & cluster analysis) and visualizations of high-dimensional data.
https://epigen.github.io/unsupervised_analysis/
MIT License

address slow heatmaps #4

Closed sreichl closed 4 months ago

sreichl commented 1 year ago

Define "too large", e.g., >10,000 samples/cells?

ideas

sreichl commented 4 months ago

Two options

sreichl commented 4 months ago

Tried fastdist but abandoned it because of an error when using the correlation metric (traceback below). It is also probably less stable than scipy, although faster.

Traceback (most recent call last):
  File "/research/home/sreichl/projects/unsupervised_analysis/.snakemake/scripts/tmpj83a3tj1.distance_matrix.py", line 43, in <module>
    dist_mtx = fastdist.matrix_pairwise_distance(data_np, metric_function, metric, return_matrix=True) 
ZeroDivisionError: division by zero
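For comparison, a minimal sketch of the same pairwise correlation distance using scipy. The random `data_np` below is only a stand-in for the real input, and this is not necessarily the call the workflow ended up with.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Stand-in data (assumption): 5 observations x 8 features.
rng = np.random.default_rng(0)
data_np = rng.normal(size=(5, 8))

# pdist returns the condensed (upper-triangle) distances;
# squareform expands them into a full symmetric matrix.
dist_mtx = squareform(pdist(data_np, metric="correlation"))
```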

sreichl commented 4 months ago

Observation downsampling: random, with a fixed random seed, down to 1,000 observations?

Feature cutoff by variability: keep the 10k most variable features?

Or make both configurable parameters? Sample_proportion and highly_variable_feature_proportion
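A rough sketch of what the two ideas could look like together, assuming the data is a pandas DataFrame with observations as rows and features as columns. The function name `downsample` and its arguments are hypothetical, not the workflow's actual code.

```python
import numpy as np
import pandas as pd


def downsample(data: pd.DataFrame, n_observations: int, n_features: int,
               seed: int = 42) -> pd.DataFrame:
    """Randomly subsample rows and keep the most variable columns.

    Assumes rows are observations and columns are features.
    """
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility

    # Random observation subset without replacement, kept in original order.
    n_obs = min(n_observations, data.shape[0])
    obs_idx = np.sort(rng.choice(data.shape[0], size=n_obs, replace=False))

    # Features ranked by variance across all observations.
    n_feat = min(n_features, data.shape[1])
    top_features = data.var(axis=0).nlargest(n_feat).index

    return data.iloc[obs_idx].loc[:, top_features]
```

With the numbers floated above this would be called as `downsample(data, 1000, 10000)`; inputs already smaller than those limits are returned at full size.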

sreichl commented 4 months ago

Downsampling is done in the distance matrix step.

Heatmap script: filter data and metadata to the downsampled observations/features.
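The filtering in the heatmap script might look roughly like this. It assumes the downsampled distance matrices are square DataFrames whose index holds the kept observation and feature labels; `filter_for_heatmap` and its argument names are hypothetical.

```python
import pandas as pd


def filter_for_heatmap(data: pd.DataFrame, metadata: pd.DataFrame,
                       dist_obs: pd.DataFrame, dist_feat: pd.DataFrame):
    """Subset data and metadata to the labels kept after downsampling.

    Assumes each distance matrix is indexed by the labels it was computed on.
    """
    kept_obs = dist_obs.index
    kept_feat = dist_feat.index
    # Reorder/subset rows and columns so they match the distance matrices.
    return data.loc[kept_obs, kept_feat], metadata.loc[kept_obs]
```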

sreichl commented 4 months ago

Both config parameters accept either a float in 0-1 (a proportion) or an int (the absolute number of observations/features to downsample to).
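One way the float-or-int convention could be resolved into a concrete count, as a sketch only. The helper `resolve_n` is hypothetical, and the edge-case choices (a float of 1.0 means "everything", an int of 1 means "one item", 0 is rejected, ints above the total are capped) are assumptions rather than documented behaviour.

```python
def resolve_n(value, total: int) -> int:
    """Turn a proportion (float) or absolute count (int) into a count."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("expected float or int, got bool")
    if isinstance(value, float):
        if not 0 < value <= 1:
            raise ValueError("proportion must be in (0, 1]")
        # Keep at least one item even for very small proportions.
        return max(1, int(round(value * total)))
    if isinstance(value, int):
        if value < 1:
            raise ValueError("absolute number must be >= 1")
        # Never ask for more items than are available.
        return min(value, total)
    raise TypeError("expected float or int")
```

For example, `resolve_n(0.1, 1000)` gives 100, while `resolve_n(1000, 500)` is capped at 500.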