address slow heatmaps - GitHub issues

sreichl commented 1 year ago

Define "too large": e.g., >10,000 samples/cells?

ideas

- For large data (threshold still to be defined), do not plot heatmaps of features and data; instead compute distance matrices and plot those, using the given metric, e.g., correlation.
- Do not plot at all if either observations or dimensions exceed, e.g., 50k.
- Use Heatgraphy, a new visualization package for multi-dimensional data.
  - GitHub repo: https://github.com/Heatgraphy/heatgraphy
  - Documentation: https://heatgraphy.readthedocs.io/en/latest/index.html
  - Web version: https://heatgraphy.streamlit.app/
- Fast distance matrix computation (which metric?).
- (favorite) Downsample each group to the size of the smallest group provided in the metadata, with a minimum of 100 or 10? `min(100, table(metadata$column))`
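The group-balanced downsampling idea could look roughly like the following in Python with pandas. This is a minimal sketch, not the workflow's implementation: the function name `downsample_balanced`, the `cap` and `seed` parameters, and the toy `cell_type` column are all hypothetical, and the cap of 100 simply mirrors the R expression above.

```python
import pandas as pd


def downsample_balanced(metadata, column, cap=100, seed=42):
    """Sample the same number of observations from every group in `column`."""
    # Per-group size: the smallest group size, capped at `cap`
    # (mirrors the R expression min(100, table(metadata$column))).
    n_per_group = min(cap, int(metadata[column].value_counts().min()))
    sampled = metadata.groupby(column, group_keys=False).sample(
        n=n_per_group, random_state=seed
    )
    return sampled.index


# Toy metadata (hypothetical): three groups of unequal size.
metadata = pd.DataFrame(
    {"cell_type": ["A"] * 500 + ["B"] * 150 + ["C"] * 40},
    index=[f"cell_{i}" for i in range(690)],
)
keep = downsample_balanced(metadata, "cell_type")
balanced_counts = metadata.loc[keep, "cell_type"].value_counts()
```

With the toy metadata, the smallest group has 40 members, so each group contributes 40 observations.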

sreichl commented 4 months ago

Two options

- Downsample if observations/features > 50k
  - downsample observations to the size of the smallest group provided in the metadata, with a minimum of 100 or 10? `min(100, table(metadata$column))`
  - downsample features to the "most variable"
- Precompute
  - distance matrix
    - in Python with scipy.spatial.distance.pdist (supports 22 metrics) or the even faster fastdist
    - in R with Rfast (supports 21 distance metrics, p. 80 of the reference)
  - hierarchical clustering with fastcluster (available for R and Python, and probably equally fast in both, as both implementations are essentially C++): https://danifold.net/fastcluster.html
  - plot using ggplot2 `geom_tile` (without dendrogram) or provide ComplexHeatmap with precomputed dendrograms
```
library(fastcluster)
library(ComplexHeatmap)

data <- matrix(rnorm(10000), nrow = 1000)

# Compute the distance matrix
dist_matrix <- dist(data)

# Perform hierarchical clustering
row_hc <- fastcluster::hclust(dist_matrix, method = "complete")
col_hc <- fastcluster::hclust(dist(t(data)), method = "complete")

Heatmap(data, cluster_rows = as.dendrogram(row_hc), cluster_columns = as.dendrogram(col_hc))
```
- [x] check where the most options in terms of `distance metrics` and `hierarchical clustering methods` are provided.
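For comparison, the same precompute idea on the Python side might be sketched as below, using `scipy.spatial.distance.pdist` for the distances and `scipy.cluster.hierarchy.linkage` for the clustering. The toy data, the choice of the correlation metric, and complete linkage are assumptions for illustration only. fastcluster's Python package offers a `linkage` function with a SciPy-compatible interface, so it could be swapped in here.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist, squareform

# Toy data (hypothetical): 200 observations x 10 features.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 10))

# Condensed pairwise distances between observations (rows).
condensed = pdist(data, metric="correlation")

# Hierarchical clustering on the precomputed distances.
Z = linkage(condensed, method="complete")

# Reorder the square distance matrix by dendrogram leaf order,
# so it can be plotted as a heatmap without re-clustering.
order = leaves_list(Z)
dist_sq = squareform(condensed)
dist_ordered = dist_sq[np.ix_(order, order)]
```

The reordered matrix can then be handed to any tile-based plotting function, which avoids clustering inside the plotting step.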

sreichl commented 4 months ago

Tried fastdist but abandoned it due to an error when using the correlation metric. It is probably not as stable as scipy anyway (although faster).

Traceback (most recent call last):
  File "/research/home/sreichl/projects/unsupervised_analysis/.snakemake/scripts/tmpj83a3tj1.distance_matrix.py", line 43, in <module>
    dist_mtx = fastdist.matrix_pairwise_distance(data_np, metric_function, metric, return_matrix=True) 
ZeroDivisionError: division by zero
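The traceback does not show which input triggered the division by zero. One possible cause, stated here as an assumption rather than a diagnosis, is an observation with zero variance: Pearson correlation divides by the standard deviation, so it is undefined for constant rows. A minimal sketch of filtering such rows before computing correlation distances with scipy (the toy array is hypothetical):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Toy data (hypothetical): the second row is constant.
data_np = np.array(
    [
        [1.0, 2.0, 3.0, 4.0],
        [2.0, 2.0, 2.0, 2.0],  # zero variance: correlation is undefined
        [4.0, 3.0, 1.0, 2.0],
    ]
)

# Keep only rows whose standard deviation is non-zero.
has_variance = data_np.std(axis=1) > 0
filtered = data_np[has_variance]

# Correlation distance is now well defined for every remaining pair.
dist_condensed = pdist(filtered, metric="correlation")
```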

sreichl commented 4 months ago

Observation downsampling: random, with a fixed random seed, to 1000?

Feature cutoff by variability: keep the top 10k?

Or both as configurable parameters? E.g., Sample_proportion and highly_variable_feature_proportion.
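A rough sketch of these two downsampling steps in Python with pandas follows. The function name `downsample`, its defaults, and the order of operations (observations first, then feature variance computed on the remaining rows) are assumptions, not the workflow's actual behavior.

```python
import numpy as np
import pandas as pd


def downsample(data, n_obs=1000, n_features=10_000, seed=42):
    """Randomly subsample rows, then keep the most variable columns."""
    # Randomly pick observations (rows) with a fixed seed for reproducibility.
    if data.shape[0] > n_obs:
        data = data.sample(n=n_obs, random_state=seed)
    # Keep the most variable features (columns), ranked by variance.
    if data.shape[1] > n_features:
        top = data.var(axis=0).nlargest(n_features).index
        data = data[top]
    return data


# Toy data (hypothetical): 50 observations x 20 features.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(50, 20)), columns=[f"gene_{i}" for i in range(20)]
)
small = downsample(toy, n_obs=10, n_features=5)
small_again = downsample(toy, n_obs=10, n_features=5)
```

Because the seed is fixed, repeated calls return the same subset.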

sreichl commented 4 months ago

Downsampling is done in the distance matrix step.

Heatmap script: filter data & metadata for downsampled observations/features
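The filtering step in the heatmap script could amount to a label-based subset. In this sketch, the variable names and the idea that the kept observation/feature labels are passed along from the distance matrix step are assumptions:

```python
import pandas as pd

# Toy inputs (hypothetical): full data and metadata, indexed by observation.
data = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    index=["obs_a", "obs_b", "obs_c", "obs_d"],
    columns=["f1", "f2", "f3"],
)
metadata = pd.DataFrame(
    {"group": ["x", "x", "y", "y"]}, index=data.index
)

# Labels that survived downsampling in the distance matrix step (assumed).
kept_obs = ["obs_d", "obs_a"]
kept_features = ["f3", "f1"]

# Subset data and metadata with the same observation labels, in the same order.
data_sub = data.loc[kept_obs, kept_features]
metadata_sub = metadata.loc[kept_obs]
```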

sreichl commented 4 months ago

Both config parameters accept either a float between 0 and 1 as a proportion, or an int as the absolute number of observations/features to downsample to.
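One way to interpret such a float-or-int value is sketched below. The helper name `resolve_size`, the rounding, and the capping at the total are assumptions about how the parameter might be handled, not the workflow's code.

```python
def resolve_size(value, total):
    """Translate a config value into an absolute count, capped at `total`."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("expected float or int, got bool")
    if isinstance(value, float):
        # Float: interpreted as a proportion of the total.
        if not 0.0 <= value <= 1.0:
            raise ValueError("proportion must be between 0 and 1")
        return int(round(value * total))
    if isinstance(value, int):
        # Int: interpreted as an absolute number, never more than available.
        if value < 1:
            raise ValueError("absolute number must be at least 1")
        return min(value, total)
    raise TypeError("expected float or int")


n_from_proportion = resolve_size(0.1, 50_000)
n_from_absolute = resolve_size(1000, 50_000)
n_capped = resolve_size(1000, 300)
```

Note that under this reading the type matters at the boundary: `1.0` means "keep everything", while `1` means "keep a single observation/feature".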

epigen / unsupervised_analysis

address slow heatmaps #4
