Suggestion related to cluster IDs given #7

lcolladotor commented 3 years ago

Hi Aaron,

Currently .limit_cluster_size() uses https://github.com/LTLA/scuttle/blob/3cb28efbf237ecb9b7a1973ab7e8957371260a32/R/pooledSizeFactors.R#L461

such that if you provide a set of input clusters that are a factor as those produced by quickCluster(), it'll re-order the clusters for you as shown with the example below. If we switch from unique(clusters) to sort(unique(clusters))(or potentially levels(clusters) if we can assume that it's a factor at that point or levels(factor(clusters))) then we avoid this re-ordering (implemented in .limit_cluster_size.fix() below).

> factor(rep(c(2, 3, 1), c(5, 4, 3)))
 [1] 2 2 2 2 2 3 3 3 3 1 1 1
Levels: 1 2 3
> unique(factor(rep(c(2, 3, 1), c(5, 4, 3))))
[1] 2 3 1
Levels: 1 2 3
> levels(factor(rep(c(2, 3, 1), c(5, 4, 3))))
[1] "1" "2" "3"
> scuttle:::.limit_cluster_size(factor(rep(c(2, 3, 1), c(5, 4, 3))), 10)
 [1] 1 1 1 1 1 2 2 2 2 3 3 3
Levels: 1 2 3
> .limit_cluster_size.fix(factor(rep(c(2, 3, 1), c(5, 4, 3))), 10)
 [1] 2 2 2 2 2 3 3 3 3 1 1 1
Levels: 1 2 3

The relevance of this is that currently the warning mentioned at https://github.com/LTLA/scuttle/issues/7#issuecomment-778710244 (introduced at https://github.com/LTLA/scuttle/commit/0ed602b33c839b9cad770d5871b7436ebe993caf) gives a cluster ID that doesn't actually match the input cluster from the user.

Though hm... having said that, if any cluster exceeds max.cluster.size the change I'm suggesting won't really help a user. Hm....

Extra

Here's a quick check that I used to verify that indeed the clusters didn't change in my case (given that none exceed the max.cluster.size).

## checking the quickCluster() output (saved as spe$scran_quick_cluster)
## against the clusters after running the current .limit_cluster_size() internal code
m_table <- data.frame(
    original = seq_len(length(unique(spe$scran_quick_cluster)))
)
m_table$new <- match(m_table$original, as.integer(unique(spe$scran_quick_cluster)))
identical(factor(m_table$new[as.integer(spe$scran_quick_cluster)]), clusters)
# [1] TRUE

## The warning mentions cluster 64 but it's actually the cluster 85 from the original input
m_table[m_table$new == 64, ]
#    original new
# 85       85  64

LTLA commented 3 years ago

Yes, the internal cluster creation could preserve the original cluster names, e.g., by appending -1 to the end or something.

LTLA commented 3 years ago

This is now what is done; if any cluster exceeds the max.size, they all get -X slapped on them. Those clusters that don't exceed the max size get -1 pasted on the back, those that do get -1, -2, etc.

It is necessary to modify the names of the small clusters in case the input clusters have names like A-1. For example, if a large cluster is named A and a small cluster is named A-1, then if I only add -1 to the former, I'd get a name conflict with the latter.

LTLA / scuttle

Suggestion related to cluster IDs given #7 #8

Extra