Closed: lcolladotor closed this issue 3 years ago
Yes, the internal cluster creation could preserve the original cluster names, e.g., by appending `-1` to the end or something.
This is now what is done: if any cluster exceeds the `max.size`, they all get `-X` slapped on them. Those clusters that don't exceed the max size get `-1` pasted on the back; those that do get `-1`, `-2`, etc.
It is necessary to modify the names of the small clusters in case the input clusters have names like `A-1`. For example, if a large cluster is named `A` and a small cluster is named `A-1`, then if I only add `-1` to the former, I'd get a name conflict with the latter.
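The name conflict described above can be reproduced with a small sketch. The cluster labels, the `max.size` value, and the sequential chunking used to split a large cluster are all illustrative assumptions here; they stand in for whatever splitting scuttle does internally.

```r
# Hypothetical input: a large cluster "A" (5 cells) and a small cluster "A-1" (2 cells).
clusters <- c(rep("A", 5), rep("A-1", 2))
max.size <- 3  # illustrative threshold, not a scuttle default

# Naive scheme: only rename the cluster that exceeds max.size.
naive <- clusters
is.big <- clusters == "A"
# Assign the "A" cells to sequential chunks of at most max.size cells.
chunk <- ceiling(seq_len(sum(is.big)) / max.size)
naive[is.big] <- paste0("A-", chunk)
table(naive)  # the first chunk of "A" is now merged with the original "A-1"

# Safer scheme: every cluster gets a suffix, whether or not it was split.
safe <- clusters
for (id in unique(clusters)) {
    idx <- which(clusters == id)
    chunk <- ceiling(seq_along(idx) / max.size)
    safe[idx] <- paste0(id, "-", chunk)
}
table(safe)  # "A-1", "A-2" and "A-1-1" stay distinct
```

In the naive version, three cells from `A` end up sharing the label `A-1` with the two cells of the original small cluster, which is exactly the conflict that suffixing every cluster avoids.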
Hi Aaron,
Currently `.limit_cluster_size()` uses https://github.com/LTLA/scuttle/blob/3cb28efbf237ecb9b7a1973ab7e8957371260a32/R/pooledSizeFactors.R#L461 such that if you provide a set of input `clusters` that are a `factor`, as those produced by `quickCluster()`, it'll re-order the clusters for you, as shown with the example below. If we switch from `unique(clusters)` to `sort(unique(clusters))` (or potentially `levels(clusters)` if we can assume that it's a `factor` at that point, or `levels(factor(clusters))`) then we avoid this re-ordering (implemented in `.limit_cluster_size.fix()` below).
The relevance of this is that currently the warning mentioned at https://github.com/LTLA/scuttle/issues/7#issuecomment-778710244 (introduced at https://github.com/LTLA/scuttle/commit/0ed602b33c839b9cad770d5871b7436ebe993caf) gives a cluster ID that doesn't actually match the input cluster from the user.
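To make the ordering difference concrete, here is a minimal base R sketch on a made-up factor (the values are hypothetical, not actual `quickCluster()` output). It only illustrates how `unique()`, `sort(unique())` and `levels()` order a factor's values; it is not the scuttle code itself.

```r
# Hypothetical cluster assignments stored as a factor; the first cell is in cluster "3".
clusters <- factor(c("3", "1", "2", "1", "3", "2"))

# unique() keeps the order of first appearance in the data.
by.appearance <- as.character(unique(clusters))
by.appearance

# sort(unique()) and levels() both follow the factor's level order.
by.level <- as.character(sort(unique(clusters)))
by.level
levels(clusters)

# If clusters are later referred to by their position in this vector,
# the two orderings disagree about which cluster is "first".
match("1", by.appearance)
match("1", by.level)
```

One difference between the alternatives: `levels(clusters)` also returns levels that have no cells, whereas `sort(unique(clusters))` and `levels(factor(clusters))` only return levels that are actually present.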
Though hm... having said that, if any cluster exceeds `max.cluster.size`, the change I'm suggesting won't really help a user. Hm....
Extra
Here's a quick check that I used to verify that indeed the clusters didn't change in my case (given that none exceed the `max.cluster.size`).