MarioniLab / MammaryGland


Speed up bootstrapping #11

Open LTLA opened 7 years ago

LTLA commented 7 years ago

You could probably speed up the bootstrapping by supplying the distance matrix to clusterboot.
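A minimal sketch of what that might look like, assuming hierarchical clustering with average linkage. The simulated matrix, `k=3` and `B=20` are placeholders rather than values from the actual analysis. The idea is that the distances are computed once up front, so each bootstrap iteration only has to subset them.

```r
library(fpc)

set.seed(100)
# Placeholder data: 20 genes (rows) by 100 cells (columns).
expr <- matrix(rnorm(20 * 100), nrow=20)

# Compute cell-cell distances a single time.
d <- dist(t(expr))

# Pass the 'dist' object directly; disthclustCBI clusters on distances.
out <- clusterboot(d, B=20, clustermethod=disthclustCBI,
    method="average", k=3, count=FALSE)

# Mean Jaccard coefficient for each of the k clusters.
out$bootmean
```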

LTLA commented 7 years ago

You could also shuffle expression values and see what a "null" Jaccard coefficient would be:

library(fpc) # provides clusterboot and disthclustCBI
original <- matrix(runif(10000), ncol=1000) # your original expression matrix
shuffled <- sample(original) # permute all values to destroy any real structure
dim(shuffled) <- dim(original)
out <- clusterboot(dist(t(shuffled)), clustermethod=disthclustCBI, method="average", k=5)

This gives me Jaccard coefficients from 0.1 to 0.3.

LTLA commented 7 years ago

Having looked at the theory behind fpc, I'm not sure that their bootstrapping strategy is sensible. They resample cells to create a new bootstrap sample. On the surface, this is reasonable enough, as the resampled set can be considered a replicate of the original population. However, if you did have two replicate experiments, you would never consider a cell from the first experiment to be exactly the same as one of the cells in the second experiment. Calculating the Jaccard index between replicate experiments therefore makes no sense, which means that you shouldn't do it with bootstrap replicates either, if the statistical meaning of the bootstrap is to be preserved.
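To make the objection concrete, here is a small sketch of how a Jaccard index between two clusters is computed from shared cell identities. The `jaccard` helper and the cell names are made up for illustration and are not taken from fpc.

```r
# Jaccard index between two clusters, each given as a vector of cell identifiers.
jaccard <- function(a, b) {
    length(intersect(a, b)) / length(union(a, b))
}

# A bootstrap sample reuses cells from the original data, so identities overlap.
original.cluster <- c("cell1", "cell2", "cell3", "cell4")
bootstrap.cluster <- c("cell3", "cell4", "cell5")
jaccard(original.cluster, bootstrap.cluster) # 2 shared / 5 in the union = 0.4

# Cells in a true replicate experiment share no identities with the original,
# so the index is zero regardless of how similar the clusters are.
replicate.cluster <- c("rep1", "rep2", "rep3")
jaccard(original.cluster, replicate.cluster) # 0
```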

The relevant questions should instead be population-level, e.g., is there a cluster in the replicate experiment that is equivalent to my original cluster in terms of size, expression pattern, etc.? If I managed to find a Csn2-high cluster in all of my replicates, I would be pretty confident that the cluster is reproducible. (Whether it constitutes a separate cell type, though, is another question entirely, and not something that is answerable with bootstrapping.) However, this approach would require some effort to implement; perhaps something for the next scran release.
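One possible way to phrase such a population-level comparison is sketched below. It is only an illustration on simulated data and not anything implemented in scran: the `simulate` and `cluster.means` helpers, the choice of Euclidean distance between cluster mean profiles, and `k=2` are all assumptions. Each replicate is clustered independently, and clusters are then matched by their average expression profile rather than by cell identity.

```r
set.seed(1)
ngenes <- 20

# Hypothetical replicate: the first half of the cells is "high" for genes 1-5.
simulate <- function(ncells) {
    mat <- matrix(rnorm(ngenes * ncells), nrow=ngenes)
    high <- seq_len(ncells) <= ncells / 2
    mat[1:5, high] <- mat[1:5, high] + 5
    mat
}
rep1 <- simulate(100)
rep2 <- simulate(100)

# Cluster one replicate and return the mean expression profile of each cluster
# (genes in rows, clusters in columns).
cluster.means <- function(mat, k) {
    clusters <- cutree(hclust(dist(t(mat)), method="average"), k=k)
    sapply(split(seq_len(ncol(mat)), clusters),
        function(i) rowMeans(mat[, i, drop=FALSE]))
}
means1 <- cluster.means(rep1, k=2)
means2 <- cluster.means(rep2, k=2)

# Distances between cluster profiles: rows are replicate 1, columns replicate 2.
dists <- as.matrix(dist(t(cbind(means1, means2))))[1:2, 3:4]

# For each cluster in replicate 1, the closest cluster in replicate 2.
best.match <- apply(dists, 1, which.min)
best.dist <- apply(dists, 1, min)
best.match
best.dist
```

A cluster from the first replicate that has a close counterpart in every other replicate would count as reproducible under this scheme. Cluster sizes could be compared in the same way.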

In the meantime, silhouette plots are the way to go. They should look nice for this data set.
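A minimal sketch of a silhouette plot using the cluster package, again on simulated placeholder data with two populations. The clustering method and `k=2` are assumptions made for the example.

```r
library(cluster)

set.seed(2)
# Placeholder data: 20 genes by 100 cells, with the first 50 cells shifted
# upwards in genes 1-5 to mimic a distinct population.
expr <- matrix(rnorm(20 * 100), nrow=20)
expr[1:5, 1:50] <- expr[1:5, 1:50] + 5

d <- dist(t(expr))
clusters <- cutree(hclust(d, method="average"), k=2)

# Silhouette widths per cell: values near 1 indicate well-separated cells,
# values near 0 or below indicate cells lying between clusters.
sil <- silhouette(clusters, d)
plot(sil)

# Average silhouette width across all cells.
summary(sil)$avg.width
```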

jcmarioni commented 7 years ago

This is an interesting point. Can we discuss this more on Monday, Aaron? It depends (I think) quite a lot on what set of cells you use as input. I have a feeling we might be okay here, but would like to discuss more…


John Marioni PhD Research Group Leader, EMBL-EBI Associate Faculty Member, WT Sanger Institute Wellcome Genome Campus CB10 1SD, UK

Senior Group Leader CRUK Cambridge Institute University of Cambridge CB2 0RE, UK

(In reply to Aaron Lun's comment above, posted on 21 Apr 2017 at 23:17.)