KrishnaswamyLab / SAUCIE

SAUCIE Produces Arbitrary Clusters #13

Closed biskra closed 5 years ago

biskra commented 5 years ago

Hello,

First off, I want to congratulate Matthew Amodio for getting his paper accepted in Nature Methods. SAUCIE has the potential to make a major impact in CyTOF data analysis, and I am very excited to be using this software!

I am characterizing a dataset with SAUCIE that is composed of 780,000 cells with 32 markers. I am currently running the following settings:

saucie = SAUCIE(data.shape[1], layers=[512, 256, 128, 2], layer_c=2, lambda_c=0.1, lambda_d=0.2, limit_gpu_fraction=0.5)

I am training for 40,000 steps with a minibatch size of 256. The CyTOF data has been normalized, compensated, debarcoded, and arcsinh5 transformed.
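For readers unfamiliar with the last preprocessing step, here is a minimal sketch of an arcsinh transform. It assumes that "arcsinh5" means dividing each intensity by a cofactor of 5 before applying arcsinh; the function name `arcsinh_transform` and the sample values are illustrative and not part of SAUCIE.

```python
import numpy as np

def arcsinh_transform(counts, cofactor=5.0):
    """Divide raw marker intensities by a cofactor, then apply arcsinh.

    Assumption: "arcsinh5" refers to a cofactor of 5.
    """
    counts = np.asarray(counts, dtype=float)
    return np.arcsinh(counts / cofactor)

# Made-up intensities for one cell and three markers.
raw = np.array([[0.0, 5.0, 500.0]])
transformed = arcsinh_transform(raw)
```

The transform is roughly linear near zero and logarithmic for large values, which compresses the dynamic range of bright markers.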

I am concerned that my clustering parameters are not well tuned or appropriate.

My clustering results look like the following:

[Image: SAUCIE clusters on a UMAP embedding] (Pardon the axes, I just noticed this as I uploaded.)

[Image: SAUCIE clusters on the SAUCIE embedding]

The clusters seem to be arbitrary with respect to the data manifold and ignore major features in the cells.

What am I missing and how can I best optimize SAUCIE to work with my data? What are the recommended optimization steps for SAUCIE?

Thank you!

Brian

mattamodio commented 5 years ago

Thank you for your kind words and for using SAUCIE! The clusters it is currently identifying are coherent blocks of similar cells on the data manifold, which I would say is a reasonable thing for an unsupervised clustering method to do. However, unlike in, say, scRNA-seq, in CyTOF we know a few things about the column space (markers) a priori that we can take advantage of. For example, right now it might be separating a group of cells based on variation in a marker that we know does not distinguish cell types. But in the data, this marker variation is on equal footing with any other marker variation, including variation in markers that do distinguish cell types.

There are many ways to alleviate this dilemma. First, as we do in our paper, you can gate the cells and cluster each cell type separately (we clustered T cells and analyzed their clusters). Another option would be to binarize some markers that act like +/- switches and are used as lineage markers (for example, by setting all + cells to some high constant like 20 and all - cells to 0). This would result in a very high penalty for placing cells that are positive and negative for that marker in the same cluster, but it would still allow all cells to be run at once. Another option, which we plan on including as a feature in a future release, is to tell the network which columns correspond to lineage markers and upweight their importance in the loss function calculation.
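As a rough illustration of the binarization and upweighting ideas above, here is a minimal NumPy sketch. The function names, the column indices, and the positivity thresholds are all hypothetical (in practice the thresholds would come from gating), and `weighted_squared_error` only shows the general idea of per-column weights; it is not SAUCIE's actual loss.

```python
import numpy as np

def binarize_markers(data, columns, thresholds, high=20.0):
    """Return a copy of `data` where each listed column is set to `high`
    for cells above its threshold (+) and to 0 for all other cells (-)."""
    out = np.array(data, dtype=float, copy=True)
    for col, thr in zip(columns, thresholds):
        out[:, col] = np.where(out[:, col] > thr, high, 0.0)
    return out

def weighted_squared_error(x, x_hat, weights):
    """Mean squared error in which each column is scaled by its own weight."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.mean(weights * (x - x_hat) ** 2))

# Three made-up cells, two markers; column 0 plays the lineage marker.
cells = np.array([[0.1, 3.2],
                  [4.0, 2.9],
                  [0.3, 3.1]])
binarized = binarize_markers(cells, columns=[0], thresholds=[2.0])

# Upweight column 0 so that errors on it count twice as much.
err = weighted_squared_error([[1.0, 1.0]], [[0.0, 0.0]], weights=[2.0, 1.0])
```

After binarization, positive and negative cells differ by 20 units in the lineage column, which dominates any distance computed on the remaining arcsinh-scaled markers.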

In other words, naive unsupervised clustering of CyTOF data with a standard distance function ignores that some columns mean very different things in this type of data. That being said, here are some hyperparameter tuning steps you could try: leave layer_c at its default of 0, since clustering on a layer closer to the input can sometimes help; increase lambda_c slightly, to 0.11-0.13, to get a slightly coarser-grained clustering; and increase the bottleneck layer's dimension to 5 or 10 when obtaining clusters (compressing a large number of cells into a small space can be difficult, and this gives the network more room), then run SAUCIE a second time for visualization.
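The tuning suggestions above could be collected into two sets of constructor arguments, one per run. This sketch uses only the keyword names that appear in the original call earlier in the thread; the helper name `tuned_saucie_kwargs`, the specific values chosen within the suggested ranges, and the decision to leave the regularizers unset for the visualization run are assumptions.

```python
def tuned_saucie_kwargs(bottleneck_dim=10, lambda_c=0.12, lambda_d=0.2):
    """Build keyword arguments for a clustering run and a visualization run."""
    # Clustering run: wider bottleneck, layer_c left at 0,
    # lambda_c nudged up from 0.1 for coarser clusters.
    clustering = {
        "layers": [512, 256, 128, bottleneck_dim],
        "layer_c": 0,
        "lambda_c": lambda_c,
        "lambda_d": lambda_d,
    }
    # Visualization run: 2-D bottleneck. Leaving the clustering
    # regularizers unset here is an assumption, not a recommendation
    # from the thread.
    visualization = {"layers": [512, 256, 128, 2]}
    return clustering, visualization

clustering_kwargs, visualization_kwargs = tuned_saucie_kwargs()

# Intended use, mirroring the call from the original post:
#   saucie = SAUCIE(data.shape[1], **clustering_kwargs)
```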

Hopefully this helped! Given that the clusters seem to be pretty coherent and contiguous but are distinguishing cell similarities/differences other than cell type, I think the most likely step to help would be gating, but if not hopefully one of the other suggestions helps!

biskra commented 5 years ago

Matt,

After looking at a heatmap of median expression values, the clustering started to look more coherent to me. I believe my initial impression of these clusters on UMAP reflected the limitations of probing with embeddings rather than looking at the quantitative data directly. In sum, I'm really loving the results I'm getting from SAUCIE!
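For anyone wanting to reproduce this kind of check, here is a minimal sketch of computing the per-cluster median expression matrix that such a heatmap would display. The function name `cluster_medians` and the toy data are illustrative; `labels` stands in for whatever cluster assignments SAUCIE returned.

```python
import numpy as np

def cluster_medians(data, labels):
    """Return the sorted cluster ids and a (clusters x markers) median matrix."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    medians = np.vstack(
        [np.median(data[labels == c], axis=0) for c in cluster_ids]
    )
    return cluster_ids, medians

# Five made-up cells, two markers, two clusters.
data = np.array([[0.0, 4.0],
                 [2.0, 6.0],
                 [10.0, 1.0],
                 [12.0, 3.0],
                 [14.0, 5.0]])
labels = np.array([0, 0, 1, 1, 1])
ids, medians = cluster_medians(data, labels)
```

Each row of `medians` is one cluster and each column one marker, so the matrix can be passed directly to a heatmap plotting function.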

I've thought a lot about the concept of lineage markers vs unbiased clustering, and I think lineage markers are mostly important for immunologists in the context of concordance with historical flow cytometry findings. I also think that this aids in meta-clustering and can lead to obscuring of the data manifold, which may be a mechanism to better analyze datasets with strong batch effects. I don't think there are many high-depth, CyTOF-friendly batch correction approaches either that would be amenable to such a sensitive analysis (again, thanks for making SAUCIE). Ultimately, I'm happy working without lineage markers, because I'm not working within a strict immunological context.

Thank you for the advice on tuning! I think I'm going to either run SAUCIE again for visualization after compressing to 10 dimensions or run UMAP on the resulting 10-dimensional space. UMAP isn't too bad in terms of time complexity, as I typically run it with 30 neighbors. Most of the computational weight (I've found) comes from the number of cells, and reducing the number of dimensions from 32 markers to 10 leads to a considerable speedup.

I have a couple of new batches of data that I will be collecting soon, so I will start playing around with batch correction. Said batches will have a reference sample, so I'm super stoked to test SAUCIE on them. Batch correction has really been limited in the field, with much of the heavy lifting done by barcoding and staining a lot of samples at once. I can't wait to see how the reference aligns between batches. Do you have any advice on batch correction tuning?

Again, thank you so much! This work has really enabled a super fast and awesome analysis for my dissertation work.