How to obtain UMAP coordinates and cluster ID per cell?

denvercal1234GitHub commented 1 year ago

Hi there,

Thanks for the package.

The pipeline produced FCS files and PDFs of UMAP plots.

I plan to use these FCS files in other packages, such as Spectre, to analyse the data. Is there a way to get the UMAP coordinates and cluster ID of each cell produced by the infinityFlow pipeline, so that I can explore the UMAP programmatically instead of only having the PDFs?

I ask because the manuscript (quoted below) describes specific steps that need to be applied to the imputed FCS files output by the infinityFlow pipeline. Would you mind specifying how to do these steps for the imputed FCS files so we can get accurate UMAPs?

For dimensionality reduction, we used the UMAP algorithm (29) using its R implementation in the uwot package. UMAP was run with parameters n_neighbors = 15, min_dist = 0.2, metric = “euclidean,” and n_epochs = 2000. To run UMAP on imputed data, we first removed every events whose PE intensity was higher than 1/32 of the cytometer’s manufacturer-reported maximum linear range, as non-linear effects were causing compensation issues for these events (fig. S14), and applied background correction to the imputed intensities. To plot color-coded markers’ intensities on UMAP embeddings, we truncated the imputed intensities data vectors to their lower and upper 5/1000 quantiles.
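The last step of the quoted passage (truncating intensities to their lower and upper 5/1000 quantiles before color-coding) could look roughly like the sketch below. This is not the authors' code: it is plain Python rather than R, the function names are made up for illustration, and the linear-interpolation quantile is an assumption, since the passage does not say which quantile definition was used.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, 0 <= q <= 1.

    Assumption: the quoted passage does not specify a quantile definition.
    """
    if not sorted_values:
        raise ValueError("cannot take a quantile of an empty list")
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def truncate_to_quantiles(values, lower=0.005, upper=0.995):
    """Clip every value into the [lower, upper] quantile range of the vector."""
    ordered = sorted(values)
    lo = quantile(ordered, lower)
    hi = quantile(ordered, upper)
    return [min(max(v, lo), hi) for v in values]


# Hypothetical imputed intensities for one marker.
intensities = [float(i) for i in range(1001)]
clipped = truncate_to_quantiles(intensities)
```

The clipped vector keeps one value per event, so it can be used directly as the color scale for a UMAP plot without a handful of extreme events dominating the range.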

Thank you.

ebecht commented 1 year ago

Hi @denvercal1234GitHub

By default you should be getting new channels in your FCS files with names "UMAP1" and "UMAP2" that correspond to (linearly-rescaled) UMAP embeddings of the backbone.

Clustering is not part of the pipeline so you would have to run that yourself.

If you want to run UMAP or clustering on the imputed data, this is not part of the pipeline at the moment. My two pieces of advice are:

1. Use the background-corrected files if autofluorescence / unspecific binding is affecting the PE intensity (see Sup Fig 5 of the paper).
2. Exclude events with close-to-saturating measurements on the PE channel. If you don't know what I am talking about, you may want to color the UMAP by raw PE intensity (across all wells in parallel). Cells with PE saturation issues should appear as distinct clumps on the UMAP (Sup Fig 14 of the paper).

Best, Etienne

denvercal1234GitHub commented 1 year ago

Hi @ebecht - Thank you so much for your response.

QUESTION 1. When you said "exclude events with close to saturating measurements on the PE channel," did you mean that I should perform clustering downstream of infinityFlow and then visualize the clusters on the UMAP (using raw expression) to identify these events? If so, how would you select these events programmatically, especially since they appear to be spread across multiple clusters? Or did you mean that, after downstream clustering, you simply remove events with raw PE expression above a certain value?

QUESTION 2. Do you recommend filtering out all events, across all FCS files, that have expression above 10^6 before running infinityFlow? My data was acquired on a Cytek Aurora with 4 lasers.

From issue #11: how did you remove these "undercompensated" events, or did you remove the whole FCS file that contained them? Do you have a quick script for this removal that you would not mind sharing? I did not see one in the deposited code.

(Your point 2 and hence my 2 questions above are responding to issue #11; once answered here, I will close issue 11. Thanks again for your help!)

ebecht commented 1 year ago

I am not super familiar with spectral flow sadly, so I cannot precisely answer your questions.

For conventional flow cytometry, the levels at which detectors saturate are specified by vendors (although on our data we thought that the vendors were a bit optimistic). The way we did this in our paper was to use the manufacturer's threshold/32 as a cutoff to exclude events.
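A minimal sketch of that cutoff, in plain Python rather than R and not taken from the infinityFlow code base. It assumes the events have already been read into a list of dicts with one key per channel. The channel name "PE", the function names, and the maximum of 262144 are placeholders; use the value reported for your own instrument.

```python
def pe_cutoff(manufacturer_max, divisor=32):
    """Cutoff as described above: the manufacturer's threshold divided by 32."""
    return manufacturer_max / divisor


def drop_saturating_events(events, pe_channel, cutoff):
    """Keep only events whose raw PE intensity does not exceed the cutoff."""
    return [event for event in events if event[pe_channel] <= cutoff]


# Hypothetical events; 262144 is a placeholder maximum, not a recommendation.
events = [
    {"PE": 100.0, "UMAP1": 0.1, "UMAP2": 0.2},
    {"PE": 8192.0, "UMAP1": 0.5, "UMAP2": 0.4},
    {"PE": 9000.0, "UMAP1": 0.9, "UMAP2": 0.8},
]
cutoff = pe_cutoff(262144)
kept = drop_saturating_events(events, "PE", cutoff)
```

Applying the same cutoff to every file keeps the filtering consistent across wells. The divisor of 32 is what worked on the paper's data and may need adjusting for other instruments.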

One way to verify this is to color the UMAP embedding of the backbone data by raw PE intensity. Any clump of cells with very bright PE intensity is likely caused by saturation (which causes non-correctable spillover issues in other channels and thus is seen in the UMAP embedding of the backbone data).
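For anyone who wants a programmatic version of this visual check, one possible approach (an illustration only, not something the pipeline provides) is to bin the backbone UMAP into a grid and compute the median raw PE intensity per bin. Bins with a very high median are candidates for the bright clumps described above. The sketch is plain Python and assumes events are dicts with hypothetical keys "UMAP1", "UMAP2" and "PE".

```python
from collections import defaultdict
from statistics import median


def median_pe_by_umap_bin(events, n_bins=50, x="UMAP1", y="UMAP2", pe="PE"):
    """Median raw PE intensity per cell of an n_bins x n_bins grid over the UMAP."""
    xs = [event[x] for event in events]
    ys = [event[y] for event in events]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    def bin_index(value, low, high):
        # Map a coordinate to a bin in [0, n_bins - 1]; a flat axis maps to bin 0.
        if high == low:
            return 0
        return min(int((value - low) / (high - low) * n_bins), n_bins - 1)

    bins = defaultdict(list)
    for event in events:
        key = (bin_index(event[x], x_min, x_max), bin_index(event[y], y_min, y_max))
        bins[key].append(event[pe])
    return {key: median(values) for key, values in bins.items()}


def bright_bins(bin_medians, cutoff):
    """Grid cells whose median PE exceeds the chosen cutoff, sorted by position."""
    return sorted(key for key, value in bin_medians.items() if value > cutoff)
```

Calling `bright_bins(median_pe_by_umap_bin(events), cutoff)` returns the grid cells to inspect. The grid resolution and the cutoff are both choices to tune per dataset.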

Yes, I think that you should remove these events across all files.

denvercal1234GitHub commented 1 year ago

Thanks @ebecht! For Q1, how did you come up with the number 32?

My apologies, but the QC of the imputed data is giving me some trouble. Really looking forward to your QC tools (plots to identify saturation and to compare predictions with measurements on held-out events, with computed AUC and r) as mentioned in issue #19.

Thanks again for your help!

ebecht commented 1 year ago

Hi,

The number 32 came from the analysis of the UMAP embedding on imputed data. It showed that the cells with saturated predictions corresponded to PE intensities around the manufacturer's saturation limit/32. This may depend on the instrument.

You can check supplementary figure 15 of our paper to see how to choose that threshold for your dataset. The goal is basically to exclude those small clusters of cells with high PE intensity. These are problematic because saturation means that other backbone channels are undercompensated for PE. As a result, the models see events with a specific backbone signature associated with high PE signal and end up predicting high expression for any saturating antibody on these cells. As far as I understand it, this is perfectly sensible behaviour from the models, and a technical artifact due to the measurements themselves.

[Image: SupFig15]

ebecht / infinityFlow

How to obtain UMAP coordinates and cluster ID per cell? #14