Integrated data: need to set reduction argument when computing UMAP

Dear developer team of scDEED,

Thank you so much for developing this great new tool. I've been enjoying exploring my data with this new concept proposed in your paper.

My results with a single sample make good sense, but those from my multi-sample data, which were integrated using Seurat v5, do not. (I understand that a lot changed in version 5, but I don't think that is the main point here.) scDEED flagged a surprisingly large number of cells as dubious. For runs on an individual sample, I usually got between about 10 and fewer than 100 dubious cells in a 5000-cell sample; when I ran scDEED on my integrated samples, the number rose to more than 100, or even 200, dubious cells per sample.

Since I used SCTransform for normalizing my data (within each individual sample), when I ran UMAP with my integrated data, I had to specify the reduction to call so that UMAP would be calculated on the integrated & dimension-reduced embedding:

seurat_obj <- RunUMAP(
    seurat_obj, 
    reduction = "CCAintegration", 
    dims = 1:50,
    reduction.name = "umap.integrated")

This is consistent with the SCTransform section at the bottom of the Seurat integration tutorial: https://satijalab.org/seurat/articles/integration_introduction

So the first thing I realized was that the default_assay argument of scDEED() has to be changed to Integrated. My first suggestion for the next version is to use the active assay by default instead of requiring it to be set manually: it is counterintuitive that, even after setting the active assay of the Seurat object to something else, scDEED still computes its results from the RNA assay.

Another thing I realized is that, when UMAP is run inside scDEED, it is not clear whether the correct reduction for computing UMAP (here "CCAintegration", not "pca") is picked up. I read the documentation of the scDEED() function and saw the pre_embedding argument. Since the computed UMAP should be compared with the "CCAintegration" embedding rather than "pca" (if my understanding is correct), I would like to confirm that setting pre_embedding to "CCAintegration" does what is expected.

Related to this, I wonder whether the slot argument needs to be changed, because it is unclear what the permuted data (used to estimate the null distribution) should be in this case. The two major differences from the tutorial examples (SCTransform normalization rather than log normalization, and integrated data rather than a single sample) may require additional computation outside scDEED(). I haven't been able to figure this out so far and hope to discuss it with the developer team.

Again, thank you so much for developing this tool! I indeed noticed some very important differences when I inspected our data again under the optimized parameters.

Best regards, Jason Leong.

Dear Jason,

Thank you very much for your detailed feedback, which has been extremely helpful for us to improve the package. I’m including my student Christy Lee, who has been actively maintaining the scDEED package, in this email, as well as the first author Lucy Xia.

I think your first two points are valid, and we should probably have a separate tutorial for integrated data (do you think this is a good idea?). Also, I like the active assay idea.

Regarding your last question about the permuted data when the input data is integrated data, I would say that we would permute the integrated data by treating it as one dataset. The rationale is that scDEED only evaluates the 2D visualization step, not the previous steps including integration. I hope this makes sense.

Once we have fixed these issues, Christy will reply to this thread. Thanks so much, Christy!

(In reply to Jason Leong's message of Thu, Aug 15, 2024, 9:57 AM.)

Best, Jessica

Jingyi Jessica Li (李婧翌), Ph.D.

Professor Department of Statistics and Data Science (Primary) Departments of Biostatistics, Computational Medicine, and Human Genetics (Secondary) University of California, Los Angeles http://jsb.ucla.edu Twitter: @jsb_ucla

Hello Jason, Thanks for your feedback; the integration + CCA problem was something I had not previously considered.

Specifying pre_embedding is partly right; the problem is that the Permuted function is only set up for PCA. The slot argument only specifies the input to PCA.

Thus I think the solution is to pull the integrated CCA space from the original object, permute it, then add it back to create a permuted object. We then supply scDEED with the permuted object and specify the pre_embedding slot. I added a new bit to the tutorial.
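
The permutation idea described above can be sketched in base R on a stand-in matrix. This is only an illustration under stated assumptions, not the code from the scDEED tutorial: `emb` is a hypothetical cells-by-dimensions matrix standing in for the integrated embedding, `permute_embedding` is a hypothetical helper, and shuffling each dimension independently across cells is an assumption about what "permute it" means here.

```r
# Hypothetical stand-in for an integrated embedding (cells x dimensions).
# With a Seurat object, this matrix would presumably come from
# Embeddings() on the integrated reduction (an assumption, not run here).
set.seed(1)
emb <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5,
              dimnames = list(paste0("cell", 1:200), paste0("dim_", 1:5)))

# Hypothetical helper: shuffle each dimension independently across cells.
# Every column keeps its own set of values, but the joint structure
# between dimensions within a cell is broken.
permute_embedding <- function(embedding, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  permuted <- embedding
  for (j in seq_len(ncol(embedding))) {
    # Index by a random row order so row and column names stay in place.
    permuted[, j] <- embedding[sample.int(nrow(embedding)), j]
  }
  permuted
}

emb_permuted <- permute_embedding(emb, seed = 42)
```

In a real analysis, the permuted matrix would then be written back into a copy of the Seurat object, and that permuted object supplied to scDEED together with pre_embedding; the exact calls for that step are the ones given in the tutorial.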

I also changed the default_assay option to automatically use the active.assay, unless the user specifies otherwise.

Best, Christy

Dear Christy and Jessica,

Thank you so much for your prompt replies!

I have quickly tried out the code in the new tutorial section ("Working with integrated data") using my own data. The results now make much more sense, with a total of 40 cells identified as dubious (instead of hundreds) under the default RunUMAP() parameters across two samples. In addition, the cells/clusters identified as dubious or trustworthy in the integrated data seem to be consistent with those analyzed separately in the individual samples. Now I am moving on to the grid search to look for the optimal parameters.

I think the issue that I was initially stuck on was likely due to using incorrect permuted data.

When I first ran the new tutorial code, I got exactly the same results as before (when I simply set pre_embedding = "CCAintegration" and assay="SCT.integrated"), with very few trustworthy cells and a very large number of intermediate and dubious cells.
However, I realized that I had probably forgotten to execute the permutation step in my script (i.e., the for loop in the new tutorial section): when I checked Embeddings(data.permuted[["CCAintegration"]]), the matrix was exactly the same as the embeddings in the original data.
After making sure that I had run everything properly, the scDEED results are now broadly consistent with those I got when analyzing the individual samples separately. I also checked that the embedding matrix in data.permuted is indeed different from that of the original data. This means that, in the previous version, simply setting pre_embedding to the integrated embedding in scDEED() was not enough; a separately prepared permuted embedding is necessary.
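
The check described above, that the permuted embedding really differs from the original, can be written in a few lines of R. This is a minimal sketch on stand-in matrices: `original` and `permuted` are hypothetical names for the two embedding matrices (with Seurat objects they would be obtained via Embeddings() on the original and permuted objects), and `permutation_applied` is a hypothetical helper, not part of scDEED.

```r
# Hypothetical helper: TRUE if the permuted embedding differs from the original.
permutation_applied <- function(original, permuted) {
  # The two embeddings must at least have the same shape.
  stopifnot(identical(dim(original), dim(permuted)))
  # all.equal() returns TRUE or a character description of the differences.
  !isTRUE(all.equal(original, permuted, check.attributes = FALSE))
}

set.seed(7)
original <- matrix(rnorm(100 * 3), nrow = 100)

# Case 1: the permutation loop was never executed, so the copy is unchanged.
forgot_to_permute <- original

# Case 2: each column shuffled independently across rows.
permuted <- apply(original, 2, sample)

permutation_applied(original, forgot_to_permute)  # FALSE: step was skipped
permutation_applied(original, permuted)           # TRUE: embedding changed
```

Running such a check before calling scDEED() makes a skipped permutation step visible immediately, rather than only through an unexpectedly large number of dubious cells.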

In addition, I also briefly went through the code in the Permuted() and the Distances.UMAP() functions, and I think that the DimReduc to call when computing UMAP should be correct (by setting reduction = pre_embedding).

Thank you so much for the new tutorial section dedicated to working with integrated datasets! I believe the community will benefit from it, because many researchers now work with multiple datasets. I wonder whether you would consider expanding this section a little by including the results and visualizations from the example dataset ifnb, rather than simply stating that the code works; I think results from an example dataset would help convince others.

Best, Jason.

Hi Jason, Thanks for the suggestion; we have updated our tutorial to expand a bit more on the integration results.

Best, Christy

JSB-UCLA / scDEED

Integrated data: need to set reduction argument when computing UMAP #10