Query regarding the total no. of cells in Seurat object identified after running SoloTE

vikaskumar1019 commented 2 weeks ago

Hello,

I tried SoloTE on a published dataset. After running it on the BAM files and building the Seurat object, I see considerable variability in the total number of cells compared with the published article. What could be the reasons for this, and how can it be justified? I am following the tutorial provided. Also, if clusters are built from these cell numbers, how much would the differential expression of genes and TEs be affected?

| Sample_Id | Analysed | Published |
|-----------|----------|-----------|
| Sample1   | 12728    | 5758      |
| Sample2   | 20509    | 14844     |
| Sample3   | 18732    | 6106      |
| Sample4   | 28643    | 12324     |
| Sample5   | 324      | 590       |
| Sample6   | 2336     | 2494      |
| Sample7   | 12932    | 9970      |
| Sample8   | 22075    | 15975     |
| Sample9   | 1644     | 1663      |
| Sample10  | 3544     | 3697      |
| Sample11  | 13354    | 9059      |
| Sample12  | 6189     | 8233      |
| Sample13  | 12627    | 6220      |
| Sample14  | 7918     | 6156      |
| Sample15  | 1027     | 2048      |
| Sample16  | 50532    | 15141     |
| Sample17  | 18743    | 10655     |
| Sample18  | 12350    | 8191      |
| Sample19  | 13476    | 4987      |
| Sample20  | 21755    | 14090     |
| Sample21  | 24249    | 11295     |

Thanks!

github-actions[bot] commented 1 week ago

This issue is stale because it has been open for 10 days with no activity.

bvaldebenitom commented 1 week ago

Hi @vikaskumar1019,

sorry for the delay in answering.

SoloTE processes all barcodes, which often yields more cells than published works report. What I would do is subset the SoloTE matrix to only the barcodes in the published dataset. Those barcodes have usually already passed quality control, so they represent valid cells. Subsetting to them lets us study the net impact of TE expression without the cells that failed the original study's filtering acting as confounding variables.
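A minimal sketch of the barcode subsetting described above, written in plain Python rather than Seurat. It assumes the count matrix has already been loaded as a nested dictionary of the form `{feature: {barcode: UMI count}}`. That layout, the function name `subset_to_barcodes`, and the toy barcodes are illustrative assumptions, not part of SoloTE.

```python
def subset_to_barcodes(counts, keep_barcodes):
    """Keep only the columns (cells) whose barcode is in keep_barcodes.

    counts: dict mapping feature name -> {barcode: UMI count}
            (hypothetical in-memory layout, assumed for this sketch).
    keep_barcodes: iterable of barcodes taken from the published dataset.
    All features are retained, even if no kept cell expresses them.
    """
    keep = set(keep_barcodes)
    return {
        feature: {bc: n for bc, n in per_cell.items() if bc in keep}
        for feature, per_cell in counts.items()
    }


# Toy example with made-up barcodes and feature names.
counts = {
    "GeneA": {"AAAC-1": 5, "TTTG-1": 2},
    "TE1": {"TTTG-1": 1},
}
published = ["AAAC-1"]
subset = subset_to_barcodes(counts, published)
```

The barcode strings have to match exactly between the two sources, so any difference in formatting (for example a `-1` suffix present in one and absent in the other) would need to be harmonized first.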

Best, Braulio.

vikaskumar1019 commented 1 week ago

No worries! Thank you for your reply. Unfortunately, I don't have the matrix file from the published results; they only provide the raw reads. Could you suggest what I can do in this situation to get optimal results for TE expression, since that is all I am interested in? Is there any parameter you think could be changed or used to get optimal TE expression results with Seurat? Thanks!

bvaldebenitom commented 4 days ago

Hi @vikaskumar1019,

thanks for clarifying that information.

Based on what you say, you are not necessarily interested in matching the original results in terms of gene expression? From a scientific standpoint, I still think the first step should be a positive control: try to replicate the gene expression results of the original work.

As for TE-specific ideas, what I see in your numbers above is that with SoloTE you are getting more cells in most samples. There might be some TE-only cell populations, which could be really interesting!

What I would do is:

1. Analyze the SoloTE matrix and get the number of locus-specific TEs. In some tests, I observed that locus-specific TEs were very sparse (i.e., a large number of them were expressed in only one or two cells), and thus I discarded them. A simple approach would be to keep TEs expressed in at least 3 cells, with a minimum of 10 counts / UMIs per cell (although you can further fine-tune these parameters depending on your specific results).
2. Apply standard QC metrics based on the number of UMIs, the number of features expressed per cell, and mitochondrial percentages.
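The sparsity filter in the first step could be sketched as follows, in plain Python rather than Seurat. This takes one possible reading of the thresholds: a TE is kept if at least `min_cells` cells each carry at least `min_counts` UMIs for it. The nested-dictionary layout, the function name `filter_sparse_tes`, and the `SoloTE` name prefix used to recognize TEs are assumptions made for illustration; check how TE features are actually named in your matrix.

```python
def filter_sparse_tes(counts, is_te, min_cells=3, min_counts=10):
    """Drop TE features that are too sparse; leave all other features alone.

    counts: dict mapping feature name -> {barcode: UMI count}
            (hypothetical in-memory layout, assumed for this sketch).
    is_te: function(feature_name) -> bool, supplied by the caller.
    A TE is kept when at least min_cells cells have >= min_counts UMIs.
    """
    kept = {}
    for feature, per_cell in counts.items():
        if not is_te(feature):
            kept[feature] = per_cell  # genes are not filtered here
            continue
        n_cells = sum(1 for n in per_cell.values() if n >= min_counts)
        if n_cells >= min_cells:
            kept[feature] = per_cell
    return kept


# Toy example; feature names and the "SoloTE" prefix are assumptions.
counts = {
    "SoloTE-locusA": {"c1": 12, "c2": 10, "c3": 15, "c4": 1},
    "SoloTE-locusB": {"c1": 30, "c2": 2},
    "GeneA": {"c1": 1},
}
filtered = filter_sparse_tes(counts, lambda name: name.startswith("SoloTE"))
```

Both thresholds are exposed as parameters so they can be tuned to the data, as suggested above.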

So basically, the first step would allow you to improve the signal-to-noise ratio. After QC, you should do a preliminary clustering and check whether you get well-separated clusters and good markers per cluster. Depending on this result, you might need to iterate again through steps 1-2.
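The per-cell QC in step 2 could look roughly like this in plain Python. It is only a sketch: the nested-dictionary layout, the function names, the `MT-` prefix for mitochondrial genes (the usual human gene-symbol convention), and the default threshold values are all assumptions to adapt to your data, not recommendations from SoloTE or Seurat.

```python
def per_cell_qc(counts, mito_prefix="MT-"):
    """Compute UMIs, detected features, and mitochondrial percentage per cell.

    counts: dict mapping feature name -> {barcode: UMI count}
            (hypothetical in-memory layout, assumed for this sketch).
    """
    metrics = {}
    for feature, per_cell in counts.items():
        is_mito = feature.startswith(mito_prefix)
        for bc, n in per_cell.items():
            if n <= 0:
                continue  # a zero entry is not a detected feature
            m = metrics.setdefault(bc, {"n_umis": 0, "n_features": 0, "mito_umis": 0})
            m["n_umis"] += n
            m["n_features"] += 1
            if is_mito:
                m["mito_umis"] += n
    for m in metrics.values():
        m["pct_mito"] = 100.0 * m["mito_umis"] / m["n_umis"]
    return metrics


def passing_cells(metrics, min_umis=500, min_features=200, max_pct_mito=20.0):
    """Return barcodes that satisfy all three (illustrative) thresholds."""
    return {
        bc for bc, m in metrics.items()
        if m["n_umis"] >= min_umis
        and m["n_features"] >= min_features
        and m["pct_mito"] <= max_pct_mito
    }


# Toy example with made-up counts and deliberately tiny thresholds.
counts = {
    "GeneA": {"c1": 8, "c2": 1},
    "MT-CO1": {"c1": 2, "c2": 9},
    "TE1": {"c1": 10},
}
metrics = per_cell_qc(counts)
good = passing_cells(metrics, min_umis=10, min_features=2, max_pct_mito=20.0)
```

The cells that pass would then go into the preliminary clustering, and the thresholds can be revisited if the clusters or markers look poor.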

bvaldebenitom / SoloTE, issue #57