bvaldebenitom / SoloTE

GNU General Public License v3.0

Many more cells were retained in the matrix #43

Closed XiaoyuZhan520 closed 3 months ago

XiaoyuZhan520 commented 5 months ago

Hello,

Thanks for your work in SoloTE!

I quantified TE expression with scTE and SoloTE. However, the number of cells in the SoloTE matrix (121691) is much higher than in the scTE matrix (10000). Even when I changed the '--minoverlap' parameter from 5 to 40, the number did not change much. Am I missing something I should pay attention to?

bvaldebenitom commented 5 months ago

Hi @XiaoyuZhan520,

thanks for using the tool.

You are not missing any information. If I recall correctly, scTE has some options regarding the expected number of cells and/or cell filtering. On the other hand, SoloTE just processes all the barcodes appearing in the CB tags of the BAM files.

The --minoverlap option in SoloTE is related to the minimum number of bases that a read must have overlapping with a TE region in order to quantify it towards that TE.
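To make the overlap criterion concrete, here is a minimal sketch of how a minimum-overlap check can be computed. It is not SoloTE's actual implementation: the function names are hypothetical, coordinates are assumed to be half-open, and the read is treated as a single contiguous block (spliced alignments are ignored).

```python
def overlap_bases(read_start, read_end, te_start, te_end):
    """Number of bases shared by two half-open intervals [start, end)."""
    return max(0, min(read_end, te_end) - max(read_start, te_start))


def passes_min_overlap(read, te, min_overlap):
    """True when the read overlaps the TE by at least min_overlap bases."""
    return overlap_bases(read[0], read[1], te[0], te[1]) >= min_overlap


te = (1000, 1100)    # a 100 bp TE (toy coordinates)
read = (1090, 1180)  # a 90 bp read covering only the last 10 bp of the TE

print(overlap_bases(*read, *te))         # 10
print(passes_min_overlap(read, te, 5))   # True
print(passes_min_overlap(read, te, 20))  # False
```

Because the threshold only decides whether a read is counted towards a TE, it changes the counts per TE, not which barcodes appear in the matrix.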

The main consideration for SoloTE is that once the matrix is generated, you load it into your preferred analysis tool (Seurat, Scanpy, etc.) and then subset it to a set of reference barcodes. If you have already analyzed your data, we suggest using the final set of barcodes from that analysis. If you haven't, the barcodes in the "filtered" result of CellRanger are a good starting point.
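The subset step boils down to intersecting the barcodes of the matrix with the reference barcodes. Below is a minimal, tool-independent sketch in plain Python. The function names and toy barcodes are hypothetical, and stripping a "-1" style suffix is an assumption about how barcodes from different tools may differ; check your own files before relying on it.

```python
def strip_suffix(barcode):
    """Drop an optional suffix such as '-1' so barcodes from different tools compare equal."""
    return barcode.split("-", 1)[0]


def columns_to_keep(matrix_barcodes, reference_barcodes):
    """Indices of matrix columns whose barcode is in the reference set."""
    reference = {strip_suffix(b) for b in reference_barcodes}
    return [i for i, b in enumerate(matrix_barcodes) if strip_suffix(b) in reference]


# Toy data: four barcodes in the TE matrix, two of which passed cell calling.
matrix_barcodes = ["AAACCTGA-1", "AAACGGGT-1", "AAAGATGC-1", "AAATGCCA-1"]
reference_barcodes = ["AAACGGGT", "AAATGCCA"]

keep = columns_to_keep(matrix_barcodes, reference_barcodes)
print(keep)  # [1, 3]
```

The resulting indices (or the matching barcode names) can then be passed to the subsetting mechanism of whichever analysis tool you use. If several samples were merged, the suffix may distinguish them, in which case it should not be stripped.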

Please let me know if this solves your question and/or if you need help doing the subset step.

Best, Braulio.

XiaoyuZhan520 commented 4 months ago

Thanks for your information! It is very helpful!

Currently, I am subsetting to the cells kept in my Seurat analysis, which were filtered on protein-coding gene expression, although I am sure that some cells with high TE expression and low protein-coding gene expression will be filtered out.

Besides, I am wondering if you have any suggestions about setting the --minoverlap option. The shortest TE I am interested in is 100 bp, and the scRNA-seq read length is around 90 bp, so I think --minoverlap=20 may be more appropriate. I would be grateful for any suggestions. Many thanks in advance!

bvaldebenitom commented 4 months ago

@XiaoyuZhan520 you are welcome!

Actually, you have a very good point that there could be TE-only (i.e., high TE expression, low gene expression) cells removed! In this case, I suggest starting from the SoloTE matrix and applying similar QC filtering to it. You can follow the idea from Chang et al. (https://genome.cshlp.org/content/32/7/1408.full) in which they modified thresholds based on the contributions of TEs to the transcriptome.
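As a rough illustration of applying QC to the SoloTE matrix itself, the sketch below computes per-cell totals on genes and TEs together, so that a TE-rich cell is not discarded for having few gene counts. It is not the procedure from Chang et al.; the thresholds, the toy feature names, and the `is_te` prefix rule are all assumptions to adapt to your own data.

```python
def is_te(feature):
    # Assumption: TE features carry a recognizable prefix; adjust to your matrix.
    return feature.startswith("SoloTE")


def cell_qc(counts):
    """Summarize one cell given a {feature: count} dict."""
    total = sum(counts.values())
    te_total = sum(c for f, c in counts.items() if is_te(f))
    return {
        "total": total,
        "n_features": sum(1 for c in counts.values() if c > 0),
        "te_fraction": te_total / total if total else 0.0,
    }


def keep_cells(matrix, min_total, min_features):
    """Barcodes passing thresholds computed on genes and TEs together."""
    kept = []
    for barcode, counts in matrix.items():
        qc = cell_qc(counts)
        if qc["total"] >= min_total and qc["n_features"] >= min_features:
            kept.append(barcode)
    return kept


# Toy matrix with made-up feature names and counts.
matrix = {
    "cell_gene_rich": {"GAPDH": 40, "ACTB": 30, "SoloTE|AluY": 5},
    "cell_te_rich": {"GAPDH": 2, "SoloTE|AluY": 35, "SoloTE|L1HS": 30},
    "empty_droplet": {"SoloTE|AluY": 1},
}
print(keep_cells(matrix, min_total=50, min_features=3))
# ['cell_gene_rich', 'cell_te_rich']
```

In this toy example a gene-only threshold would drop `cell_te_rich` (2 gene counts), while the combined threshold keeps it. The `te_fraction` value can be inspected per cell to decide how far thresholds should be shifted.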

Regarding the --minoverlap option, I agree that setting it to 20 might be more appropriate for your question. However, I have noticed in some test cases that a TE can provide an alternative splice site located towards its end, so that reads overlap the TE by only ~10 bp. With that said, you can either leave this option at the default and filter afterwards based on length, or set it to 20, or even 50. I would prefer the first option, as it lets you assign the largest number of reads to TEs. You can then use bedtools coverage to check how much of each TE is covered by reads and filter TEs based on that. If you prefer to go this way, let me know if you need further help with it.
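A possible way to do the coverage-based filtering is to run something like `bedtools coverage -a tes.bed -b reads.bam > coverage.tsv` (file names are placeholders) and then keep TEs whose covered fraction exceeds a cutoff. The sketch below parses that output; it assumes a BED6 file for `-a` with the TE name in column 4 and the default output format, in which the last column is the fraction of the TE covered by at least one read. The function names and the 0.5 cutoff are hypothetical.

```python
def covered_fraction(line):
    """Return (TE name, fraction covered) from one line of `bedtools coverage` output.

    Assumes a BED6 "A" file (name in column 4) and default output, where the
    last column is the fraction of the A feature covered by at least one read.
    """
    fields = line.rstrip("\n").split("\t")
    return fields[3], float(fields[-1])


def well_covered_tes(lines, min_fraction):
    """Names of TEs whose covered fraction is at least min_fraction."""
    kept = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, fraction = covered_fraction(line)
        if fraction >= min_fraction:
            kept.append(name)
    return kept


# Toy output columns:
# chrom, start, end, name, score, strand, reads, bases covered, length, fraction
example = [
    "chr1\t1000\t1100\tTE_a\t0\t+\t12\t80\t100\t0.8000000",
    "chr1\t5000\t5300\tTE_b\t0\t-\t3\t15\t300\t0.0500000",
]
print(well_covered_tes(example, min_fraction=0.5))  # ['TE_a']
```

With a real file, `well_covered_tes(open("coverage.tsv"), 0.5)` would return the names to keep, which can then be used to subset the rows of the TE matrix.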

I'll update the README to illustrate these points further.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 10 days with no activity.

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.