broadinstitute / Drop-seq

Java tools for analyzing Drop-seq data
MIT License

DGE matrix with bimodal distributions of total counts #415

Open ccuriqueo opened 4 months ago

ccuriqueo commented 4 months ago

Hello, I hope you can guide me. I downloaded the FASTQ files from SRA (SRR9843421); this is sequencing data from Microwell-seq. I used <fastq-dump --split-files -gz SRRXX> and then followed the Drop-seq protocol you posted. I generated the DGE matrix with the following command:

./DigitalExpression I=/mnt/d/output_control/my_clean.bam O=/mnt/d/output_control/control.dge.txt.gz SUMMARY=/mnt/d/output_control/control.dge.summary.txt TMP_DIR=/mnt/d/input_control/ NUM_CORE_BARCODES=10000

When I loaded this as an adata object, I got the result shown in the image below. My question is whether I need to perform any preprocessing steps on the FASTQ files first, or whether it is enough to use the protocol directly. When I start quality control, many counts are filtered out and the library shrinks considerably.

I attach an image showing how the adata object looks when plotting genes by counts vs. total counts.

[Image: screenshot, 2024-05-03 234548]

jamesnemesh commented 4 months ago

Hi,

I'm not at all familiar with Microwell-seq, so if it has technical requirements that Drop-seq doesn't meet, I can't answer those questions. The original Microwell paper is pretty light on data processing details.

For the bimodal data you've posted, it's quite possible that by forcing extraction of 10K cell barcodes you are extracting both cell barcodes that captured cells and cell barcodes that captured only ambient RNA. Since the counts and the number of genes are so correlated, I think it would make sense to plot a 1D density plot of total counts, which should be bimodal but will give you a better sense of how many cell barcodes are in each mode.
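As a rough illustration of the 1D density idea, here is a minimal sketch in plain Python that bins log10(total counts) per cell barcode and prints a text histogram. Everything in it is an assumption for illustration: the function names `log10_histogram` and `print_histogram` are hypothetical, the bin width is arbitrary, and the counts are synthetic (two log-normal populations standing in for ambient-only and cell-containing barcodes). With real data you would pass in the per-barcode totals instead, for example the column sums of the DGE matrix.

```python
import math
import random
from collections import Counter


def log10_histogram(total_counts, bin_width=0.25):
    """Bin log10(total counts) per cell barcode into fixed-width bins.

    Returns a sorted list of (bin_left_edge, n_barcodes) tuples.
    """
    bins = Counter()
    for c in total_counts:
        if c > 0:  # log10 is undefined for zero-count barcodes, so skip them
            bins[math.floor(math.log10(c) / bin_width)] += 1
    return [(k * bin_width, bins[k]) for k in sorted(bins)]


def print_histogram(hist, width=50):
    """Print one row per bin: left edge (log10), barcode count, and a bar."""
    peak = max(n for _, n in hist)
    for left, n in hist:
        bar = "#" * max(1, round(width * n / peak))
        print(f"{left:5.2f}  {n:6d}  {bar}")


# Synthetic stand-in for per-barcode totals (NOT real Microwell-seq data):
# 7000 low-count "ambient" barcodes and 3000 high-count "cell" barcodes.
random.seed(0)
ambient = [max(1, round(random.lognormvariate(math.log(30), 0.5))) for _ in range(7000)]
cells = [max(1, round(random.lognormvariate(math.log(2000), 0.5))) for _ in range(3000)]

hist = log10_histogram(ambient + cells)
print_histogram(hist)
```

Plotting on a log scale is a design choice here: the two modes can differ by orders of magnitude, and on a linear axis the low-count mode tends to collapse into a single spike near zero.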

Generally there's a cell selection step in most scRNA-seq pipelines, which you might have to implement here. If you haven't already done this, a more general approach would be to extract all cell barcodes with at least some minimum number of transcripts (20 or 100), repeat this plot, and then select the cell barcodes in the mode with the higher number of counts.
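The selection step described above could be sketched roughly as follows. This is not the Drop-seq tools' own method: the cut between the two modes is found with an Otsu-style split (maximizing between-class variance on log10 counts), which is a technique swapped in for illustration. The names `otsu_split` and `select_high_mode`, the barcode labels, and the synthetic counts are all hypothetical; only the minimum-transcript values (20 or 100) come from the thread.

```python
import math
import random


def otsu_split(values):
    """Return the cut point that maximizes between-class variance.

    Returns None if all values are identical (no split is possible).
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    best_cut, best_score = None, -1.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # cannot cut between two equal values
        w0, w1 = i / n, (n - i) / n          # class weights
        m0 = left_sum / i                    # mean of the lower class
        m1 = (total - left_sum) / (n - i)    # mean of the upper class
        score = w0 * w1 * (m0 - m1) ** 2     # between-class variance
        if score > best_score:
            best_score = score
            best_cut = (xs[i - 1] + xs[i]) / 2
    return best_cut


def select_high_mode(counts_by_barcode, min_transcripts=20):
    """Drop barcodes below min_transcripts, then keep the higher-count mode.

    Returns (selected_barcodes_dict, threshold_in_raw_counts).
    """
    kept = {bc: c for bc, c in counts_by_barcode.items() if c >= min_transcripts}
    cut = otsu_split([math.log10(c) for c in kept.values()])
    if cut is None:  # nothing to separate; return everything that passed
        return kept, float(min_transcripts)
    selected = {bc: c for bc, c in kept.items() if math.log10(c) > cut}
    return selected, 10 ** cut


# Synthetic stand-in data (NOT real): hypothetical barcode labels and counts.
random.seed(1)
counts = {}
for i in range(7000):
    counts[f"ambient_{i}"] = max(1, round(random.lognormvariate(math.log(30), 0.5)))
for i in range(3000):
    counts[f"cell_{i}"] = max(1, round(random.lognormvariate(math.log(2000), 0.5)))

selected, threshold = select_high_mode(counts, min_transcripts=20)
print(f"threshold ~ {threshold:.0f} transcripts, {len(selected)} barcodes selected")
```

An automatic split like this is only a starting point; it is worth checking the resulting threshold against the density plot by eye, since a cut that lands inside one of the modes suggests the data are not cleanly bimodal.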