arshajii / ema

Fast & accurate alignment of barcoded short-reads
http://ema.csail.mit.edu
MIT License
32 stars 7 forks source link

Interpreting read counts #29

Closed Asutu closed 4 years ago

Asutu commented 5 years ago

Hi,

I'm testing EMA on a set of 10x reads that are stored both in two separate fastQ files (R1 and R2). The number of reads in R1 is 9,000,743 and in R2 is, of course, the same, so in total we have 9,000,743 "read pairs" (or 18,001,486 total reads).

When I run count on the fastQ files, I get the following log:

$ parallel --will-cite -j 2 'paste <(gzip -cd {} | paste - - - -) <(gzip -cd {= s:_R1_:_R2_: =} | paste - - - -) | tr "\t" "\n" | ema count -w 4M-with-alts-february-2016.txt -o {/.}' ::: *_R1_*.gz >> count.log
:: Loading 10X took 2.0 s
:: Counting took 494.5 s
:: Reads with OK barcode: 8,953,372 out of 9,000,743
:: Ignored 0 reads
:: Dumped block 1
:: Printing took 0.5 s
:: Processed 9,000,743 reads (6,350 MB uncompressed) in 497 s

For what I understand, only R1 has the barcode at the start its sequence.

I think I'm just confused with the read counts logged by EMA. Would the above log message means that EMA correctly counted the ~9M barcodes from R1? or that it only processed ~9M of reads (half of the total read count)?

Many thanks, Pedro

arshajii commented 5 years ago

Sorry for the delay. I think EMA is doing the right thing here, it's just reporting "read pairs" as reads.

You could also check the preprocessed output. Our format has barcode, first mate and second mate all on a single line, so counting the number of outputted lines should tell you the number of read pairs.