Input and workflow unclear

RNieuwenhuis commented 4 years ago

Dear author,

I would like to use Emerald to align my 10X genomics reads to a reference.

After reading the README and the issues page I am confused about the following things:

What input it needed? After 10X sequencing I obtained and R1 and R2 file. The R1 file has the uncorrected 16 bp barcode+ 7 bp spacer.

However, I did run Longranger basic on these. This produces an interleaved fastq file with corrected barcodes. These barcodes have been added to the header, if they are valid, using the BX:Z: tag.

I used sed on this interleaved fastq file, as you advised in one of the issue threads: zcat barcoded.fastq.gz | sed '1~4 s/ BX:Z:(.*)-1/:\1/' | ema count -w ~/bin/longranger-2.2.2/longranger-cs/2.2.2/tenkit/lib/python/tenkit/barcodes/4M-with-alts-february-2016.txt -o my_sample_count 2> Ema_count.log

The log for this shows the following:

:: Loading 10X took 3.2 s
:: Dumped block 1
:: Dumped block 2
:: Dumped block 3
:: Dumped block 4
:: Dumped block 5
:: Dumped block 6
:: Dumped block 7
:: Dumped block 8
:: Dumped block 9
:: Dumped block 10
:: Dumped block 11
:: Dumped block 12
:: Dumped block 13
:: Dumped block 14
:: Dumped block 15
:: Counting took 2247.3 s
:: Reads with OK barcode: 289,882 out of 306,670,720
:: Ignored 0 reads
:: Dumped block 16
:: Printing took 9.7 s
:: Processed 306,670,720 reads (198,128 MB uncompressed) in 2,260 s

That seems not OK as around 95% of the reads has a valid barcode.

This brings me to the following point..

What steps are needed to align the reads?

Are both ema count and preprocess needed if you already have a barcode corrected interleaved fastq file? If so, how should they be run?

What does RA.gz refer to in the README?

Any help would be much appreciated.

Regards,

Ronald

inumanag commented 4 years ago

Can you please provide us with a small sample of the problematic file (e.g. first 100k records) so that we can see what is going on? We never seen files with the 7bp spacer before.

RNieuwenhuis commented 4 years ago

Thanks for your reply @inumanag ,

Yes so we start out with regular FATQ files as they come from the basecalling:

R1.fastq.gz R2.fastq.gz

They are 151 bp long.

Then after running Longranger Basic for barcode whitelisting and error correction I obtain the following interleaved file:

Longranger_basic_interleaved.fastq.gz

Note that each R1 is now 128 bp long. This is because the 16 bp barcode + 7 bp spacer are stripped off. If a barcode is on the whitelist it will be added to both the fastq header e.g. BX:Z:AAGGTTCAGAGGTAGA-1 (This is one that is in the file).

After running sed '1~4 s/ BX:Z:\(.*\)-1/:\1/' on this file I get this one, that I used as input for ema.

Ema_input.fastq.gz

I hope this clears up the issue and you can provide some solution the my initial questions.

Regards,

Ronald

RNieuwenhuis commented 4 years ago

@inumanag Could you please post whether you read my reply and what your thoughts are on the matter? I would really like to try emerald.

Regards,

arshajii commented 4 years ago

Hi @RNieuwenhuis -- Are you passing the output of Long Ranger's error correction to EMA? EMA implements the error correction pipeline itself, so you can pass the raw data directly to EMA.

RNieuwenhuis commented 4 years ago

Hi @arshajii

Are you passing the output of Long Ranger's error correction to EMA? EMA implements the error correction pipeline itself, so you can pass the raw data directly to EMA.

Yes that was what I did and I see now that is also not going to work. When using the raw 10X data it is working.

Maybe the Input formats section could use some clarification that the special fastq file is generated using ema preproc.

Furthermore @inumanag wrote:

We never seen files with the 7bp spacer before.

So what happens to that part of R1 in the alignment? Is it soft-clipped?

Regards,

lauren-mak commented 3 years ago

I think I'm also seeing similar issues with EMA not recognizing most reads.

I started with the following FastQs (snippets below file-name):

mini_orig.fastq.txt


@MN00867:6:000H2J7M3:1:11102:4619:1025 BX:Z:GGTTATATTTGTGTTCTT-1
CTGATGNCCGGGCATGGAAAAAAGCGNAACTNCGCAGCCATGGCNTTACGGTCGTGGAATATGAGCAAGAT

FFFFFF#FFFFFFFFFFFFFFFFFFF#FFFF#FFFFFFFFFFFF#FFFFFFFFFFFFFFFFFFFFFFFFFF @MN00867:6:000H2J7M3:1:11102:4619:1025 BX:Z:GGTTATATTTGTGTTCTT-1 TCAGCATCGACGATACGGCCTTGCTGGGCAAATTGCGCTTTAAGACGCTGGCCAGCGACGGAATACCCAAG
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @MN00867:6:000H2J7M3:1:11102:12686:8143 BX:Z:GGTTATATTTGTGTTCTT-1 AGCGTCCTGCCGGTTTCAGGCGGTTTGCTGGGCCTTTCCGGTATTCTGGCAAGCCAGGCAGAGAGCGTTTG
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFA

mini_ndsh.fastq.txt, which was generated from the above with sed 's/ BX:Z:/:/' mini_orig.fastq | sed 's/\-1//g' - > mini_ndsh.fastq)


@MN00867:6:000H2J7M3:1:11102:4619:1025:GGTTATATTTGTGTTCTT
CTGATGNCCGGGCATGGAAAAAAGCGNAACTNCGCAGCCATGGCNTTACGGTCGTGGAATATGAGCAAGAT

FFFFFF#FFFFFFFFFFFFFFFFFFF#FFFF#FFFFFFFFFFFF#FFFFFFFFFFFFFFFFFFFFFFFFFF @MN00867:6:000H2J7M3:1:11102:4619:1025:GGTTATATTTGTGTTCTT TCAGCATCGACGATACGGCCTTGCTGGGCAAATTGCGCTTTAAGACGCTGGCCAGCGACGGAATACCCAAG
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @MN00867:6:000H2J7M3:1:11102:12686:8143:GGTTATATTTGTGTTCTT AGCGTCCTGCCGGTTTCAGGCGGTTTGCTGGGCCTTTCCGGTATTCTGGCAAGCCAGGCAGAGAGCGTTTG
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFA
test.whitelist.txt, a custom whitelist of 10x-like barcodes, contains all of the unique barcodes in the full-sized FastQ) Unfortunately, I don't have the raw BCL files and I'm not sure what 10x reagents or LongRanger commands were used to generate them, but as far as I can tell, the interleaved FastQs are fairly normal 10x outputs.

I ran the following commands:

cat mini_orig.fastq | /path/to/ema/ema count -w test.whitelist.txt -o ./mini_orig > mini_orig.log 2>&1
cat mini_ndsh.fastq | /path/to/ema/ema count -w test.whitelist.txt -o ./mini_ndsh > mini_ndsh.log 2>&1

The appropriate *.ema-*cnt files are being generated, but ema count is consistently recognizing 0 reads out of 50 paired-end reads. The following is the standard output for both of the above commands.

:: Loading 10X took 0.1 s
:: Counting took 0.0 s
:: Reads with OK barcode: 0 out of 50
:: Ignored 0 reads
:: Dumped block 1
:: Printing took 0.0 s
:: Processed 50 reads (0 MB uncompressed) in 0 s

If I expand the size of the test FastQ, it identifies 7 out of 50,000 reads. However, this is still far fewer barcodes than in whitelist. Any idea as to what's happening here? Thanks!

arshajii / ema

Input and workflow unclear #36