Open RNieuwenhuis opened 4 years ago
Can you please provide us with a small sample of the problematic file (e.g. first 100k records) so that we can see what is going on? We never seen files with the 7bp spacer before.
Thanks for your reply @inumanag ,
Yes so we start out with regular FATQ files as they come from the basecalling:
They are 151 bp long.
Then after running Longranger Basic for barcode whitelisting and error correction I obtain the following interleaved file:
Longranger_basic_interleaved.fastq.gz
Note that each R1 is now 128 bp long. This is because the 16 bp barcode + 7 bp spacer are stripped off. If a barcode is on the whitelist it will be added to both the fastq header e.g. BX:Z:AAGGTTCAGAGGTAGA-1
(This is one that is in the file).
After running sed '1~4 s/ BX:Z:\(.*\)-1/:\1/'
on this file I get this one, that I used as input for ema.
I hope this clears up the issue and you can provide some solution the my initial questions.
Regards,
Ronald
@inumanag Could you please post whether you read my reply and what your thoughts are on the matter? I would really like to try emerald.
Regards,
Hi @RNieuwenhuis -- Are you passing the output of Long Ranger's error correction to EMA? EMA implements the error correction pipeline itself, so you can pass the raw data directly to EMA.
Hi @arshajii
Are you passing the output of Long Ranger's error correction to EMA? EMA implements the error correction pipeline itself, so you can pass the raw data directly to EMA.
Yes that was what I did and I see now that is also not going to work. When using the raw 10X data it is working.
Maybe the Input formats section could use some clarification that the special fastq file is generated using ema preproc.
Furthermore @inumanag wrote:
We never seen files with the 7bp spacer before.
So what happens to that part of R1 in the alignment? Is it soft-clipped?
Regards,
I think I'm also seeing similar issues with EMA not recognizing most reads.
I started with the following FastQs (snippets below file-name):
@MN00867:6:000H2J7M3:1:11102:4619:1025 BX:Z:GGTTATATTTGTGTTCTT-1
CTGATGNCCGGGCATGGAAAAAAGCGNAACTNCGCAGCCATGGCNTTACGGTCGTGGAATATGAGCAAGAT
FFFFFF#FFFFFFFFFFFFFFFFFFF#FFFF#FFFFFFFFFFFF#FFFFFFFFFFFFFFFFFFFFFFFFFF @MN00867:6:000H2J7M3:1:11102:4619:1025 BX:Z:GGTTATATTTGTGTTCTT-1 TCAGCATCGACGATACGGCCTTGCTGGGCAAATTGCGCTTTAAGACGCTGGCCAGCGACGGAATACCCAAG
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @MN00867:6:000H2J7M3:1:11102:12686:8143 BX:Z:GGTTATATTTGTGTTCTT-1 AGCGTCCTGCCGGTTTCAGGCGGTTTGCTGGGCCTTTCCGGTATTCTGGCAAGCCAGGCAGAGAGCGTTTG
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFA
sed 's/ BX:Z:/:/' mini_orig.fastq | sed 's/\-1//g' - > mini_ndsh.fastq
)
@MN00867:6:000H2J7M3:1:11102:4619:1025:GGTTATATTTGTGTTCTT
CTGATGNCCGGGCATGGAAAAAAGCGNAACTNCGCAGCCATGGCNTTACGGTCGTGGAATATGAGCAAGAT
FFFFFF#FFFFFFFFFFFFFFFFFFF#FFFF#FFFFFFFFFFFF#FFFFFFFFFFFFFFFFFFFFFFFFFF @MN00867:6:000H2J7M3:1:11102:4619:1025:GGTTATATTTGTGTTCTT TCAGCATCGACGATACGGCCTTGCTGGGCAAATTGCGCTTTAAGACGCTGGCCAGCGACGGAATACCCAAG
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @MN00867:6:000H2J7M3:1:11102:12686:8143:GGTTATATTTGTGTTCTT AGCGTCCTGCCGGTTTCAGGCGGTTTGCTGGGCCTTTCCGGTATTCTGGCAAGCCAGGCAGAGAGCGTTTG
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFA
I ran the following commands:
cat mini_orig.fastq | /path/to/ema/ema count -w test.whitelist.txt -o ./mini_orig > mini_orig.log 2>&1
cat mini_ndsh.fastq | /path/to/ema/ema count -w test.whitelist.txt -o ./mini_ndsh > mini_ndsh.log 2>&1
The appropriate *.ema-*cnt
files are being generated, but ema count
is consistently recognizing 0 reads out of 50 paired-end reads. The following is the standard output for both of the above commands.
:: Loading 10X took 0.1 s
:: Counting took 0.0 s
:: Reads with OK barcode: 0 out of 50
:: Ignored 0 reads
:: Dumped block 1
:: Printing took 0.0 s
:: Processed 50 reads (0 MB uncompressed) in 0 s
If I expand the size of the test FastQ, it identifies 7 out of 50,000 reads. However, this is still far fewer barcodes than in whitelist. Any idea as to what's happening here? Thanks!
Dear author,
I would like to use Emerald to align my 10X genomics reads to a reference.
After reading the README and the issues page I am confused about the following things:
However, I did run Longranger basic on these. This produces an interleaved fastq file with corrected barcodes. These barcodes have been added to the header, if they are valid, using the BX:Z: tag.
I used sed on this interleaved fastq file, as you advised in one of the issue threads: zcat barcoded.fastq.gz | sed '1~4 s/ BX:Z:(.*)-1/:\1/' | ema count -w ~/bin/longranger-2.2.2/longranger-cs/2.2.2/tenkit/lib/python/tenkit/barcodes/4M-with-alts-february-2016.txt -o my_sample_count 2> Ema_count.log
The log for this shows the following:
That seems not OK as around 95% of the reads has a valid barcode.
This brings me to the following point..
Are both ema count and preprocess needed if you already have a barcode corrected interleaved fastq file? If so, how should they be run?
Any help would be much appreciated.
Regards,
Ronald