Open sjackman opened 6 years ago
Currently, we don't support that FASTQ format, but it's quite similar to our old input format so it probably wouldn't be hard to add support for it.
ema align
can't map reads with the barcode still attached, but they can be preprocessed using ema preproc
into a mappable form (barcode extracted, trimmed, grouped by barcode, etc.). Instructions for this are in the README, but let me know if you have any questions or encounter any issues.
Sounds liked sed 's/ BX:Z:/:/
would convert from Longranger basic FASTQ format to the format that EMA expects. Would you consider adding support for BX:Z
format?
preproc: preprocess barcoded FASTQ files (takes interleaved FASTQ via stdin)
-h: apply Hamming-2 correction [off]
Cool! That's useful to me. I'm curious why -h
is disabled by default. I would have expected that Hamming correction of the barcode sequence would help quite a bit. Have you not found that to be the case?
What is the output format of ema preproc
? Would you consider adding an option to output BX:Z
format?
Sure, I don't think it'd be hard to support BX:Z
; I'll look into it in a bit.
Hamming-2 correction is off by default because it affects almost no reads (something like a dozen out of a billion or so), but takes a substantial amount of time. (Long Ranger does do H2 correction by default, though, so you can turn it on if you want 100% compatibility.)
ema preproc
produces a special output format that isn't quite FASTQ; it puts everything for a read pair on a single line, which is convenient.
Ah, I think I misunderstood. The default then is Hamming-1 correction? I had incorrectly assumed that the default is no correction. Perhaps you could update the README.md to clarify which.
ema preproc produces a special output format that isn't quite FASTQ; it puts everything for a read pair on a single line, which is convenient.
That format is sometimes useful, and I have used similar formats before. Could it optionally output BX:Z
formatted interleaved FASTQ? That's the format output by longranger basic.
Yes I think that's definitely doable.
Question: Are those Long Ranger FASTQs sorted/grouped by barcode?
Yes, the outs/barcoded.fastq.gz
of longranger basic
does appear to be sorted by BX:Z
! I hadn't noticed that before! I've been running samtools sort -tBX
to sort by barcode! Hah. Thanks for making me look.
Does ema preproc
sort by barcode by default? Assuming not, can the output of ema preproc
be piped into samtools sort -tBX
?
This should work now; can you try running on your Long Ranger formatted FASTQs? You can input them with -1
/-2
(or just -1
if interleaved).
To answer your question, ema preproc
does not sort by barcode, but instead produces barcode bins small enough that ema align
can read them entirely into memory and sort them itself.
Usually, I would highly recommend that you apply our pipeline from scratch--- the EMA preprocessing is really fast right now and should take you no more than 1hr.
If you really need to use Long Ranger interleaved fastqs, you can use GNU sed to cast them to EMA-supported FASTQ (given that they are sorted by barcode):
cat interleaved-fastq | sed '1~4 s/ BX:Z:\(.*\)-1/:\1/' | ema ...
Just wanted to check back and see if you had success with this; please let us know!
I haven't got to testing EMA just yet, and likely won't until after RECOMB. I'll let you know when I do though!
This should be doable now— try adding -b
to ema preproc
to get BX:Z FASTQs. Note that this will not sort the output files by BX or anything else: you can use other sorting tools for this (e.g. system sort). We keep it this way since EMA does in-memory sort in align
stage in order to waste less time in disk I/O.
Please let me know if this works as expected. We can also add some sort function if that is really needed.
That's excellent! Thanks, Ibrahim. I'll try it out and get back to you.
ema preproc -b
works! Thanks, Ibrahim!
One minor bug. When a read does not have a barcode assigned, the first read correctly outputs no barcode, but the second read outputs BX:Z:-1
. Example:
@II_410699_410537_1_0_0_0_0:0:0_0:0:0_ba1/1
AAAAAGAACTCGGGAAACAATAAGAACCATACATGTGCGATAGAAGAGTGTTGCAGGAGTAGCTTTGGTTTTGTTCGAGAACTGGTAATGTTTGACTAATACATAATTGTACCGTGCTCAATCATATC
+
GEIGDFFEEGCFEEBBCDCEC>>CCCC@BBBBBBBCABA@ACECACAB@DAAF@@@@>?A@@@@B??<>?>>B=?A@?AA>?????>@>?B>@:@A=@><>>@>B@@@>@>?==<:=>===<=><;==
@II_410699_410537_1_0_0_0_0:0:0_0:0:0_ba1/2 BX:Z:-1
CAGCTGTTATTATCATCACCCCAGCATTACGAACATTCTCCACATCAAAGGAACTTTACGCCATCCAACCAATCGCATGGGAACTTTTATTAAATGTCTACATACATACATACATCTCGTACATAAATACGCATACGTATCTTCGTAGTAA
+
IIFIGHEGCGFFEGEEEEEFDIEDDDCCCCCCCBEDBB=DBBDCB@CBCDA?AA?B<ABEABDA@@C?@?A@B?@ABA@B<@?AB??>?>@?>C?????>A@@A><??BB>=>?>?C>?<>>>?@=@@>>>>@B=>>>@==?==>?=@==<
A usability request, if you'd consider it. I'd like to be able to run without first running ema count
, and writing to standard out rather than to files in the directory specified by -o
. A single bin of barcodes -n1
is okay in this case.
gunzip -c input.fq.gz | ema preproc -t16 -b -n1 -w 4M-with-alts-february-2016.txt | pigz >output.fq.gz
One more small request. I'd like to be able to specify the numeric suffix of the BX:Z
tag when I have multiple Chromium libraries from the same sample (needed for genomes larger than human). So for example BX:Z:GGGATGAGTGCAGAG-2
with a suffix of -2
rather than the usual -1
to indicate that it's from the second Chromium library.
Hello Sjackman, Are you sure you can run using the code you said? I did not. I met the same situation. Thanks.
Hello, After install, can I use you command and then the bam file use for ARCS? I am much appreciated for you help. Best, Fuyou
On Fri, Jun 22, 2018 at 3:15 PM, Shaun Jackman notifications@github.com wrote:
The -b option of ema preproc is not yet in a released version. You'll need to compile the master branch from GitHub. With the Homebrew https://brew.sh on macOS or Linuxbrew https://linuxbrew.sh on Linux package manager, you can use…
brew install --HEAD brewsci/bio/ema
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/arshajii/ema/issues/15#issuecomment-399584325, or mute the thread https://github.com/notifications/unsubscribe-auth/AXaRKBASqGZ_XOwTRR0k6j9FwAeNEgZYks5t_V6GgaJpZM4S7lK3 .
-- Fuyou Fu, Ph.D. Department of Botany and Plant Pathology Purdue University USA
The above ema preproc -b
extracts the barcodes from the read sequence add puts it in the BX:Z
tag of a FASTQ (not BAM) file. You can then use that unaligned FASTQ file with ARKS https://github.com/bcgsc/arks (which includes light-weight alignment) or you can align the reads with bwa mem -p -C
to create a SAM/BAM file which you can use with ARCS https://github.com/bcgsc/arcs.
Alternatively you can use EMA to produce the alignments. We're in the process of testing that, but it should work.
ema preproc -b
is faster alternative to longranger basic
.
Hi Shaun,
We can look at these maybe in 2 weeks--- both of us have no time till then. Will that work for you?
Thanks Ibrahim
Hi, Ibarhim. I'm on vacation from mid-July to early August. We can discuss it further in August.
I'm trying this feature out now! First attempt
ema: src/techs.c:8: extract_bc_10x: Assertion `bc_str != NULL' failed.
longranger basic
includes reads without barcodes, but they won't have a BX:Z
tag. EMA could either align or ignore these reads. As a hack workaround in the mean time, I'm filtering the reads to exclude reads with BX:Z
tags.
gunzip -c reads.fq.gz | paste - - - - - - - - | grep "BX:Z:" | tr '\t' '\n' \
| ema align -t16 -r draft.fa -1 /dev/stdin >reads.sam
It appears to have worked! I see BX:Z
tags in the alignments, but I don't see any MI:i
tags. Is there an option to enable MI:i
tags?
EMA outputs a standard SAM file with several additional tags: MI: cloud identifier (compatible with Long Ranger)
Hi Shaun,
Sorry for the massive delay -- getting back on this now. Can you post a line of your SAM output? MI is always printed (as far as I can tell) so I want to check what's going wrong based on the output.
Thanks
No worries, Ariya. Here's the first alignment.
@PG ID:ema PN:ema VN:0.6.1 CL:ema align -t16 -r foo.fa -1 /dev/stdin
…
HISEQX5:19:HYN5VCCXX:4:2223:19664:49338_GCAGTTACAGGTCTGC 99 contig_1 1 0 55S73M = 111 192 ATGTGCTCGCAAGAGTTGGAATGAAACCGGTTGTATTTCCAATGCCTTTATAACTCTCTCTCTAGGTGTAAATATAGGTGCAGAACCTACGGAGTCCCCAGCAGTTCCAGGTGAAGTTTATCCAGACG JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ7FJJJJJJJJJJJJJJJJJJJJFFJJJFJJJJJJ NM:i:0 BX:Z:GCAGTTACAGGTCTGC-1 XG:f:1 XC:i:149820 XF:i:0 RG:Z:rg1
FYI I ran the command given by Shaun in https://github.com/arshajii/ema/issues/15#issuecomment-417517067 on another dataset and I do see MI tags
I've tested it just now with 0.6.2, and it works! I see that before I was using 0.6.1. Perhaps the version was the cause of my trouble? In any case. Yeah! It works! Could EMA either align reads without a barcode tag or ignore them?
Does EMA support mapping reads in the Longranger basic FASTQ format, where the corrected barcode is in the BX:Z tag of the FASTQ header?
Does EMA support mapping reads whose barcodes have not yet been separated from read 1? That is, the uncorrected barcode is the first 16 nucleotides of read 1?