arshajii / ema

Fast & accurate alignment of barcoded short-reads
http://ema.csail.mit.edu
MIT License
32 stars 7 forks source link

Mapping Longranger basic FASTQ format with BX:Z tag #15

Open sjackman opened 6 years ago

sjackman commented 6 years ago

Does EMA support mapping reads in the Longranger basic FASTQ format, where the corrected barcode is in the BX:Z tag of the FASTQ header?

@E00316:145:HMVG3CCXX:5:1214:6786:3014 BX:Z:AAACACCAGACAATAC-1

Does EMA support mapping reads whose barcodes have not yet been separated from read 1? That is, the uncorrected barcode is the first 16 nucleotides of read 1?

arshajii commented 6 years ago

Currently, we don't support that FASTQ format, but it's quite similar to our old input format so it probably wouldn't be hard to add support for it.

ema align can't map reads with the barcode still attached, but they can be preprocessed using ema preproc into a mappable form (barcode extracted, trimmed, grouped by barcode, etc.). Instructions for this are in the README, but let me know if you have any questions or encounter any issues.

sjackman commented 6 years ago

Sounds liked sed 's/ BX:Z:/:/ would convert from Longranger basic FASTQ format to the format that EMA expects. Would you consider adding support for BX:Z format?

sjackman commented 6 years ago
preproc: preprocess barcoded FASTQ files (takes interleaved FASTQ via stdin)
  -h: apply Hamming-2 correction [off]

Cool! That's useful to me. I'm curious why -h is disabled by default. I would have expected that Hamming correction of the barcode sequence would help quite a bit. Have you not found that to be the case?

sjackman commented 6 years ago

What is the output format of ema preproc? Would you consider adding an option to output BX:Z format?

arshajii commented 6 years ago

Sure, I don't think it'd be hard to support BX:Z; I'll look into it in a bit.

Hamming-2 correction is off by default because it affects almost no reads (something like a dozen out of a billion or so), but takes a substantial amount of time. (Long Ranger does do H2 correction by default, though, so you can turn it on if you want 100% compatibility.)

ema preproc produces a special output format that isn't quite FASTQ; it puts everything for a read pair on a single line, which is convenient.

sjackman commented 6 years ago

Ah, I think I misunderstood. The default then is Hamming-1 correction? I had incorrectly assumed that the default is no correction. Perhaps you could update the README.md to clarify which.

sjackman commented 6 years ago

ema preproc produces a special output format that isn't quite FASTQ; it puts everything for a read pair on a single line, which is convenient.

That format is sometimes useful, and I have used similar formats before. Could it optionally output BX:Z formatted interleaved FASTQ? That's the format output by longranger basic.

arshajii commented 6 years ago

Yes I think that's definitely doable.

Question: Are those Long Ranger FASTQs sorted/grouped by barcode?

sjackman commented 6 years ago

Yes, the outs/barcoded.fastq.gz of longranger basic does appear to be sorted by BX:Z! I hadn't noticed that before! I've been running samtools sort -tBX to sort by barcode! Hah. Thanks for making me look.

sjackman commented 6 years ago

Does ema preproc sort by barcode by default? Assuming not, can the output of ema preproc be piped into samtools sort -tBX?

arshajii commented 6 years ago

This should work now; can you try running on your Long Ranger formatted FASTQs? You can input them with -1/-2 (or just -1 if interleaved).

To answer your question, ema preproc does not sort by barcode, but instead produces barcode bins small enough that ema align can read them entirely into memory and sort them itself.

inumanag commented 6 years ago

Usually, I would highly recommend that you apply our pipeline from scratch--- the EMA preprocessing is really fast right now and should take you no more than 1hr.

If you really need to use Long Ranger interleaved fastqs, you can use GNU sed to cast them to EMA-supported FASTQ (given that they are sorted by barcode):

cat interleaved-fastq | sed '1~4 s/ BX:Z:\(.*\)-1/:\1/' | ema ... 
arshajii commented 6 years ago

Just wanted to check back and see if you had success with this; please let us know!

sjackman commented 6 years ago

I haven't got to testing EMA just yet, and likely won't until after RECOMB. I'll let you know when I do though!

sjackman commented 6 years ago

I'd like to recommend using ema preproc to decode barcodes for our linked read tools, ABySS, ARCS, and Tigmint. These tools expect BX:Z FASTQ files. Do you think you may find the time to add FASTQ output format to ema preproc in the near future?

inumanag commented 6 years ago

This should be doable now— try adding -b to ema preproc to get BX:Z FASTQs. Note that this will not sort the output files by BX or anything else: you can use other sorting tools for this (e.g. system sort). We keep it this way since EMA does in-memory sort in align stage in order to waste less time in disk I/O.

Please let me know if this works as expected. We can also add some sort function if that is really needed.

sjackman commented 6 years ago

That's excellent! Thanks, Ibrahim. I'll try it out and get back to you.

sjackman commented 6 years ago

ema preproc -b works! Thanks, Ibrahim!

One minor bug. When a read does not have a barcode assigned, the first read correctly outputs no barcode, but the second read outputs BX:Z:-1. Example:

@II_410699_410537_1_0_0_0_0:0:0_0:0:0_ba1/1
AAAAAGAACTCGGGAAACAATAAGAACCATACATGTGCGATAGAAGAGTGTTGCAGGAGTAGCTTTGGTTTTGTTCGAGAACTGGTAATGTTTGACTAATACATAATTGTACCGTGCTCAATCATATC
+
GEIGDFFEEGCFEEBBCDCEC>>CCCC@BBBBBBBCABA@ACECACAB@DAAF@@@@>?A@@@@B??<>?>>B=?A@?AA>?????>@>?B>@:@A=@><>>@>B@@@>@>?==<:=>===<=><;==
@II_410699_410537_1_0_0_0_0:0:0_0:0:0_ba1/2 BX:Z:-1
CAGCTGTTATTATCATCACCCCAGCATTACGAACATTCTCCACATCAAAGGAACTTTACGCCATCCAACCAATCGCATGGGAACTTTTATTAAATGTCTACATACATACATACATCTCGTACATAAATACGCATACGTATCTTCGTAGTAA
+
IIFIGHEGCGFFEGEEEEEFDIEDDDCCCCCCCBEDBB=DBBDCB@CBCDA?AA?B<ABEABDA@@C?@?A@B?@ABA@B<@?AB??>?>@?>C?????>A@@A><??BB>=>?>?C>?<>>>?@=@@>>>>@B=>>>@==?==>?=@==<

A usability request, if you'd consider it. I'd like to be able to run without first running ema count, and writing to standard out rather than to files in the directory specified by -o. A single bin of barcodes -n1 is okay in this case.

gunzip -c input.fq.gz | ema preproc -t16 -b -n1 -w 4M-with-alts-february-2016.txt | pigz >output.fq.gz
sjackman commented 6 years ago

One more small request. I'd like to be able to specify the numeric suffix of the BX:Z tag when I have multiple Chromium libraries from the same sample (needed for genomes larger than human). So for example BX:Z:GGGATGAGTGCAGAG-2 with a suffix of -2 rather than the usual -1 to indicate that it's from the second Chromium library.

sunnycqcn commented 6 years ago

Hello Sjackman, Are you sure you can run using the code you said? I did not. I met the same situation. Thanks.

sjackman commented 6 years ago

The -b option of ema preproc is not yet in a released version. You'll need to compile the master branch from GitHub. With the Homebrew on macOS or Linuxbrew on Linux package manager, you can use…

brew install --HEAD brewsci/bio/ema
sunnycqcn commented 6 years ago

Hello, After install, can I use you command and then the bam file use for ARCS? I am much appreciated for you help. Best, Fuyou

On Fri, Jun 22, 2018 at 3:15 PM, Shaun Jackman notifications@github.com wrote:

The -b option of ema preproc is not yet in a released version. You'll need to compile the master branch from GitHub. With the Homebrew https://brew.sh on macOS or Linuxbrew https://linuxbrew.sh on Linux package manager, you can use…

brew install --HEAD brewsci/bio/ema

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/arshajii/ema/issues/15#issuecomment-399584325, or mute the thread https://github.com/notifications/unsubscribe-auth/AXaRKBASqGZ_XOwTRR0k6j9FwAeNEgZYks5t_V6GgaJpZM4S7lK3 .

-- Fuyou Fu, Ph.D. Department of Botany and Plant Pathology Purdue University USA

sjackman commented 6 years ago

The above ema preproc -b extracts the barcodes from the read sequence add puts it in the BX:Z tag of a FASTQ (not BAM) file. You can then use that unaligned FASTQ file with ARKS https://github.com/bcgsc/arks (which includes light-weight alignment) or you can align the reads with bwa mem -p -C to create a SAM/BAM file which you can use with ARCS https://github.com/bcgsc/arcs.

Alternatively you can use EMA to produce the alignments. We're in the process of testing that, but it should work.

sjackman commented 6 years ago

ema preproc -b is faster alternative to longranger basic.

inumanag commented 6 years ago

Hi Shaun,

We can look at these maybe in 2 weeks--- both of us have no time till then. Will that work for you?

Thanks Ibrahim

sjackman commented 6 years ago

Hi, Ibarhim. I'm on vacation from mid-July to early August. We can discuss it further in August.

sjackman commented 6 years ago

I'm trying this feature out now! First attempt

ema: src/techs.c:8: extract_bc_10x: Assertion `bc_str != NULL' failed.

longranger basic includes reads without barcodes, but they won't have a BX:Z tag. EMA could either align or ignore these reads. As a hack workaround in the mean time, I'm filtering the reads to exclude reads with BX:Z tags.

gunzip -c reads.fq.gz | paste - - - - - - - - | grep "BX:Z:" | tr '\t' '\n' \
| ema align -t16 -r draft.fa -1 /dev/stdin >reads.sam
sjackman commented 6 years ago

It appears to have worked! I see BX:Z tags in the alignments, but I don't see any MI:i tags. Is there an option to enable MI:i tags?

EMA outputs a standard SAM file with several additional tags: MI: cloud identifier (compatible with Long Ranger)

https://github.com/arshajii/ema#output

arshajii commented 5 years ago

Hi Shaun,

Sorry for the massive delay -- getting back on this now. Can you post a line of your SAM output? MI is always printed (as far as I can tell) so I want to check what's going wrong based on the output.

Thanks

sjackman commented 5 years ago

No worries, Ariya. Here's the first alignment.

@PG     ID:ema  PN:ema  VN:0.6.1        CL:ema align -t16 -r foo.fa -1 /dev/stdin
…
HISEQX5:19:HYN5VCCXX:4:2223:19664:49338_GCAGTTACAGGTCTGC    99  contig_1    1   0   55S73M  =   111 192 ATGTGCTCGCAAGAGTTGGAATGAAACCGGTTGTATTTCCAATGCCTTTATAACTCTCTCTCTAGGTGTAAATATAGGTGCAGAACCTACGGAGTCCCCAGCAGTTCCAGGTGAAGTTTATCCAGACG    JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ7FJJJJJJJJJJJJJJJJJJJJFFJJJFJJJJJJ    NM:i:0  BX:Z:GCAGTTACAGGTCTGC-1 XG:f:1  XC:i:149820 XF:i:0  RG:Z:rg1
rchikhi commented 5 years ago

FYI I ran the command given by Shaun in https://github.com/arshajii/ema/issues/15#issuecomment-417517067 on another dataset and I do see MI tags

sjackman commented 5 years ago

I've tested it just now with 0.6.2, and it works! I see that before I was using 0.6.1. Perhaps the version was the cause of my trouble? In any case. Yeah! It works! Could EMA either align reads without a barcode tag or ignore them?