Read group in ema_wrapper.sh

vladsavelyev commented 6 years ago

This line does not work for me: https://github.com/arshajii/ema/blob/bce69b44a643ba12671973bfb0f503da36d4d79f/util/ema_wrapper.sh#L77

It gets translated into the following one and BWA fails with an error:

+ bwa mem -t 2 -M -R '@RG'\''\t'\''ID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0'\''\t'\''SM:NA12878_WGS_v2' /data/cephfs/punim0010/extras/10x/refdata-GRCh37/fasta/genome.fa bucket_no_bc/subset_L001_R1.no_bc.fastq bucket_no_bc/subset_L001_R2.no_bc.fastq
[E::bwa_set_rg] no ID within the read group line

However, I managed to fix it for me by removing the single quotes in the last bit of the replacement statement:

bwa mem -t "$t" -M -R "${R//$'\t'/\t}" "$r" bucket_no_bc/*1.no_bc.fastq bucket_no_bc/*2.no_bc.fastq > "bucket_no_bc/$OUTSAM"

(see "${R//$'\t'/\t}" instead of "${R//$'\t'/'\t'}")

I'm wondering if that might be bash version specific or something? Also, is it possible to improve the usage of the pipeline by removing the requirements to pass the read group as an argument explicitly?

vladsavelyev commented 6 years ago

However after removing the single quotes, getting error with Picard:

Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing SAM header. Problem parsing @PG key:value pair ID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0 clashes with ID:ema. Line:
@PG     ID:ema  PN:ema  VN:0.1.0        CL:/home/vlad/bin/ema align -1 bucket000/subset_L001_R1.preproc.fastq -2 bucket000/subset_L001_R2.preproc.fastq -r /data/cephfs/punim0010/extras/10x/refdata-GRCh37/fasta/genome.fa -o bucket000/ema_out.sam -R @RG     ID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0    SM:NA12878_WGS_v2; File /data/cephfs/punim0010/extras/vlad/synced/umccr/10x/ema/
10k/bucket000/ema_out.sorted.bam; Line number 87

vladsavelyev commented 6 years ago

Though that CL: bit is added with ema align - which is before the bwa mem line, so does not come directly from me removing the quotes. However might be related to the same original issue.

Should that CL: part of @PG line be quoted/escaped?

arshajii commented 6 years ago

I think the easiest fix here is to just take the read group as an escaped string as BWA does. Just implemented that (i.e. both EMA and the wrapper script are updated to reflect this). Does this fix your issue? Specifically, you can now pass the read group as:

-R '@RG\tID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0\tSM:NA12878_WGS_v2'

vladsavelyev commented 6 years ago

This error now, unfortunately:

[E::bwa_set_rg] the read group line contained literal <tab> characters -- replace with escaped tabs: \t

I managed to bypass this only by adding a step that just removes that CL: part for the line.

arshajii commented 6 years ago

Hmm, strange. Could you please send me the exact command line you're using?

vladsavelyev commented 6 years ago

EMAPATH=/home/vlad/bin/ema \
PICARDPATH=/data/cephfs/punim0010/extras/10x/miniconda/envs/10x/share/picard-2.17.11-0/picard.jar \
/data/cephfs/punim0010/extras/vlad/synced/umccr/10x/ema/ema_wrapper.sh \
-r /data/cephfs/punim0010/extras/10x/refdata-GRCh37/fasta/genome.fa \
-R $'@RG\tID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0\tSM:NA12878_WGS_v2' \
-t $THREADS

arshajii commented 6 years ago

Ah, try getting rid of that $. Specifically:

EMAPATH=/home/vlad/bin/ema \
PICARDPATH=/data/cephfs/punim0010/extras/10x/miniconda/envs/10x/share/picard-2.17.11-0/picard.jar \
/data/cephfs/punim0010/extras/vlad/synced/umccr/10x/ema/ema_wrapper.sh \
-r /data/cephfs/punim0010/extras/10x/refdata-GRCh37/fasta/genome.fa \
-R '@RG\tID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0\tSM:NA12878_WGS_v2' \
-t $THREADS

This is basically what I was referring to above, since I just updated EMA to accept escaped read group strings (like BWA). Only difference is that BWA explicitly forbids literal tabs in the read group strings.

vladsavelyev commented 6 years ago

Oh nice, it works now!

vladsavelyev commented 6 years ago

Fantastic. Might also worth fixing the ema_wrapper command line in the notebook to avoid confusion: https://github.com/arshajii/ema-paper-data/blob/master/experiments.ipynb Closing the issue!

arshajii commented 6 years ago

Thanks! Fixed it there as well.

arshajii / ema

Read group in ema_wrapper.sh #6