Closed vladsavelyev closed 6 years ago
However after removing the single quotes, getting error with Picard:
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing SAM header. Problem parsing @PG key:value pair ID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0 clashes with ID:ema. Line:
@PG ID:ema PN:ema VN:0.1.0 CL:/home/vlad/bin/ema align -1 bucket000/subset_L001_R1.preproc.fastq -2 bucket000/subset_L001_R2.preproc.fastq -r /data/cephfs/punim0010/extras/10x/refdata-GRCh37/fasta/genome.fa -o bucket000/ema_out.sam -R @RG ID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0 SM:NA12878_WGS_v2; File /data/cephfs/punim0010/extras/vlad/synced/umccr/10x/ema/
10k/bucket000/ema_out.sorted.bam; Line number 87
Though that CL:
bit is added with ema align
- which is before the bwa mem
line, so does not come directly from me removing the quotes. However might be related to the same original issue.
Should that CL:
part of @PG
line be quoted/escaped?
I think the easiest fix here is to just take the read group as an escaped string as BWA does. Just implemented that (i.e. both EMA and the wrapper script are updated to reflect this). Does this fix your issue? Specifically, you can now pass the read group as:
-R '@RG\tID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0\tSM:NA12878_WGS_v2'
This error now, unfortunately:
[E::bwa_set_rg] the read group line contained literal <tab> characters -- replace with escaped tabs: \t
I managed to bypass this only by adding a step that just removes that CL:
part for the line.
Hmm, strange. Could you please send me the exact command line you're using?
EMAPATH=/home/vlad/bin/ema \
PICARDPATH=/data/cephfs/punim0010/extras/10x/miniconda/envs/10x/share/picard-2.17.11-0/picard.jar \
/data/cephfs/punim0010/extras/vlad/synced/umccr/10x/ema/ema_wrapper.sh \
-r /data/cephfs/punim0010/extras/10x/refdata-GRCh37/fasta/genome.fa \
-R $'@RG\tID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0\tSM:NA12878_WGS_v2' \
-t $THREADS
Ah, try getting rid of that $
. Specifically:
EMAPATH=/home/vlad/bin/ema \
PICARDPATH=/data/cephfs/punim0010/extras/10x/miniconda/envs/10x/share/picard-2.17.11-0/picard.jar \
/data/cephfs/punim0010/extras/vlad/synced/umccr/10x/ema/ema_wrapper.sh \
-r /data/cephfs/punim0010/extras/10x/refdata-GRCh37/fasta/genome.fa \
-R '@RG\tID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0\tSM:NA12878_WGS_v2' \
-t $THREADS
This is basically what I was referring to above, since I just updated EMA to accept escaped read group strings (like BWA). Only difference is that BWA explicitly forbids literal tabs in the read group strings.
Oh nice, it works now!
Fantastic. Might also worth fixing the ema_wrapper command line in the notebook to avoid confusion: https://github.com/arshajii/ema-paper-data/blob/master/experiments.ipynb Closing the issue!
Thanks! Fixed it there as well.
This line does not work for me: https://github.com/arshajii/ema/blob/bce69b44a643ba12671973bfb0f503da36d4d79f/util/ema_wrapper.sh#L77
It gets translated into the following one and BWA fails with an error:
However, I managed to fix it for me by removing the single quotes in the last bit of the replacement statement:
(see
"${R//$'\t'/\t}"
instead of"${R//$'\t'/'\t'}"
)I'm wondering if that might be bash version specific or something? Also, is it possible to improve the usage of the pipeline by removing the requirements to pass the read group as an argument explicitly?