ema align output SAM parsing error

pontushojer commented 4 years ago

I have been running ema (version 0.6.2) on reads in the longranger basic FASTQ format (BX:Z in header). I pipe the output directly to samtools sort. My command looks something like this.

ema align -1 <(pigz -cd in.1.fastq.gz) -2 <(pigz -cd in.2.fastq.gz) -r genome.fa -t 20 -p 10x 2> mapping.log | samtools sort - -@ 20 -o mapping.bam 2> sorting.log

Mostly this have been working fine but every other run have failed because samtools sort gets a parsing error

Either this one

[E::sam_parse1] unrecognized CIGAR operator

Or this one

[E::sam_parse1] CIGAR and query sequence are of different length

Both are related to the CIGAR so I think there is some formatting error here that creeps in every now and then. I looked SAM entries that caused the error in the one of the runs and found some strange things. Below are four lines, of which the third (marked in bold) is causing the error.

ST-E00266:342:HYW32CCXY:1:1212:14001:52854:AAAAAAAAAAAAAAAAAATG 83  chr1    146535682   3   66M =   146535535   -213    GATTCTGGTGACCTCACTGTAGGGAGACAGGAGCCTCTGGCAAGTTACCAAACTGCCAGCCTATTC  JFJJJFJJJJFFFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJFJJJFFJ  NM:i:0  BX:Z:AAAAAAAAAAAAAAAA-1 XG:f:0.5    MI:i:-1470287527    XF:i:0  RG:Z:4  XA:Z:chr1,+148197632,66M,0;
ST-E00266:342:HYW32CCXY:1:1212:14001:52854:AAAAAAAAAAAAAAAAAATG 163 chr1    146535535   3   151M    =   146535682   213 TTTTATTCCCATATTCAATTTTCTTCTTTTCACTTGGAGAAAACCTCTTAGGCTAAAGACCATTGGAAGTGGCCTCCATGTTTAACTCACTGGCCCCTAATTTCTGCTGGACCTTCATTTCAATCGACTGTGTTTCAGGAGGGAAGCGATT AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJ NM:i:0  BX:Z:AAAAAAAAAAAAAAAA-1 XG:f:0.5    MI:i:-1470287527    XF:i:0  RG:Z:4  XA:Z:chr1,-148197694,151M,0;
ST-E00266:342:HYW32CCXY:1:1211:12439:69625:TTTGTTCCCTAAGTAACACG  89  chr2    158084867   48  4214864^@   *   0   0   GATTCTGGTGACCTCACTGTAGGGAGACAGGAGCCTCTGGCAAGTTACCAAACTGCCAGCCTATTC  JFJJJFJJJJFFFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJFJJJFFJ  NM:i:4  BX:Z:AAAAAAAAAAAAAAAA-1 XG:f:1  MI:i:-1470287525    XF:i:0  RG:Z:4
ST-E00266:342:HYW32CCXY:1:1211:12439:69625:TTTGTTCCCTAAGTAACACG 165 *   0   0   *   chr2    158084867   0   TTTTATTCCCATATTCAATTTTCTTCTTTTCACTTGGAGAAAACCTCTTAGGCTAAAGACCATTGGAAGTGGCCTCCATGTTTAACTCACTGGCCCCTAATTTCTGCTGGACCTTCATTTCAATCGACTGTGTTTCAGGAGGGAAGCGATT AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJ BX:Z:AAAAAAAAAAAAAAAA-1 RG:Z:4

As you can see this SAM entry is just plain strange. It aslo contains a ^@ character for some reason. Interestingly the SEQ and QUAL strings for this entry does not below to the read with the QNAME ST-E00266:342:HYW32CCXY:1:1211:12439:69625:TTTGTTCCCTAAGTAACACG instead it belongs to the first entry in my example named ST-E00266:342:HYW32CCXY:1:1212:14001:52854:AAAAAAAAAAAAAAAAAATG. This is also the read-pair just before the one causing the error in my FASTQ.

I have tried to replicate the error on a smaller subset of my data but have so far been unsuccessful. For example if i take the FASTQ entries corresponding to the failed SAM entries shown above I don't get any error. So somehow this only happens when running the full dataset.

Do you have any idea what could be causing this?

arshajii commented 4 years ago

Hi @pontushojer, Sorry for the delay in responding. This is quite strange, and looks like a bug that only comes up when processing a large set of reads.

Couple quick questions:

Are you running a pre-built version of EMA (e.g. installed from brew/conda)? Or are you using a version built from source?
How big is the dataset that causes this issue? For debugging purposes, it would be very helpful if we could produce some subset of it that still causes this error. (Or if it's not too big, any chance you could share it so we can use it in debugging?)

pontushojer commented 4 years ago

Hi @arshajii,

No worries!

Are you running a pre-built version of EMA (e.g. installed from brew/conda)? Or are you using a version built from source?

I am running a pre-built version from conda, version 0.6.2 build h8b12597_1.

How big is the dataset that causes this issue? For debugging purposes, it would be very helpful if we could produce some subset of it that still causes this error. (Or if it's not too big, any chance you could share it so we can use it in debugging?)

The datasets have been about 400-500 M read-pairs, so far I have had issues on about three of my dataset.

I have so far been unable to generate a smaller dataset to replicate the issue. If I extract the read-pairs for barcodes surrounding the entry that causes the error in the full dataset, it completes without error. I will continue to try and generate a subset, as you say it would help narrowing this down.

I can check about sending a full dataset...

pontushojer commented 4 years ago

@arshajii I have now managed to generate a smaller subset that can recreate the issue.

Running the following:

ema align -1 <(pigz -cd failing.fastq.gz) -R '@RG\tID:1\tSM:20\tPU:unit1\tPL:ILLUMINA' -r genome.fa -t 4 -p 10x 2> mapping.log | samtools sort - -@ 4 -o out.bam -O BAM -l 0 2> sorting.log

outputs this to the sorting.log:

[E::sam_parse1] CIGAR and query sequence are of different length
samtools sort: truncated file. Aborting

If I skip the pipe to samtools sort and look at the unsorted file from ema I find two faulty entries:

A00187:292:H7G2JDSXY:3:2674:9516:18662:TTTTTTTGTAAGGAACTGAA 73  chrX    147384013   60  8006656M    *   0   0   ATAAAATTAAAAAAAAAAAAAAAAAAAAAAAAATATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  F:FF,F:F,FFF,:FFFFF:FFFFFFF,FFFFFFF,FFFFFFFFFFFFFFF,FFFFFF,,FFFFFF  NM:i:0  BX:Z:AAAAAAAAAAAAAAAA-1 XG:f:1  MI:i:1620035    XF:i:0  RG:Z:1
...
A00187:292:H7G2JDSXY:3:1271:24126:15405:AAAAAAAAAAAAATAAAAAA    73  chr3    26999354    22  46139657M691D46137351c691D  *   0   0   CCCCCTCATTGTCCTTGTCTATTACATTTTTATTTTTATATTATAATAGCTTATGGTATGTAAT    FF:F::FF:F,:FF:F:FF,FFFFFFF:,FF,FFFF:FFFFFFFFFFFF,,FFFF:FF::FF,F    NM:i:4  BX:Z:AAAAAAAAAAAAAAAA-1 XG:f:1  MI:i:1620051    XF:i:0  RG:Z:1

As you see the cigars are 8006656M and 46139657M691D46137351c691D respectively which are both wrong. Also they have the input barcodes of TTTTTTTGTAAGGAACTGAA and AAAAAAAAAAAAATAAAAAA (found in read name) but the tagged barcode is BX:Z:AAAAAAAAAAAAAAAA-1 for both.

Hope this helps to locate the issue!

Subset: failing.fastq.gz

pontushojer commented 3 years ago

Hi @arshajii.

I was wondering if you have had the opportunity too look into this issue after I posted the subset?

arshajii / ema

ema align output SAM parsing error #39