Open pontushojer opened 4 years ago
Hi @pontushojer, Sorry for the delay in responding. This is quite strange, and looks like a bug that only comes up when processing a large set of reads.
Couple quick questions:
Hi @arshajii,
No worries!
- Are you running a pre-built version of EMA (e.g. installed from brew/conda)? Or are you using a version built from source?
I am running a pre-built version from conda, version 0.6.2 build h8b12597_1.
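It was installed with something along these lines (a sketch; the exact channel setup may differ):

```bash
# sketch of the install; version pinned to match the build reported above
conda install -c bioconda ema=0.6.2
```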
- How big is the dataset that causes this issue? For debugging purposes, it would be very helpful if we could produce some subset of it that still causes this error. (Or if it's not too big, any chance you could share it so we can use it in debugging?)
The datasets have been about 400-500 M read pairs; so far I have had issues with about three of my datasets.
I have so far been unable to generate a smaller dataset that replicates the issue. If I extract the read pairs for barcodes surrounding the entry that causes the error in the full dataset, it completes without error. I will continue to try to generate a subset; as you say, it would help narrow this down.
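For reference, that extraction is roughly along these lines (a sketch with placeholder file names, assuming an interleaved gzipped FASTQ with the barcode in a BX:Z: header tag and a barcodes.txt listing the barcodes of interest, one per line):

```bash
# Rough sketch (placeholder names): keep only read pairs whose header carries
# one of the barcodes in barcodes.txt. paste folds each interleaved pair
# (8 FASTQ lines) onto one tab-separated row so grep sees the whole pair at once.
pigz -cd full.fastq.gz |
  paste - - - - - - - - |
  grep -F -f <(sed 's/^/BX:Z:/' barcodes.txt) |
  tr '\t' '\n' |
  pigz > subset.fastq.gz
```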
I can check about sending a full dataset...
@arshajii I have now managed to generate a smaller subset that can recreate the issue.
Running the following:
ema align -1 <(pigz -cd failing.fastq.gz) -R '@RG\tID:1\tSM:20\tPU:unit1\tPL:ILLUMINA' -r genome.fa -t 4 -p 10x 2> mapping.log | samtools sort - -@ 4 -o out.bam -O BAM -l 0 2> sorting.log
outputs this to sorting.log:
[E::sam_parse1] CIGAR and query sequence are of different length
samtools sort: truncated file. Aborting
If I skip the pipe to samtools sort and look at the unsorted file from ema, I find two faulty entries:
A00187:292:H7G2JDSXY:3:2674:9516:18662:TTTTTTTGTAAGGAACTGAA 73 chrX 147384013 60 8006656M * 0 0 ATAAAATTAAAAAAAAAAAAAAAAAAAAAAAAATATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA F:FF,F:F,FFF,:FFFFF:FFFFFFF,FFFFFFF,FFFFFFFFFFFFFFF,FFFFFF,,FFFFFF NM:i:0 BX:Z:AAAAAAAAAAAAAAAA-1 XG:f:1 MI:i:1620035 XF:i:0 RG:Z:1
...
A00187:292:H7G2JDSXY:3:1271:24126:15405:AAAAAAAAAAAAATAAAAAA 73 chr3 26999354 22 46139657M691D46137351c691D * 0 0 CCCCCTCATTGTCCTTGTCTATTACATTTTTATTTTTATATTATAATAGCTTATGGTATGTAAT FF:F::FF:F,:FF:F:FF,FFFFFFF:,FF,FFFF:FFFFFFFFFFFF,,FFFF:FF::FF,F NM:i:4 BX:Z:AAAAAAAAAAAAAAAA-1 XG:f:1 MI:i:1620051 XF:i:0 RG:Z:1
As you can see, the CIGARs are 8006656M and 46139657M691D46137351c691D respectively, which are both wrong. They also have the input barcodes TTTTTTTGTAAGGAACTGAA and AAAAAAAAAAAAATAAAAAA (found in the read names), but the tagged barcode is BX:Z:AAAAAAAAAAAAAAAA-1 for both.
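For reference, the unsorted file is just the same ema command with the samtools step dropped, and records like these can be flagged with a crude consistency check: in a valid SAM record the M/I/S/=/X operations of the CIGAR have to add up to the length of SEQ, which is what the samtools error above is complaining about. A rough sketch (the unsorted.sam name is a placeholder):

```bash
# Write ema's output to a plain SAM file instead of piping it into samtools sort
ema align -1 <(pigz -cd failing.fastq.gz) \
    -R '@RG\tID:1\tSM:20\tPU:unit1\tPL:ILLUMINA' \
    -r genome.fa -t 4 -p 10x 2> mapping.log > unsorted.sam

# Report records whose CIGAR does not add up to the SEQ length; anything the
# parse loop cannot consume (e.g. the stray lowercase 'c') is reported as well
awk -F'\t' '
  /^@/ || $6 == "*" { next }                  # skip header lines and unmapped reads
  {
    qlen = 0; cig = $6
    while (match(cig, /^[0-9]+[MIDNSHP=X]/)) {
      op  = substr(cig, RLENGTH, 1)
      len = substr(cig, 1, RLENGTH - 1) + 0
      if (op ~ /[MIS=X]/) qlen += len         # only these ops consume query bases
      cig = substr(cig, RLENGTH + 1)
    }
    if (cig != "" || qlen != length($10))
      print NR ": " $1 "  CIGAR=" $6 "  implies " qlen " bases, SEQ has " length($10)
  }' unsorted.sam
```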
Hope this helps to locate the issue!
Subset: failing.fastq.gz
Hi @arshajii.
I was wondering if you have had the opportunity to look into this issue after I posted the subset?
I have been running ema (version 0.6.2) on reads in the longranger basic FASTQ format (BX:Z in header). I pipe the output directly to samtools sort. My command looks something like this:
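(a sketch with placeholder file names; the same ema-to-samtools pipeline shown elsewhere in this thread)

```bash
# Placeholder file names; flags as in the runs described in this thread
ema align -1 <(pigz -cd reads.fastq.gz) \
    -R '@RG\tID:1\tSM:20\tPU:unit1\tPL:ILLUMINA' \
    -r genome.fa -t 4 -p 10x 2> mapping.log |
  samtools sort - -@ 4 -o out.bam -O BAM -l 0 2> sorting.log
```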
Mostly this has been working fine, but every other run has failed because samtools sort gets a parsing error. Either this one:

Or this one:
Both are related to the CIGAR, so I think there is some formatting error that creeps in every now and then. I looked at the SAM entries that caused the error in one of the runs and found some strange things. Below are four lines, of which the third (marked in bold) is causing the error.
As you can see, this SAM entry is just plain strange. It also contains a ^@ character for some reason. Interestingly, the SEQ and QUAL strings for this entry do not belong to the read with the QNAME ST-E00266:342:HYW32CCXY:1:1211:12439:69625:TTTGTTCCCTAAGTAACACG; instead they belong to the first entry in my example, named ST-E00266:342:HYW32CCXY:1:1212:14001:52854:AAAAAAAAAAAAAAAAAATG. This is also the read pair just before the one causing the error in my FASTQ.

I have tried to replicate the error on a smaller subset of my data but have so far been unsuccessful. For example, if I take the FASTQ entries corresponding to the failed SAM entries shown above, I don't get any error. So somehow this only happens when running the full dataset.
Do you have any idea what could be causing this?