I have encountered a problem with bwameth that pops up when the FASTQ comment contains a read group. In this case, bwameth only outputs the SAM header without any reads.
running: /home/oender/anaconda3/envs/population-epigenetics/bin/python /home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py c2t data/test/test-line_A-R1.classified.qc.fastq data/test/test-line_A-R2.classified.qc.fastq |bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -R '@RG\tID:test-line_A-R.classified.qc\tSM:test-line_A-R.classified.qc' -t 4 ../data/arabidopsis_thaliana/genome_assembly/TAIR10.fasta.bwameth.c2t -
converting reads in data/test/test-line_A-R1.classified.qc.fastq,data/test/test-line_A-R2.classified.qc.fastq
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 320080 sequences (40000212 bp)...
[M::process] 0 single-end sequences; 320080 paired-end sequences
WARNING: 1709 reads with length < 80
: this program is designed for long reads
[M::process] read 121626 sequences (15199052 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 97487, 4, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 215, 277)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 493)
[M::mem_pestat] mean and std.dev: (227.57, 79.20)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 601)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 320080 reads in 245.362 CPU sec, 61.338 real sec
['NM:i:21', 'MD:Z:40^GGAATTGTTGATTTGGATTT80G5', 'MC:Z:126M', 'AS:i:97', 'XS:i:83', 'RG:Z:test-line_A-R.classified.qc', 'XA:Z:f3,+14193782,40S86M,1;f3,+14204191,40S86M,1;', 'RG:Z:CB0L6ANXX:1:ATTCCT YS:Z:TTTGGATTTGGAATTGTTGAGAAAAGTTTATCGGGTTTGAGGAATTGTTGAGAAAAGTTTATTGGGTTTGAGGATTTGTTGATTAGGAGTGGAAATTGTTGAGAAAAATTTATTGGGTTTTAGGAA', 'YC:Z:CT']
700523F:121:CB0L6ANXX:1:1103:2712:2482
Traceback (most recent call last):
File "/home/oender/anaconda3/envs/population-epigenetics/bin/bwameth.py", line 4, in <module>
__import__('pkg_resources').run_script('bwameth==0.2.2', 'bwameth.py')
File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/pkg_resources/__init__.py", line 664, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1444, in run_script
exec(code, namespace, namespace)
File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 509, in <module>
main(sys.argv[1:])
File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 506, in main
set_as_failed=args.set_as_failed)
File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 331, in bwa_mem
as_bam(cmd, fa, set_as_failed)
File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 353, in as_bam
for aln in handle_reads(pair_list, set_as_failed):
File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 376, in handle_reads
orig_seq = aln.original_seq
File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 284, in original_seq
return next(x for x in self.other if x.startswith("YS:Z:"))[5:]
StopIteration
[M::process] 0 single-end sequences; 121626 paired-end sequences
As you can see, RG:Z:CB0L6ANXX:1:ATTCCT is the RG that was part of the FASTQ input:
I think it is a bug that bwameth adds RG:Z:test-line_A-R.classified.qc although I did not supply any read group parameter and actually want to pass through the RGs in the FASTQs. Indeed, when I run the command
(i.e., explicitly removing -R '...') everything works, although the SAM has to be converted back.
Suggestion
As I see it, the problem arises because of the way in which the read group argument is handled. Probably, you can leave the function bwa_mem as it is but change how it is called. It is not quite clear but I guess in the call of bwa_mem,
rg=args.read_group or rname(*args.fastqs)
causes the trouble if I do not supply a read group parameter on the command line. Or you have to disentangle the addition of RG to the header from RGs for individual reads.
thanks for the careful description. I don't intend to fix, but will accept a PR that does. I think that would also require manually setting the addition of the RG to the header.
Problem
I have encountered a problem with bwameth that pops up when the FASTQ comment contains a read group. In this case, bwameth only outputs the SAM header without any reads.
Details
This is the command I run:
The stdout/stderr output is here:
As you can see,
RG:Z:CB0L6ANXX:1:ATTCCT
is the RG that was part of the FASTQ input:I think it is a bug that bwameth adds
RG:Z:test-line_A-R.classified.qc
although I did not supply any read group parameter and actually want to pass through the RGs in the FASTQs. Indeed, when I run the command(i.e., explicitly removing -R '...') everything works, although the SAM has to be converted back.
Suggestion
As I see it, the problem arises because of the way in which the read group argument is handled. Probably, you can leave the function
bwa_mem
as it is but change how it is called. It is not quite clear but I guess in the call of bwa_mem,causes the trouble if I do not supply a read group parameter on the command line. Or you have to disentangle the addition of RG to the header from RGs for individual reads.