Do not add RG by default

okartal commented 5 years ago

Problem

bwameth fails when the FASTQ comment contains a read group: it writes only the SAM header and no reads.

Details

This is the command I run:

> bwameth.py --reference ../data/arabidopsis_thaliana/genome_assembly/TAIR10.fasta -t 4 data/test/test-line_A-R1.classified.qc.fastq data/test/test-line_A-R2.classified.qc.fastq > data/test/test-line_A.mapped.sam

The stdout/stderr output is here:

running: /home/oender/anaconda3/envs/population-epigenetics/bin/python /home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py c2t data/test/test-line_A-R1.classified.qc.fastq data/test/test-line_A-R2.classified.qc.fastq |bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -R '@RG\tID:test-line_A-R.classified.qc\tSM:test-line_A-R.classified.qc' -t 4  ../data/arabidopsis_thaliana/genome_assembly/TAIR10.fasta.bwameth.c2t -
converting reads in data/test/test-line_A-R1.classified.qc.fastq,data/test/test-line_A-R2.classified.qc.fastq
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 320080 sequences (40000212 bp)...
[M::process] 0 single-end sequences; 320080 paired-end sequences
WARNING: 1709 reads with length < 80
       : this program is designed for long reads
[M::process] read 121626 sequences (15199052 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 97487, 4, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 215, 277)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 493)
[M::mem_pestat] mean and std.dev: (227.57, 79.20)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 601)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 320080 reads in 245.362 CPU sec, 61.338 real sec

['NM:i:21', 'MD:Z:40^GGAATTGTTGATTTGGATTT80G5', 'MC:Z:126M', 'AS:i:97', 'XS:i:83', 'RG:Z:test-line_A-R.classified.qc', 'XA:Z:f3,+14193782,40S86M,1;f3,+14204191,40S86M,1;', 'RG:Z:CB0L6ANXX:1:ATTCCT YS:Z:TTTGGATTTGGAATTGTTGAGAAAAGTTTATCGGGTTTGAGGAATTGTTGAGAAAAGTTTATTGGGTTTGAGGATTTGTTGATTAGGAGTGGAAATTGTTGAGAAAAATTTATTGGGTTTTAGGAA', 'YC:Z:CT']
700523F:121:CB0L6ANXX:1:1103:2712:2482
Traceback (most recent call last):
  File "/home/oender/anaconda3/envs/population-epigenetics/bin/bwameth.py", line 4, in <module>
    __import__('pkg_resources').run_script('bwameth==0.2.2', 'bwameth.py')
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/pkg_resources/__init__.py", line 664, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1444, in run_script
    exec(code, namespace, namespace)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 509, in <module>
    main(sys.argv[1:])
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 506, in main
    set_as_failed=args.set_as_failed)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 331, in bwa_mem
    as_bam(cmd, fa, set_as_failed)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 353, in as_bam
    for aln in handle_reads(pair_list, set_as_failed):
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 376, in handle_reads
    orig_seq = aln.original_seq
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 284, in original_seq
    return next(x for x in self.other if x.startswith("YS:Z:"))[5:]
StopIteration
[M::process] 0 single-end sequences; 121626 paired-end sequences

As you can see, RG:Z:CB0L6ANXX:1:ATTCCT is the RG that was part of the FASTQ input:

> head -n1 data/test/test-line_A-R{1,2}.classified.qc.fastq
==> data/test/test-line_A-R1.classified.qc.fastq <==
@700523F:121:CB0L6ANXX:1:1103:2712:2482 RG:Z:CB0L6ANXX:1:ATTCCT

==> data/test/test-line_A-R2.classified.qc.fastq <==
@700523F:121:CB0L6ANXX:1:1103:2712:2482 RG:Z:CB0L6ANXX:1:ATTCCT

I think it is a bug that bwameth adds RG:Z:test-line_A-R.classified.qc although I did not supply any read group parameter and actually want to pass through the RGs in the FASTQs. Indeed, when I run the command

> bwameth.py c2t data/test/test-line_A-R1.classified.qc.fastq data/test/test-line_A-R2.classified.qc.fastq |bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -t 4  ../data/arabidopsis_thaliana/genome_assembly/TAIR10.fasta.bwameth.c2t -

(i.e., explicitly removing -R '...') everything works, although the SAM has to be converted back.
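
The debug line printed just before the traceback hints at where the lookup breaks: the FASTQ comment and the YS:Z tag arrive as a single list element, so no element starts with "YS:Z:". Below is a minimal sketch of that lookup on a shortened copy of the tag list (the YS:Z value is truncated for readability). original_seq_tolerant is a hypothetical alternative written for illustration only, not part of bwameth.

```python
# Tag list taken from the debug output above; the YS:Z value is shortened.
other = [
    "NM:i:21",
    "AS:i:97",
    "RG:Z:test-line_A-R.classified.qc",
    "RG:Z:CB0L6ANXX:1:ATTCCT YS:Z:TTTGGATTTGG",  # FASTQ comment and YS:Z share one field
    "YC:Z:CT",
]


def original_seq_strict(tags):
    # Same expression as the line shown in the traceback (bwameth.py, line 284).
    return next(x for x in tags if x.startswith("YS:Z:"))[5:]


def original_seq_tolerant(tags):
    # Hypothetical variant: also split every field on spaces before matching.
    for field in tags:
        for part in field.split(" "):
            if part.startswith("YS:Z:"):
                return part[5:]
    raise KeyError("no YS:Z tag found")


try:
    original_seq_strict(other)
except StopIteration:
    print("strict lookup raises StopIteration")

print(original_seq_tolerant(other))
```

Whether splitting on spaces is the right fix depends on how bwameth wants to treat user-supplied FASTQ comments. The sketch only shows why the current lookup finds nothing.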

Suggestion

As I see it, the problem arises from how the read group argument is handled. You can probably leave the function bwa_mem as it is and change only how it is called. I am not certain, but I suspect that in the call to bwa_mem,

rg=args.read_group or rname(*args.fastqs)

causes the trouble when I do not supply a read group on the command line. Alternatively, the addition of the RG to the header would have to be disentangled from the RGs of individual reads.
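
A rough sketch of the first option, assuming the read group is only passed to bwa mem when the user supplies one. The helper names read_group_args and bwa_mem_command are hypothetical and do not exist in bwameth. The fixed flags are copied from the command in the log above.

```python
import shlex


def read_group_args(read_group=None):
    """Return the bwa mem arguments for a read group, or nothing if none was given."""
    if not read_group:
        # No default: per-read RG tags coming from the FASTQ comments stay untouched.
        return []
    if not read_group.startswith("@RG"):
        # Treat a bare name as both ID and SM, mirroring the -R string in the log.
        read_group = "@RG\\tID:{0}\\tSM:{0}".format(read_group)
    return ["-R", read_group]


def bwa_mem_command(index, threads, read_group=None):
    # Flags copied from the command line that bwameth printed in the log.
    cmd = ["bwa", "mem", "-T", "40", "-B", "2", "-L", "10", "-CM", "-U", "100", "-p"]
    cmd += read_group_args(read_group)
    cmd += ["-t", str(threads), index, "-"]
    return " ".join(shlex.quote(c) for c in cmd)


print(bwa_mem_command("TAIR10.fasta.bwameth.c2t", 4))
print(bwa_mem_command("TAIR10.fasta.bwameth.c2t", 4, read_group="sampleA"))
```

With this shape, the call site would pass args.read_group straight through instead of falling back to a name derived from the FASTQ paths.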

brentp commented 5 years ago

Thanks for the careful description. I don't intend to fix this, but I will accept a PR that does. I think that would also require manually adding the RG to the header.
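
For illustration, adding the RG to the header by hand could look roughly like the sketch below. It assumes the read group IDs are known before the header is written, for example from a first pass over the FASTQ comments. rg_header_lines is a hypothetical helper, not an existing bwameth function.

```python
def rg_header_lines(read_group_ids, sample=None):
    """Build one @RG header line per distinct read group ID, keeping first-seen order."""
    seen = []
    for rg_id in read_group_ids:
        if rg_id not in seen:
            seen.append(rg_id)

    lines = []
    for rg_id in seen:
        fields = ["@RG", "ID:" + rg_id]
        if sample:
            # SM is optional in the header; it is only added when a sample name is given.
            fields.append("SM:" + sample)
        lines.append("\t".join(fields))
    return lines


# Example with the read group from the FASTQ comment shown earlier in this issue.
for line in rg_header_lines(["CB0L6ANXX:1:ATTCCT", "CB0L6ANXX:1:ATTCCT"], sample="test-line_A"):
    print(line)
```

The open design question is where the IDs come from: the header precedes the reads in SAM, so they cannot be collected while the alignments are being streamed.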

okartal commented 5 years ago

@brentp I will fork and try it
