bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
988 stars 354 forks source link

RNA-seq workflow not working with GATK3 but works with GATK4 #2835

Closed komalsrathi closed 5 years ago

komalsrathi commented 5 years ago

Hi,

I am testing the same sample (~400 MB bam file with just chromosome X) for RNA-seq variant calling using GATK3 and GATK4.

Here is my config files, both are exactly the same except for tools_off has gatk4 in the first config:

GATK3:

details:
- algorithm:
    aligner: false
    bam_sort: false
    variantcaller: gatk-haplotype
    tools_off: [gatk4, gemini, vcfanno, gatk-haplotype-joint, salmon, htseq-count, fastqc, multi-qc]
  analysis: RNA-seq
  description: 'Sample 1: 27571P'
  files:
  - Sample_1__27571P-dedup.splitN_chrX.bam
  genome_build: GRCh37
  metadata:
    batch: 27571P
fc_name: 27571P-CDLS
upload:
  dir: /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/gatk_output/27571P_gatk3

GATK4:

details:
- algorithm:
    aligner: false
    bam_sort: false
    variantcaller: gatk-haplotype
    tools_off: [gemini, vcfanno, gatk-haplotype-joint, salmon, htseq-count, fastqc, multi-qc]
  analysis: RNA-seq
  description: 'Sample 1: 27571P'
  files:
  - Sample_1__27571P-dedup.splitN_chrX.bam
  genome_build: GRCh37
  metadata:
    batch: 27571P
fc_name: 27571P-CDLS
upload:
  dir: /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/gatk_output/27571P_gatk4

The run for GATK4 finishes without any issues but I am consistently getting the following error with GATK3:

##### ERROR MESSAGE: SAM/BAM/CRAM file htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter@338270ea appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 63. Please see https://software.broadinstitute.org/gatk/documentation/article?id=6470 for more details and options related to this error.

I have uploaded the input bam file and the log files for both the runs here: https://drive.google.com/open?id=1qnaHW34r30JyyZsWpO_c4pRBdHw2aqTm.

Please let me know if you need any other information from my side.

Thanks, Komal

roryk commented 5 years ago

Hi Komal,

Something is broken with your BAM file. I'd set the aligner to STAR, which will re-align and give a proper BAM file.

komalsrathi commented 5 years ago

I figured because of the error, but then why does it work when I have exactly the same configuration and input bam file for GATK4?

roryk commented 5 years ago

Hi Komal,

I'm not sure, I'd have to look at it more in depth to figure out why it is working with one and not the other, but I'm overwhelmed with other things to look at so it is unlikely I'm ever going to spend the time to figure it out.

komalsrathi commented 5 years ago

Sure, I get it.

If you don't mind, I would like to keep this open as I have tried to debug it myself but cannot figure out what the exact issue is.

roryk commented 5 years ago

Ok! Thanks!

naumenko-sa commented 5 years ago

Hi @komalsrathi!

Thanks for reporting this issue!

I looks like gatk3.8 complaining about quality scores in your file. They indeed look unusual (I converted bam2fastq):

@K00283:161:HL3YHBBXX:4:1113:31142:43251/2
GCAGAAGTGGCGGTCTCCTCCCTTAGGATTACAGGCACCCTGCGTTAACCTCAAAATTGTCTCAGTCCCAAAGAAGGGGCTAGATTTTCTTTTATACTTT
+
``eeeiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
@K00283:161:HL3YHBBXX:4:2216:27773:8365/2
CTCGAGGTGATCCCGAGGGAAGGAGCGGGGGTCTGAGGGTGGTCCCGAGAGGGACCCGAGGGGTGGAGCGGGGGGAGGGTCTGGAGATGGCCCCGAGGAG
+
``eeeiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiieiiii[ii[ieiiiiiiieei`i[eieiiiiieieViiiiii

The range of quality scores in your file:

Min phred:.66
Max phred:.105

in a typical file:

Min phred:.35
Max phred:.73

It seems that you are dealing with a rare sample from [llumina1.5, Illumina1.8) era, those had different quality scores: https://en.wikipedia.org/wiki/FASTQ_format .

Not sure why gatk4 work and gatk3 does not. Maybe gatk4 just ignores the scores, or decodes them properly. You may try to specify quality scores:

details:
- algorithm:
    aligner: star
    quality_format: illumina
    strandedness: unstranded
    tools_off:
    - gatk4
    variantcaller: gatk-haplotype
  analysis: RNA-seq
  description: test_sample1
  files:
  - /path/to/test2835/test/input/test_sample1.bam
  genome_build: GRCh37
  metadata:
    batch: test
fc_name: test
resources:
  default:
    cores: 2
    jvm_opts:
    - -Xms750m
    - -Xmx7000m
    memory: 15G
upload:
  dir: ../final

I'm not sure if this works in RNA-seq pipeline or in DNA variant calling only (variant2). If the trick is not working for RNA-seq, you may try to realign and correct quality scores with variant2, and then call variants with RNA-seq pipeline using the new bam file.

Please let us know if that works for you.

Sergey

komalsrathi commented 5 years ago

Hi,

Thanks for the response. I also tried using just the fastq input (instead of the bam) and I get exactly the same error message using GATK3.8. I will definitely try your suggestion by modifying the config file and update here.

Thanks Komal

roryk commented 5 years ago

Thanks, +1 for Sergey's suggestion, that should work.

komalsrathi commented 5 years ago

I tried again using the config file suggested above and also with fastq input files but I get the same error. Is there a way to correct quality scores using variant2 in the bcbio pipeline?

naumenko-sa commented 5 years ago

Hi Komal!

Thanks for confirming that quality_format: illumina is not working with RNA-seq pipeline.

I was able to convert your example bam file with variant2 pipeline:

details:
- algorithm:
    aligner: bwa
    quality_format: illumina
    mark_duplicates: false
    realign: false
    recalibrate: false
    save_diskspace: true
    variantcaller: false
  analysis: variant2
  description: test_sample1
  files:
  - /path/test2835/test/input/test_sample1.bam
  genome_build: GRCh37
  metadata:
    batch: test
fc_name: test
resources:
  default:
    cores: 3
    jvm_opts:
    - -Xms750m
    - -Xmx3000m
    memory: 3G
upload:
  dir: ../final

Variant2 correctly parses Illumina [1.5-1.8) phred format (see Q64 option):

bamtofastq filename=/path/test2835/test/input/test_sample1.bam T=/path/test2835/test/work/bcbiotx/tmpqdc20tnl/test_sample1-1.fq-sort  F=>(/path/bcbio_1.1.5/galaxy/../anaconda/bin/seqtk seq -Q64 -V /dev/stdin | bgzip -c /dev/stdin > test_sample1-1.fq.gz) F2=> ...

When converting the resulting bam file to fastq, quality scores are Phred+33:

@K00283:161:HL3YHBBXX:4:2106:18923:4884/1
GCCATGTGTATTTTTTTAAATTTCCACTGATGATTTTGCTGCATGGCCGGTGTTGAGAATGACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCA
+
AAFFFJJJJJJJJJJJJJ<<JJJFA<FJFFJJJJJJJJJJJJJJJFJJJJJJJFAJFJJJJFFFJF<FJJJJJJJJJJFJJJJJJJJJAJJJJJFFFFF<

I think you may use the output of variant2 (bam) as input for RNA-seq pipeline (realign with star).

Alternatively, you may just convert Phred+64 fastq files to Phred+33 format with bamtofastq or one of the many available converters (https://en.wikipedia.org/wiki/FASTQ_format). This will save you some compute time, as variant2 is not only converting the files but also performing bwa alignment.

Sergey

roryk commented 5 years ago

Ok, I see why this isn't working now. GATK3 had the ability to fix the quality scores on the fly, and GATK4 does not, so the quality scores don't get converted. There is an option to scale the quality scores by a number, I think if we subtract 31 from the quality scores we can get ones that will work.

roryk commented 5 years ago

I think it's probably a better idea to groom the FASTQ files though, if we swap the quality scores on the alignments then we have two different versions of the quality scores and we have to keep track of which is which.

roryk commented 5 years ago

I see-- it looks like we already do that, so I'm not sure why starting from the FASTQ files doesn't work with GATK4. Did you run with those FASTQ files and have the "illumina" option set as the quality_format?

komalsrathi commented 5 years ago

I am re-running with the following config. Let you know in a couple hours:

details:
- algorithm:
    aligner: star
    quality_format: illumina
    strandedness: unstranded
    tools_off:
    - gatk4
    variantcaller: gatk-haplotype
  analysis: RNA-seq
  description: 'Sample 1: 27571P'
  files:
  - Sample_1__27571P-dedup.splitN_chrX_1.fastq.gz
  - Sample_1__27571P-dedup.splitN_chrX_2.fastq.gz
  genome_build: GRCh37
  metadata:
    batch: 27571P
fc_name: 27571P-CDLS
resources:
  default:
    cores: 2
    jvm_opts:
    - -Xms750m
    - -Xmx7000m
    memory: 15G
upload:
  dir: realign
komalsrathi commented 5 years ago

I got the same error using the above config file with fastq input and quality_format: nohup_fastq.txt

komalsrathi commented 5 years ago

Related to my issue, in GATK docs, they suggest turning the following parameters on --fix_misencoded_quality_scores / -fixMisencodedQuals or -allowPotentiallyMisencodedQuals / --allow_potentially_misencoded_quality_scores. How do I do that in the bcbio pipeline?

roryk commented 5 years ago

Hi Komal,

We do that, but what is happening is with FASTQ files we groom them first, so the quality score should be fixed at that point, so then we don't run GATK with those options. You can see we turn them on here if they haven't been fixed yet.

https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/rnaseq/variation.py#L52

But we fix them here:

https://github.com/bcbio/bcbio-nextgen/blob/b40970724e2827f7aa96dccfb99e8d6fe24136bd/bcbio/pipeline/fastq.py#L36

so I'm not sure what is happening.

komalsrathi commented 5 years ago

Hi,

Just keeping you updated: Using GATK3.8 and latest bcbio-nextgen installation, I did some tests on a small fastq file as is and a fastq file where I converted the scores from 1.3 to 1.8 (using https://en.wikipedia.org/wiki/FASTQ_format as suggested by Sergey) with and without the parameter quality_format: illumina:

Fastq file as is:

1. with quality_format: illumina - same error as above
2. without quality_format: illumina - same error as above

Converted fastq file Illumina 1.3 to 1.8

# convert
sed -e '4~4y/@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghi/!"#$%&'\''()*+,-.\/0123456789:;<=>?@ABCDEFGHIJ/' input.fastq  > output.fastq 

1. with quality_format: illumina - does not work as it has been converted to standard format. As expected, I get this error: ValueError: Quality format specified in the YAML file might be a different encoding. 'illumina' was specified but possible formats detected were standard
2. without quality_format: illumina - works

This baffles me and not sure why it is happening but in summary: with GATK4 I don't have to convert the scores but with GATK3.8, I have to. For now, I am just going to convert all the fastq files and run the pipeline.

Thanks, Komal

roryk commented 5 years ago

Hi Komal,

Are you still running into this issue? We never figured out what was happening, did you?

komalsrathi commented 5 years ago

Hi Rory,

The last thing that worked for me was to first convert the Fastq files from 1.3 to 1.8 which worked with Gatk3.8. I honestly could never figure out why the scores were not automatically getting converted when using v3.8. If you want, you can close this. Thank you!

roryk commented 5 years ago

Thanks Komal-- sorry for not having a good answer either, but I'm glad you got your work done. Thanks so much for following up!