bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
986 stars 354 forks source link

Some inconsistencies when using a custom genome #2709

Closed amizeranschi closed 5 years ago

amizeranschi commented 5 years ago

Hi,

These are some rather minor nuisances but I figured they could be worth reporting, just in case they could be "symptoms" of something more serious.

Here is an archive with (hopefully) everything that is needed to reproduce this: replicate-VC-6-tools.tar.gz

I've set up a custom bcbio genome based on chr7 from hg19 and an associated GFF file (which I've converted to GTF) using the commands below. The GFF file is included in the archive above, but the FASTA file was too large to include (162 MB uncompressed and 48 MB GZIPped).

gffread Homo_sapiens.GRCh37.82.chr7.gff -T -o Homo_sapiens.GRCh37.82.chr7.gtf
bcbio_setup_genome.py -f hs37d5.chr7.fa -g Homo_sapiens.GRCh37.82.chr7.gtf -i bwa seq -n Hsapiens -b hg19-chr7

When I use the hg19-chr7 genome for variant calling with 6 tools, the final output directory ends up containing the following files:

$ ls -1 testingVC-merged/final/2019-03-03_testingVC-merged/
bcbio-nextgen-commands.log
bcbio-nextgen.log
HG00096-strelka2-annotated.vcf.gz
HG00096-strelka2-annotated.vcf.gz.tbi
HG00096-varscan-annotated.vcf.gz
HG00096-varscan-annotated.vcf.gz.tbi
HG00097-strelka2-annotated.vcf.gz
HG00097-strelka2-annotated.vcf.gz.tbi
HG00097-varscan-annotated.vcf.gz
HG00097-varscan-annotated.vcf.gz.tbi
metadata.csv
multiqc
programs.txt
project-summary.yaml
testingVC-ensemble-annotated.vcf.gz
testingVC-ensemble-annotated.vcf.gz.tbi
testingVC-freebayes-annotated.vcf.gz
testingVC-freebayes-annotated.vcf.gz.tbi
testingVC-gatk-haplotype-annotated.vcf.gz
testingVC-gatk-haplotype-annotated.vcf.gz.tbi
testingVC-platypus-annotated.vcf.gz
testingVC-platypus-annotated.vcf.gz.tbi
testingVC-samtools-annotated.vcf.gz
testingVC-samtools-annotated.vcf.gz.tbi

My question is, in the case of Strelka2 and Varscan, why are there VCF files being copied for individual samples (HG00096 and HG00097), and not for the whole batch (testingVC)?

Strangely, this only seems to happen with the custom genome that I've set up. If I replace hg19-chr7 with hg19 in the YAML template, the resulting VCF files are as I would expect:

$ ls -1 testingVC-merged/final/2019-03-03_testingVC-merged/
bcbio-nextgen-commands.log
bcbio-nextgen.log
data_versions.csv
metadata.csv
multiqc
programs.txt
project-summary.yaml
testingVC-ensemble-annotated.vcf.gz
testingVC-ensemble-annotated.vcf.gz.tbi
testingVC-freebayes-annotated.vcf.gz
testingVC-freebayes-annotated.vcf.gz.tbi
testingVC-gatk-haplotype-annotated.vcf.gz
testingVC-gatk-haplotype-annotated.vcf.gz.tbi
testingVC-platypus-annotated.vcf.gz
testingVC-platypus-annotated.vcf.gz.tbi
testingVC-samtools-annotated.vcf.gz
testingVC-samtools-annotated.vcf.gz.tbi
testingVC-strelka2-annotated.vcf.gz
testingVC-strelka2-annotated.vcf.gz.tbi
testingVC-varscan-annotated.vcf.gz
testingVC-varscan-annotated.vcf.gz.tbi
chapmanb commented 5 years ago

Thanks for the question and the detailed report. My guess here is that there are no variants called for varscan and strelka2 with the custom genome, due to low alignment in your custom genome region. Does that match what you're seeing with the output files?

In general you should align against the full genome rather than custom regions. Excluding genomic regions can lead to misalignments and false variant artifacts. If a read would ideally align better somewhere else in the genome, but you exclude it, then it can align sub-optimally to what's left and the alignment quality won't reflect this.

Hope this helps explain what is going on.

amizeranschi commented 5 years ago

Thanks for looking into this. The FASTQ files actually contain reads that should only map to a specific gene on chr7. So, whether the reads are mapped to the full hg19 genome or to the custom hg19-chr7 genome shouldn't make a difference in this case.

There are variants in the strelka2 and varscan VCF files when using either genome. The issue seems somehow related to using the custom genome, but I can't imagine what could cause this. However, I'm wondering whether there aren't any other similar inconsistencies related to using the custom genome that I just haven't noticed yet.

Here's what things look like in IGV for one of the samples. For both VCF and BAM files, the top corresponds to reads mapped to the full hg19 genome, while the bottom is for reads mapped to the custom hg19-chr7 genome.

The weird thing, when using the custom genome, is that the strelka2 and varscan VCF files with individual-sample names (e.g. HG00096-strelka2-annotated.vcf.gz and HG00097-strelka2-annotated.vcf.gz) are both identical and they are actually multi-sample VCF files. This could just be a problem with the stage-out phase, when files are getting copied to the final directory while using a custom genome.

hg19-chr7

amizeranschi commented 5 years ago

Which part of the code handles copying the results into the final directory at the end of a run? I've read through the Code section of the docs but couldn't figure this out.

chapmanb commented 5 years ago

Sorry again about this issue, I haven't been able to dig enough to reproduce and figure out what is happening with your custom genome setup. The upload code for the final directory is here:

https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/upload/__init__.py

If you have any insight into what is happening happy to adjust and fix. Thanks again.

amizeranschi commented 5 years ago

I couldn't find anything in that code that could cause what I've described above, so here's a simple reproducible example that should actually take less time to set up and run than the time it'll take you to read this whole post. The example is based on the CWL test files.

NOTE: While setting this up I've also uncovered another potential issue with SSL. When running the bcbio_setup_genome.py command below I initially got the following error, using a recently upgraded development version of bcbio:


Creating the seq index.
Writing out merged GTF file to /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/genomes/Hsapiens/hg19-custom/tmpcbl/ref-transcripts.gtf.
gtfToGenePred: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/export/home/ncit/external/a.mizeranschi/temp/test_bcbio_cwl/cloudbiolinux/utils/prepare_tx_gff.py", line 841, in <module>
    main(args.org_build, args.gtf, args.fasta, genome_dir, args.cores, args)
  File "/export/home/ncit/external/a.mizeranschi/temp/test_bcbio_cwl/cloudbiolinux/utils/prepare_tx_gff.py", line 300, in main
    gtf_to_refflat(gtf_file)
  File "/export/home/ncit/external/a.mizeranschi/temp/test_bcbio_cwl/cloudbiolinux/utils/prepare_tx_gff.py", line 440, in gtf_to_refflat
    genepred = gtf_to_genepred(gtf)
  File "/export/home/ncit/external/a.mizeranschi/temp/test_bcbio_cwl/cloudbiolinux/utils/prepare_tx_gff.py", line 432, in gtf_to_genepred
    subprocess.check_call(cmd.format(**locals()), shell=True)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'gtfToGenePred -allErrors -ignoreGroupsWithoutExons -genePredExt /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/genomes/Hsapiens/hg19-custom/tmpcbl/ref-transcripts.gtf /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/genomes/Hsapiens/hg19-custom/tmpcbl/ref-transcripts.genePred' returned non-zero exit status 127
Traceback (most recent call last):
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/tools/bin/bcbio_setup_genome.py", line 308, in <module>
    subprocess.check_call(cmd.format(**locals()), shell=True)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/bin/python /export/home/ncit/external/a.mizeranschi/temp/test_bcbio_cwl/cloudbiolinux/utils/prepare_tx_gff.py --cores 1 --genome-dir /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/genomes --gtf /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/genomes/Hsapiens/hg19-custom/rnaseq/ref-transcripts.gtf Hsapiens hg19-custom' returned non-zero exit status 1

I searched for all the libssl.so files inside my bcbio install:

$ find ${bcbio_path}/anaconda/ -name libssl.so.*
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/pkgs/openssl-1.1.1a-h7b6447c_0/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/pkgs/openssl-1.1.1b-h14c3975_0/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/pkgs/openssl-1.0.2r-h14c3975_0/lib/libssl.so.1.0.0
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/pkgs/openssl-1.1.1b-h7b6447c_0/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/pkgs/openssl-1.1.1b-h14c3975_1/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/pkgs/openssl-1.1.1b-h7b6447c_1/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/python3/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/samtools0/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/dv/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/python2/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/libssl.so.1.0.0

I bypassed the bcbio_setup_genome.py errors by creating a symlink called libssl.so.1.0.0 inside ${bcbio_path}/anaconda/lib, pointing to ${bcbio_path}/anaconda/lib/libssl.so.1.1, and similarly for libcrypto.so:

ln -s ${bcbio_path}/anaconda/lib/libssl.so.1.1 ${bcbio_path}/anaconda/lib/libssl.so.1.0.0
ln -s ${bcbio_path}/anaconda/lib/libcrypto.so.1.1 ${bcbio_path}/anaconda/lib/libcrypto.so.1.0.0

Reproducible example starts here:

You should set the first three variables below (bcbio_path, test_path and n_cores) to something that makes sense on your hardware and then run the following code as is. Please let me know if you get the same results as I did.


bcbio_path=/export/home/ncit/external/a.mizeranschi/bcbio_nextgen
test_path=/export/home/ncit/external/a.mizeranschi
n_cores=16

export PATH=${bcbio_path}/tools/bin:${bcbio_path}/anaconda/bin:${PATH}
cd ${test_path}
git clone https://github.com/bcbio/test_bcbio_cwl.git
cd ${test_path}/test_bcbio_cwl

## set up a custom genome based on the small hg19 files from the CWL tests
bcbio_setup_genome.py -f ${test_path}/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa -g ${test_path}/test_bcbio_cwl/testdata/genomes/hg19/rnaseq/ref-transcripts.gtf -i bwa bowtie2 seq -n Hsapiens -b hg19-custom

## set up a yaml template
echo "---" > test-custom-genome.yaml
echo "details:" >> test-custom-genome.yaml
echo "  - analysis: variant2" >> test-custom-genome.yaml
echo "    genome_build: hg19-custom" >> test-custom-genome.yaml
echo "    algorithm:" >> test-custom-genome.yaml
echo "      aligner: false" >> test-custom-genome.yaml
echo "      mark_duplicates: true" >> test-custom-genome.yaml
echo "      bam_clean: fixrg" >> test-custom-genome.yaml
echo "      recalibrate: false" >> test-custom-genome.yaml
echo "      realign: false" >> test-custom-genome.yaml
echo "      effects: false" >> test-custom-genome.yaml
echo "      tools_off: [gemini, contamination, peddy]" >> test-custom-genome.yaml
echo "      variantcaller: [gatk-haplotype, samtools, varscan, freebayes, platypus, strelka2]" >> test-custom-genome.yaml
echo "      ensemble:" >> test-custom-genome.yaml
echo "        numpass: 3" >> test-custom-genome.yaml
echo "      variant_regions: ${test_path}/test_bcbio_cwl/testdata/automated/variant_regions-bam.bed" >> test-custom-genome.yaml

## set up a CSV file for two small BAM files from the CWL tests
echo "samplename,description,batch" > test-custom-genome.csv
echo "${test_path}/test_bcbio_cwl/testdata/100326_FC6107FAAXX/6_100326_FC6107FAAXX.bam,6_100326_FC6107FAAXX,test-custom-genome" >> test-custom-genome.csv
echo "${test_path}/test_bcbio_cwl/testdata/100326_FC6107FAAXX/6_100326_FC6107FAAXX_2.bam,6_100326_FC6107FAAXX_2,test-custom-genome" >> test-custom-genome.csv
echo "${test_path}/test_bcbio_cwl/testdata/100326_FC6107FAAXX/7_100326_FC6107FAAXX.bam,7_100326_FC6107FAAXX,test-custom-genome" >> test-custom-genome.csv

## set up and run bcbio on the previous files
bcbio_nextgen.py --separator R -w template test-custom-genome.yaml test-custom-genome.csv ${test_path}/test_bcbio_cwl/testdata/100326_FC6107FAAXX/*.bam
cd ${test_path}/test_bcbio_cwl/test-custom-genome/work

bcbio_nextgen.py ../config/test-custom-genome.yaml -n ${n_cores}

##check the results
cd ${test_path}/test_bcbio_cwl/test-custom-genome/final/*_test-custom-genome/
ls -1 *.vcf.gz

The outcome:

When running the stuff above, I end up with the following list of VCF files. Notice how bcbio didn't seem to produce batch VCF files for varscan and strelka2, at least according to the file names:


$ ls -1 *.vcf.gz
6_100326_FC6107FAAXX_2-strelka2-annotated.vcf.gz
6_100326_FC6107FAAXX_2-varscan-annotated.vcf.gz
6_100326_FC6107FAAXX-strelka2-annotated.vcf.gz
6_100326_FC6107FAAXX-varscan-annotated.vcf.gz
7_100326_FC6107FAAXX-strelka2-annotated.vcf.gz
7_100326_FC6107FAAXX-varscan-annotated.vcf.gz
test-custom-genome-ensemble-annotated.vcf.gz
test-custom-genome-freebayes-annotated.vcf.gz
test-custom-genome-gatk-haplotype-annotated.vcf.gz
test-custom-genome-platypus-annotated.vcf.gz
test-custom-genome-samtools-annotated.vcf.gz

The first 6 VCF files are actually batch VCF files (and NOT single-sample files), but they seem to be duplicated for each of the original samples, for some reason.


$ zcat 6_100326_FC6107FAAXX_2-strelka2-annotated.vcf.gz
[...]
##source=strelka
##source_version=2.9.10
##startTime=Thu Mar 14 18:32:44 2019
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  6_100326_FC6107FAAXX    6_100326_FC6107FAAXX_2  7_100326_FC6107FAAXX
chrM    150 .   T   C   3070    PASS    MQ=60;SNVHPOL=2 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL:AF 1:3070:3070:986:20:0,986:0,429:0,557:-99:PASS:370,0:1   1:3070:3070:986:20:0,986:0,429:0,557:-99:PASS:370,0:1   1:3070:3070:986:20:0,986:0,429:0,557:-99:PASS:370,0:1
chrM    152 .   T   C   3070    PASS    MQ=60;SNVHPOL=2 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL:AF 1:3070:3070:990:16:1,989:1,432:0,557:-99:PASS:370,0:0.99899 1:3070:3070:990:16:1,989:1,432:0,557:-99:PASS:370,0:0.99899 1:3070:3070:990:16:1,989:1,432:0,557:-99:PASS:370,0:0.99899
chrM    195 .   C   T   3070    PASS    MQ=60;SNVHPOL=2 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL:AF 1:3070:3070:926:11:0,926:0,351:0,575:-99:PASS:370,0:1   1:3070:3070:926:11:0,926:0,351:0,575:-99:PASS:370,0:1   1:3070:3070:926:11:0,926:0,351:0,575:-99:PASS:370,0:1
[...]
amizeranschi commented 5 years ago

Note: just to be clear, the SSL issues mentioned in the post above and the main issues discussed here (with the varscan and strelka2 batch VCF files when using custom genomes) are definitely unrelated.

I wasn't getting the SSL issues when I first opened this discussion 12 days ago and they only appeared recently, most probably due to upgrades to bcbio's development branch.

chapmanb commented 5 years ago

Thanks much for the smaller reproducible example and the detailed report. I pushed a fix to the latest development version that should resolve this problem, and also updated gtfToGenePred to rebuild the conda package with openssl 1.1.x. If you update with:

bcbio_nextgen.py upgrade -u development

you should get both fixes and hopefully things will work cleanly. Thanks again for helping debug this issue.

amizeranschi commented 5 years ago

Thanks a lot for those fixes. After upgrading bcbio, all of the issues mentioned here went away.