Closed amizeranschi closed 5 years ago
Thanks for the question and the detailed report. My guess here is that there are no variants called for varscan and strelka2 with the custom genome, due to low alignment in your custom genome region. Does that match what you're seeing with the output files?
In general you should align against the full genome rather than custom regions. Excluding genomic regions can lead to misalignments and false variant artifacts. If a read would ideally align better somewhere else in the genome, but you exclude it, then it can align sub-optimally to what's left and the alignment quality won't reflect this.
Hope this helps explain what is going on.
Thanks for looking into this. The FASTQ files actually contain reads that should only map to a specific gene on chr7. So, whether the reads are mapped to the full hg19 genome or to the custom hg19-chr7 genome shouldn't make a difference in this case.
There are variants in the strelka2 and varscan VCF files when using either genome. The issue seems somehow related to using the custom genome, but I can't imagine what could cause this. However, I'm wondering whether there aren't any other similar inconsistencies related to using the custom genome that I just haven't noticed yet.
Here's what things look like in IGV for one of the samples. For both VCF and BAM files, the top corresponds to reads mapped to the full hg19 genome, while the bottom is for reads mapped to the custom hg19-chr7 genome.
The weird thing, when using the custom genome, is that the strelka2 and varscan VCF files with individual-sample names (e.g. HG00096-strelka2-annotated.vcf.gz
and HG00097-strelka2-annotated.vcf.gz
) are both identical and they are actually multi-sample VCF files. This could just be a problem with the stage-out phase, when files are getting copied to the final directory while using a custom genome.
Which part of the code handles copying the results into the final
directory at the end of a run? I've read through the Code section of the docs but couldn't figure this out.
Sorry again about this issue, I haven't been able to dig enough to reproduce and figure out what is happening with your custom genome setup. The upload code for the final directory is here:
https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/upload/__init__.py
If you have any insight into what is happening happy to adjust and fix. Thanks again.
I couldn't find anything in that code that could cause what I've described above, so here's a simple reproducible example that should actually take less time to set up and run than the time it'll take you to read this whole post. The example is based on the CWL test files.
NOTE: While setting this up I've also uncovered another potential issue with SSL. When running the bcbio_setup_genome.py
command below I initially got the following error, using a recently upgraded development version of bcbio:
Creating the seq index.
Writing out merged GTF file to /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/genomes/Hsapiens/hg19-custom/tmpcbl/ref-transcripts.gtf.
gtfToGenePred: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "/export/home/ncit/external/a.mizeranschi/temp/test_bcbio_cwl/cloudbiolinux/utils/prepare_tx_gff.py", line 841, in <module>
main(args.org_build, args.gtf, args.fasta, genome_dir, args.cores, args)
File "/export/home/ncit/external/a.mizeranschi/temp/test_bcbio_cwl/cloudbiolinux/utils/prepare_tx_gff.py", line 300, in main
gtf_to_refflat(gtf_file)
File "/export/home/ncit/external/a.mizeranschi/temp/test_bcbio_cwl/cloudbiolinux/utils/prepare_tx_gff.py", line 440, in gtf_to_refflat
genepred = gtf_to_genepred(gtf)
File "/export/home/ncit/external/a.mizeranschi/temp/test_bcbio_cwl/cloudbiolinux/utils/prepare_tx_gff.py", line 432, in gtf_to_genepred
subprocess.check_call(cmd.format(**locals()), shell=True)
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'gtfToGenePred -allErrors -ignoreGroupsWithoutExons -genePredExt /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/genomes/Hsapiens/hg19-custom/tmpcbl/ref-transcripts.gtf /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/genomes/Hsapiens/hg19-custom/tmpcbl/ref-transcripts.genePred' returned non-zero exit status 127
Traceback (most recent call last):
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/tools/bin/bcbio_setup_genome.py", line 308, in <module>
subprocess.check_call(cmd.format(**locals()), shell=True)
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/bin/python /export/home/ncit/external/a.mizeranschi/temp/test_bcbio_cwl/cloudbiolinux/utils/prepare_tx_gff.py --cores 1 --genome-dir /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/genomes --gtf /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/genomes/Hsapiens/hg19-custom/rnaseq/ref-transcripts.gtf Hsapiens hg19-custom' returned non-zero exit status 1
I searched for all the libssl.so files inside my bcbio install:
$ find ${bcbio_path}/anaconda/ -name libssl.so.*
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/pkgs/openssl-1.1.1a-h7b6447c_0/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/pkgs/openssl-1.1.1b-h14c3975_0/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/pkgs/openssl-1.0.2r-h14c3975_0/lib/libssl.so.1.0.0
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/pkgs/openssl-1.1.1b-h7b6447c_0/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/pkgs/openssl-1.1.1b-h14c3975_1/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/pkgs/openssl-1.1.1b-h7b6447c_1/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/python3/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/samtools0/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/dv/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/python2/lib/libssl.so.1.1
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/libssl.so.1.0.0
I bypassed the bcbio_setup_genome.py
errors by creating a symlink called libssl.so.1.0.0
inside ${bcbio_path}/anaconda/lib
, pointing to ${bcbio_path}/anaconda/lib/libssl.so.1.1
, and similarly for libcrypto.so
:
ln -s ${bcbio_path}/anaconda/lib/libssl.so.1.1 ${bcbio_path}/anaconda/lib/libssl.so.1.0.0
ln -s ${bcbio_path}/anaconda/lib/libcrypto.so.1.1 ${bcbio_path}/anaconda/lib/libcrypto.so.1.0.0
You should set the first three variables below (bcbio_path
, test_path
and n_cores
) to something that makes sense on your hardware and then run the following code as is. Please let me know if you get the same results as I did.
bcbio_path=/export/home/ncit/external/a.mizeranschi/bcbio_nextgen
test_path=/export/home/ncit/external/a.mizeranschi
n_cores=16
export PATH=${bcbio_path}/tools/bin:${bcbio_path}/anaconda/bin:${PATH}
cd ${test_path}
git clone https://github.com/bcbio/test_bcbio_cwl.git
cd ${test_path}/test_bcbio_cwl
## set up a custom genome based on the small hg19 files from the CWL tests
bcbio_setup_genome.py -f ${test_path}/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa -g ${test_path}/test_bcbio_cwl/testdata/genomes/hg19/rnaseq/ref-transcripts.gtf -i bwa bowtie2 seq -n Hsapiens -b hg19-custom
## set up a yaml template
echo "---" > test-custom-genome.yaml
echo "details:" >> test-custom-genome.yaml
echo " - analysis: variant2" >> test-custom-genome.yaml
echo " genome_build: hg19-custom" >> test-custom-genome.yaml
echo " algorithm:" >> test-custom-genome.yaml
echo " aligner: false" >> test-custom-genome.yaml
echo " mark_duplicates: true" >> test-custom-genome.yaml
echo " bam_clean: fixrg" >> test-custom-genome.yaml
echo " recalibrate: false" >> test-custom-genome.yaml
echo " realign: false" >> test-custom-genome.yaml
echo " effects: false" >> test-custom-genome.yaml
echo " tools_off: [gemini, contamination, peddy]" >> test-custom-genome.yaml
echo " variantcaller: [gatk-haplotype, samtools, varscan, freebayes, platypus, strelka2]" >> test-custom-genome.yaml
echo " ensemble:" >> test-custom-genome.yaml
echo " numpass: 3" >> test-custom-genome.yaml
echo " variant_regions: ${test_path}/test_bcbio_cwl/testdata/automated/variant_regions-bam.bed" >> test-custom-genome.yaml
## set up a CSV file for two small BAM files from the CWL tests
echo "samplename,description,batch" > test-custom-genome.csv
echo "${test_path}/test_bcbio_cwl/testdata/100326_FC6107FAAXX/6_100326_FC6107FAAXX.bam,6_100326_FC6107FAAXX,test-custom-genome" >> test-custom-genome.csv
echo "${test_path}/test_bcbio_cwl/testdata/100326_FC6107FAAXX/6_100326_FC6107FAAXX_2.bam,6_100326_FC6107FAAXX_2,test-custom-genome" >> test-custom-genome.csv
echo "${test_path}/test_bcbio_cwl/testdata/100326_FC6107FAAXX/7_100326_FC6107FAAXX.bam,7_100326_FC6107FAAXX,test-custom-genome" >> test-custom-genome.csv
## set up and run bcbio on the previous files
bcbio_nextgen.py --separator R -w template test-custom-genome.yaml test-custom-genome.csv ${test_path}/test_bcbio_cwl/testdata/100326_FC6107FAAXX/*.bam
cd ${test_path}/test_bcbio_cwl/test-custom-genome/work
bcbio_nextgen.py ../config/test-custom-genome.yaml -n ${n_cores}
##check the results
cd ${test_path}/test_bcbio_cwl/test-custom-genome/final/*_test-custom-genome/
ls -1 *.vcf.gz
When running the stuff above, I end up with the following list of VCF files. Notice how bcbio didn't seem to produce batch VCF files for varscan and strelka2, at least according to the file names:
$ ls -1 *.vcf.gz
6_100326_FC6107FAAXX_2-strelka2-annotated.vcf.gz
6_100326_FC6107FAAXX_2-varscan-annotated.vcf.gz
6_100326_FC6107FAAXX-strelka2-annotated.vcf.gz
6_100326_FC6107FAAXX-varscan-annotated.vcf.gz
7_100326_FC6107FAAXX-strelka2-annotated.vcf.gz
7_100326_FC6107FAAXX-varscan-annotated.vcf.gz
test-custom-genome-ensemble-annotated.vcf.gz
test-custom-genome-freebayes-annotated.vcf.gz
test-custom-genome-gatk-haplotype-annotated.vcf.gz
test-custom-genome-platypus-annotated.vcf.gz
test-custom-genome-samtools-annotated.vcf.gz
The first 6 VCF files are actually batch VCF files (and NOT single-sample files), but they seem to be duplicated for each of the original samples, for some reason.
$ zcat 6_100326_FC6107FAAXX_2-strelka2-annotated.vcf.gz
[...]
##source=strelka
##source_version=2.9.10
##startTime=Thu Mar 14 18:32:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 6_100326_FC6107FAAXX 6_100326_FC6107FAAXX_2 7_100326_FC6107FAAXX
chrM 150 . T C 3070 PASS MQ=60;SNVHPOL=2 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL:AF 1:3070:3070:986:20:0,986:0,429:0,557:-99:PASS:370,0:1 1:3070:3070:986:20:0,986:0,429:0,557:-99:PASS:370,0:1 1:3070:3070:986:20:0,986:0,429:0,557:-99:PASS:370,0:1
chrM 152 . T C 3070 PASS MQ=60;SNVHPOL=2 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL:AF 1:3070:3070:990:16:1,989:1,432:0,557:-99:PASS:370,0:0.99899 1:3070:3070:990:16:1,989:1,432:0,557:-99:PASS:370,0:0.99899 1:3070:3070:990:16:1,989:1,432:0,557:-99:PASS:370,0:0.99899
chrM 195 . C T 3070 PASS MQ=60;SNVHPOL=2 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL:AF 1:3070:3070:926:11:0,926:0,351:0,575:-99:PASS:370,0:1 1:3070:3070:926:11:0,926:0,351:0,575:-99:PASS:370,0:1 1:3070:3070:926:11:0,926:0,351:0,575:-99:PASS:370,0:1
[...]
Note: just to be clear, the SSL issues mentioned in the post above and the main issues discussed here (with the varscan and strelka2 batch VCF files when using custom genomes) are definitely unrelated.
I wasn't getting the SSL issues when I first opened this discussion 12 days ago and they only appeared recently, most probably due to upgrades to bcbio's development branch.
Thanks much for the smaller reproducible example and the detailed report. I pushed a fix to the latest development version that should resolve this problem, and also updated gtfToGenePred to rebuild the conda package with openssl 1.1.x. If you update with:
bcbio_nextgen.py upgrade -u development
you should get both fixes and hopefully things will work cleanly. Thanks again for helping debug this issue.
Thanks a lot for those fixes. After upgrading bcbio, all of the issues mentioned here went away.
Hi,
These are some rather minor nuisances but I figured they could be worth reporting, just in case they could be "symptoms" of something more serious.
Here is an archive with (hopefully) everything that is needed to reproduce this: replicate-VC-6-tools.tar.gz
I've set up a custom bcbio genome based on chr7 from hg19 and an associated GFF file (which I've converted to GTF) using the commands below. The GFF file is included in the archive above, but the FASTA file was too large to include (162 MB uncompressed and 48 MB GZIPped).
When I use the
hg19-chr7
genome for variant calling with 6 tools, the final output directory ends up containing the following files:My question is, in the case of Strelka2 and Varscan, why are there VCF files being copied for individual samples (HG00096 and HG00097), and not for the whole batch (testingVC)?
Strangely, this only seems to happen with the custom genome that I've set up. If I replace
hg19-chr7
withhg19
in the YAML template, the resulting VCF files are as I would expect: