bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Installing genome with annotation crashes on "ValueError: No lines parsed -- was an empty file provided?" #1611

Closed NeillGibson closed 7 years ago

NeillGibson commented 7 years ago

Hi,

I am trying to install a genome with a gene annotation so that I can run snpEff together with variant calling.

The installation of the reference genome crashes while processing the gff3 file, with the error "ValueError: No lines parsed -- was an empty file provided?"

Full error message:

[localhost] local: ln -sf ../seq/test_SnpEff.fa /Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bowtie2/test_SnpEff.fa
Creating gffutils database for /Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/tmpcbl/ref-transcripts.gtf.
Traceback (most recent call last):
  File "/data/run/Projects/project-123/cloudbiolinux/utils/prepare_tx_gff.py", line 821, in <module>
    main(args.org_build, args.gtf, args.fasta, genome_dir, args.cores)
  File "/data/run/Projects/project-123/cloudbiolinux/utils/prepare_tx_gff.py", line 286, in main
    db = _get_gtf_db(gtf_file)
  File "/data/run/Projects/project-123/cloudbiolinux/utils/prepare_tx_gff.py", line 757, in _get_gtf_db
    disable_infer_transcripts, disable_infer_genes = guess_disable_infer_extent(gtf)
  File "/data/run/Projects/project-123/cloudbiolinux/utils/prepare_tx_gff.py", line 729, in guess_disable_infer_extent
    db = _create_tiny_gffutils_db(gtf_file)
  File "/data/run/Projects/project-123/cloudbiolinux/utils/prepare_tx_gff.py", line 700, in _create_tiny_gffutils_db
    disable_infer_transcripts=True)
  File "/Tools/bcbio-0.9.9/anaconda/lib/python2.7/site-packages/gffutils/create.py", line 1273, in create_db
    c.create()
  File "/Tools/bcbio-0.9.9/anaconda/lib/python2.7/site-packages/gffutils/create.py", line 488, in create
    self._populate_from_lines(self.iterator)
  File "/Tools/bcbio-0.9.9/anaconda/lib/python2.7/site-packages/gffutils/create.py", line 609, in _populate_from_lines
    raise ValueError("No lines parsed -- was an empty file provided?")
ValueError: No lines parsed -- was an empty file provided?
Traceback (most recent call last):
  File "/tools/bioinfo/app/bcbio-0.9.9/bin/bcbio_setup_genome.py", line 4, in <module>
    __import__('pkg_resources').run_script('bcbio-nextgen==0.9.9', 'bcbio_setup_genome.py')
  File "/Tools/bcbio-0.9.9/anaconda/lib/python2.7/site-packages/setuptools-25.1.6-py2.7.egg/pkg_resources/__init__.py", line 719, in run_script
  File "/Tools/bcbio-0.9.9/anaconda/lib/python2.7/site-packages/setuptools-25.1.6-py2.7.egg/pkg_resources/__init__.py", line 1505, in run_script
  File "/Tools/bcbio-0.9.9/anaconda/lib/python2.7/site-packages/bcbio_nextgen-0.9.9-py2.7.egg-info/scripts/bcbio_setup_genome.py", line 277, in <module>
    subprocess.check_call(cmd.format(**locals()), shell=True)
  File "/Tools/bcbio-0.9.9/anaconda/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/Tools/bcbio-0.9.9/anaconda/bin/python /data/run/Projects/project-123/cloudbiolinux/utils/prepare_tx_gff.py --cores 1 --genome-dir /Tools/bcbio-0.9.9/genomes --gtf /Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq/ref-transcripts.gtf test_SnpEff test_SnpEff' returned non-zero exit status 1

The error mentions two GTF files. The first one, which has tmpcbl in its path, really is empty:

Creating gffutils database for /Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/tmpcbl/ref-transcripts.gtf

The second contains data and looks like the complete GFF3 file converted to GTF. Its path includes rnaseq, though I did not specify anything about RNA-seq:

subprocess.CalledProcessError: Command '/Tools/bcbio-0.9.9/anaconda/bin/python /data/run/Projects/project-123/cloudbiolinux/utils/prepare_tx_gff.py --cores 1 --genome-dir /Tools/bcbio-0.9.9/genomes --gtf /Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq/ref-transcripts.gtf test_SnpEff test_SnpEff

Steps needed to reproduce this error

wget ftp://ftp.solgenomics.net/tomato_genome/assembly/build_2.50/S_lycopersicum_chromosomes.2.50.fa.gz
wget ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG2.4_release/ITAG2.4_gene_models.gff3.gz

gunzip S_lycopersicum_chromosomes.2.50.fa.gz
gunzip ITAG2.4_gene_models.gff3.gz

bcbio_setup_genome.py -f S_lycopersicum_chromosomes.2.50.fa -n test_SnpEff -b test_SnpEff -i bwa seq -g ITAG2.4_gene_models.gff3

Is there something wrong with the gff3 file or with the command I use to install the reference genome? Or did I maybe run into a bug?

Thank you for looking at this.

roryk commented 7 years ago

Hi Neil,

Sorry for the problem. Could you try running bcbio_setup_genome.py with the --gff3 flag? For a GTF we expect transcript_id and gene_id attributes, which we use to figure out which genes go with which transcripts. The --gff3 flag will try to reconstruct those from the ID/Parent attributes in the GFF3 file.
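A quick way to see which attribute style an annotation file uses before running the installer is to look at column 9 of the first feature line. The following is a minimal sketch of that check, not the logic used by prepare_tx_gff.py or gffutils; the function name and the sample records are made up for illustration.

```python
import re


def guess_annotation_format(lines):
    """Guess 'gtf' or 'gff3' from the attribute column (column 9).

    Hypothetical helper: it only inspects the first usable feature line.
    """
    for line in lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = fields[8]
        # GTF style: gene_id "x"; transcript_id "y";
        if re.search(r'\b(gene_id|transcript_id) "', attrs):
            return "gtf"
        # GFF3 style: ID=x;Parent=y
        if re.search(r"(^|;)\s*(ID|Parent)=", attrs):
            return "gff3"
    return "unknown"


# Made-up example records, one per style.
gff3_record = "chr1\tsrc\tmRNA\t100\t200\t.\t+\t.\tID=mRNA:x.1;Parent=gene:x\n"
gtf_record = 'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
print(guess_annotation_format([gff3_record]))
print(guess_annotation_format([gtf_record]))
```

If the result is gff3, the file has no transcript_id or gene_id attributes to read directly, which is the situation the --gff3 flag is meant for.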

NeillGibson commented 7 years ago

Hi Rory,

Thank you for the tip.

Adding the --gff3 flag made the reference genome installation finish without errors.
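For reference, the working invocation is the original command with --gff3 appended. The sketch below only assembles and prints that command line from the options already used in this thread; the helper name setup_genome_command is hypothetical, and nothing is executed.

```python
import shlex


def setup_genome_command(fasta, name, build, annotation,
                         indexes=("bwa", "seq"), gff3=False):
    """Build the argument list for bcbio_setup_genome.py (hypothetical helper)."""
    cmd = ["bcbio_setup_genome.py", "-f", fasta, "-n", name, "-b", build,
           "-i", *indexes, "-g", annotation]
    if gff3:
        # Tell the installer that the annotation is GFF3 rather than GTF.
        cmd.append("--gff3")
    return cmd


cmd = setup_genome_command(
    "S_lycopersicum_chromosomes.2.50.fa",
    "test_SnpEff",
    "test_SnpEff",
    "ITAG2.4_gene_models.gff3",
    gff3=True,
)
print(shlex.join(cmd))
```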

I did not see the --gff3 flag on this documentation page, but I could have found it by just running bcbio_setup_genome.py -h: http://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#adding-custom-genomes

The files that are now installed for the reference genome are below. The extra files that I see are bowtie indexes, tophat indexes and some other RNA-seq related files.

I don't see a specific folder or file for snpEff. I expected to see a file called something like snpEffectPredictor.bin as the snpEff database.

/Tools/bcbio-0.9.9/genomes/test_SnpEff/
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/seq
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/seq/test_SnpEff.fa
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/seq/test_SnpEff.fa.fai
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/seq/test_SnpEff.dict
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/seq/tx
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/seq/test_SnpEff-resources.yaml
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bwa
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bwa/test_SnpEff.fa.pac
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bwa/test_SnpEff.fa.ann
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bwa/test_SnpEff.fa.amb
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bwa/test_SnpEff.fa.bwt
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bwa/test_SnpEff.fa.sa
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bowtie2
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bowtie2/test_SnpEff.3.bt2
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bowtie2/test_SnpEff.4.bt2
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bowtie2/test_SnpEff.1.bt2
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bowtie2/test_SnpEff.2.bt2
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bowtie2/test_SnpEff.rev.1.bt2
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bowtie2/test_SnpEff.rev.2.bt2
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/bowtie2/test_SnpEff.fa
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/version.txt
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/ref-transcripts.gtf
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/ref-transcripts.gtf.db
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/ref-transcripts.genePred
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/ref-transcripts.refFlat
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/ref-transcripts.bed
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/tx2gene.csv
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/tx
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/tophat
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/tophat/test_SnpEff_transcriptome.gff
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/tophat/test_SnpEff_transcriptome.fa
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/tophat/test_SnpEff_transcriptome.fa.tlst
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/tophat/test_SnpEff_transcriptome.ver
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/tophat/test_SnpEff_transcriptome.3.bt2
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/tophat/test_SnpEff_transcriptome.4.bt2
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/tophat/test_SnpEff_transcriptome.1.bt2
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/tophat/test_SnpEff_transcriptome.2.bt2
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/tophat/test_SnpEff_transcriptome.rev.1.bt2
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/tophat/test_SnpEff_transcriptome.rev.2.bt2
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/kallisto
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/kallisto/test_SnpEff
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff/rnaseq-2016-10-27/ref-transcripts.fa
/Tools/bcbio-0.9.9/genomes/test_SnpEff/test_SnpEff-rnaseq-2016-10-27.tar.xz

I tried to align, variant call and effect predict a few samples against this reference genome with the following yaml, but it did not produce a VCF with effect predictions.

# Template for whole genome Illumina variant calling with FreeBayes
# This is a GATK-free pipeline without post-alignment BAM pre-processing
# (recalibration and realignment)
---
details:
  - analysis: variant2
    genome_build: test_SnpEff
    description:
    # to do multi-sample variant calling, assign samples the same metadata / batch
    metadata:
      batch: project_123
    algorithm:
      aligner: bwa
      mark_duplicates: true
      recalibrate: false
      realign: false
      variantcaller: freebayes
      nomap_split_targets: 3000
      effects: snpeff
      tools_off:
      - gemini
      # for targeted projects, set the region
      # variant_regions: /path/to/your.bed
resources:
    freebayes:
        options: [--genotype-qualities, --min-mapping-quality 20]

The log file does not really mention effect prediction. There is just Annotate VCF file, but no time is spent there.

[2016-10-28T08:37Z] gridmaster: ipython: concat_variant_files
[2016-10-28T08:37Z] gridmaster: Timing: variant post-processing
[2016-10-28T08:37Z] gridmaster: ipython: postprocess_variants
[2016-10-28T08:37Z] node17: Finalizing variant calls: project_123_02, freebayes
[2016-10-28T08:37Z] node17: Calculating variation effects for project_123_02, freebayes
[2016-10-28T08:37Z] node17: Annotate VCF file: project_123_02, freebayes
[2016-10-28T08:37Z] node17: Filtering for project_123_02, freebayes
[2016-10-28T08:54Z] node17: Prioritization for project_123_02, freebayes
[2016-10-28T08:54Z] node17: Germline extraction for project_123_02, freebayes
[2016-10-28T08:54Z] gridmaster: ipython: split_variants_by_sample
[2016-10-28T09:50Z] gridmaster: Timing: prepped BAM merging
[2016-10-28T09:50Z] gridmaster: Timing: validation
[2016-10-28T09:50Z] gridmaster: ipython: compare_to_rm
[2016-10-28T09:50Z] gridmaster: Timing: ensemble calling
[2016-10-28T09:50Z] gridmaster: Timing: validation summary
[2016-10-28T09:50Z] gridmaster: Timing: structural variation precall
[2016-10-28T09:50Z] gridmaster: ipython: detect_sv
[2016-10-28T09:50Z] gridmaster: Timing: structural variation
[2016-10-28T09:50Z] gridmaster: ipython: detect_sv
[2016-10-28T09:50Z] gridmaster: Timing: structural variation ensemble
[2016-10-28T09:50Z] gridmaster: ipython: detect_sv
[2016-10-28T09:50Z] gridmaster: Timing: structural variation validation
[2016-10-28T09:50Z] gridmaster: ipython: validate_sv
[2016-10-28T09:50Z] gridmaster: Timing: heterogeneity
[2016-10-28T09:50Z] gridmaster: ipython: heterogeneity_estimate
[2016-10-28T09:50Z] gridmaster: Timing: population database
[2016-10-28T09:50Z] gridmaster: ipython: prep_gemini_db
[2016-10-28T10:03Z] gridmaster: Timing: quality control

Is there something else that I need to do to install the snpEff database and / or to run snpEff?

Thank you.

NeillGibson commented 7 years ago

Hi @roryk. Did you already have a chance to look at what is going wrong with building the custom snpEff database?

Or could you confirm that I am missing a piece of information about how snpEff with custom genomes is supposed to work in bcbio?

Thank you very much!

roryk commented 7 years ago

Hi Neil,

I'm so sorry for not getting back to you, I suck. I saw it looked resolved but missed the rest of the problem. For custom genomes we don't grab and pull down extra annotations like snpEff, because we don't know what they should be; we just use the provided GTF file and the genome, and that is it.

I think you can add a snpEff database yourself, though. bcbio finds the snpEff database by looking in the snpeff directory, under the name given by the snpeff alias in the seq/build-resources.yaml file in the genome directory for your build. For example, for human:

version: 26

aliases:
  human: true
  snpeff: GRCh37.75
  ensembl: homo_sapiens_vep_83_GRCh37

Let's say you named your build something evocative like Lyco2.5.

If you put the snpEff files that match the build in the Lyco2.5/snpeff/Lyco2.5 directory and add the snpeff alias to the seq/Lyco2.5-resources.yaml file, bcbio should pick up the annotations.
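The layout described here can be checked with a few lines of Python. This is a minimal sketch under stated assumptions, not bcbio's own lookup code: the function name is hypothetical, the alias is found with a naive line scan rather than a YAML parser, and snpEffectPredictor.bin is assumed to be the database file to look for, as mentioned earlier in this thread.

```python
import os


def check_snpeff_layout(build_dir, build, db_name):
    """Report whether the snpeff alias and database directory look in place.

    build_dir is the genome directory for the build, i.e. the one that
    contains seq/ (hypothetical helper, naive YAML handling).
    """
    resources = os.path.join(build_dir, "seq", f"{build}-resources.yaml")
    db_dir = os.path.join(build_dir, "snpeff", db_name)
    alias_set = False
    if os.path.exists(resources):
        in_aliases = False
        with open(resources) as handle:
            for line in handle:
                stripped = line.strip()
                if stripped and not line.startswith((" ", "\t")):
                    # A new top-level key starts or ends the aliases section.
                    in_aliases = stripped == "aliases:"
                elif in_aliases and stripped.startswith("snpeff:"):
                    alias_set = stripped.split(":", 1)[1].strip() == db_name
    return {
        "resources_yaml": resources,
        "snpeff_dir": db_dir,
        "alias_set": alias_set,
        # Assumed database file name inside the snpEff directory.
        "db_present": os.path.isfile(os.path.join(db_dir, "snpEffectPredictor.bin")),
    }
```

For the example build, the call would be `check_snpeff_layout("/path/to/genomes/<name>/Lyco2.5", "Lyco2.5", "Lyco2.5")`, where the genomes path is a placeholder for your installation.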

NeillGibson commented 7 years ago

Hi Rory,

No worries, thank you for the response.

For now I solved my issue by manually building the SnpEff database from the GFF3 file outside of bcbio and also running SnpEff outside of bcbio. This works fine, since it is just a single command to run on the final VCF file.

With the new information I can try to get SnpEff running under bcbio.

Thank you.

lpantano commented 7 years ago

Hi Neil,

I will close this now. Let us know if you find more issues.

Thanks!