arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

ValueError: Processing failed on GEMINI chunk load #744

Closed komalsrathi closed 8 years ago

komalsrathi commented 8 years ago

Hello,

I installed gemini v0.18.3 as follows:

wget https://raw.github.com/arq5x/gemini/master/gemini/scripts/gemini_install.py
/usr/bin/python gemini_install.py /home/rathik/tools /home/rathik/data

# I got these messages at the end of the installation:
    Installing base gemini package...
    /home/rathik/data/anaconda/bin/conda install --yes -c bioconda gemini
    /home/rathik/data/anaconda/bin/gemini --annotation-dir /home/rathik/data/gemini_data update --dataonly --tooldir /home/rathik/tools
    Finished: gemini, tools and data installed
    Tools installed in:
        /home/rathik/tools
    NOTE: be sure to add /home/rathik/tools/bin to your PATH.
    Data installed in: 
        /home/rathik/data

# post installation
which gemini
/home/rathik/tools/bin/gemini

gemini -v
gemini 0.18.3

# gemini command
gemini load --cores 3 -t snpEff -v sample.snpeff.vcf.bgz -p /home/rathik/ped_file/samples.ped sample.gemini.db

This is giving me the following error:

Bgzipping sample.snpeff.vcf.bgz into sample.snpeff.vcf.bgz.gz.
Indexing sample.snpeff.vcf.bgz.gz with grabix.
Loading 100262 variants.
Breaking sample.snpeff.vcf.bgz.gz into 3 chunks.
Loading chunk 0.
Loading chunk 1.
Loading chunk 2.
[E::bcf_hdr_read] invalid BCF2 magic string
[E::bcf_hdr_read] invalid BCF2 magic string
[E::bcf_hdr_read] invalid BCF2 magic string
/bin/sh: line 1: 78750 Done                    grabix grab sample.snpeff.vcf.bgz.gz 66841 100262
     78751 Segmentation fault      | gemini load_chunk -v - -t snpEff -p /home/rathik/grin_test/samples.ped --skip-gerp-bp --skip-cadd --skip-info-string --tempdir /mnt/lustre/users/rathik/scratch -o 66841 /mnt/lustre/users/rathik/scratch/sample.snpeff.vcf.bgz.chunk2.db
/bin/sh: line 1: 78747 Done                    grabix grab sample.snpeff.vcf.bgz.gz 33421 66840
     78748 Segmentation fault      | gemini load_chunk -v - -t snpEff -p /home/rathik/grin_test/samples.ped --skip-gerp-bp --skip-cadd --skip-info-string --tempdir /mnt/lustre/users/rathik/scratch -o 33421 /mnt/lustre/users/rathik/scratch/sample.snpeff.vcf.bgz.chunk1.db
/bin/sh: line 1: 78744 Done                    grabix grab sample.snpeff.vcf.bgz.gz 1 33420
     78745 Segmentation fault      | gemini load_chunk -v - -t snpEff -p /home/rathik/grin_test/samples.ped --skip-gerp-bp --skip-cadd --skip-info-string --tempdir /mnt/lustre/users/rathik/scratch -o 1 /mnt/lustre/users/rathik/scratch/sample.snpeff.vcf.bgz.chunk0.db
Traceback (most recent call last):
  File "/home/rathik/tools/bin/gemini", line 6, in <module>
    gemini.gemini_main.main()
  File "/home/rathik/data/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1185, in main
    args.func(parser, args)
  File "/home/rathik/data/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 198, in load_fn
    gemini_load.load(parser, args)
  File "/home/rathik/data/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 49, in load
    load_multicore(args)
  File "/home/rathik/data/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 74, in load_multicore
    chunks = load_chunks_multicore(grabix_file, args)
  File "/home/rathik/data/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 252, in load_chunks_multicore
    wait_until_finished(procs)
  File "/home/rathik/data/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 343, in wait_until_finished
    raise ValueError("Processing failed on GEMINI chunk load")
ValueError: Processing failed on GEMINI chunk load
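
For anyone reading the log above: the three `grabix grab` ranges (1 to 33420, 33421 to 66840, 66841 to 100262) are what you get when 100262 variants are split into three contiguous, 1-based chunks and the last chunk absorbs the remainder. The sketch below reproduces only that arithmetic. It is not GEMINI's code, and `chunk_ranges` is a hypothetical name used for illustration.

```python
def chunk_ranges(total, n_chunks):
    """Split records 1..total into n_chunks contiguous 1-based (start, end) ranges.

    Illustration only: the last chunk takes whatever is left over after
    integer division, which matches the ranges printed in the log.
    """
    if total < 1 or n_chunks < 1:
        raise ValueError("total and n_chunks must be positive")
    size = total // n_chunks
    ranges = []
    start = 1
    for i in range(n_chunks):
        # Every chunk but the last has exactly `size` records.
        end = total if i == n_chunks - 1 else start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


if __name__ == "__main__":
    for start, end in chunk_ranges(100262, 3):
        print(start, end)
```

All three chunks fail in the log, which suggests the problem is with the input or the setup rather than with one particular slice of the file.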

Note: I checked the data folder /home/rathik/data/gemini_data/, and by default gemini loads the hg19/GRCh37 databases, whereas my samples use hg38/GRCh38 as the reference. Could that be the problem? If so, how can one load the hg38 databases during installation?
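
One way to double-check which build a VCF was produced against is to look at its header. The sketch below collects the `##reference=` and `##contig=` meta lines, which are standard VCF header lines; whether a given file actually carries them depends on the tool that wrote it. `reference_hints` and `open_text` are hypothetical helper names, not part of gemini.

```python
import gzip


def open_text(path):
    """Open a plain or gzip/bgzip-compressed file as text, based on its magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        # bgzip output is a series of gzip members, which the gzip module can read.
        return gzip.open(path, "rt")
    return open(path, "r")


def reference_hints(path):
    """Return the header lines that hint at the reference build."""
    hints = []
    with open_text(path) as fh:
        for line in fh:
            if not line.startswith("##"):
                break  # meta lines end at the #CHROM column header
            if line.startswith(("##reference=", "##contig=")):
                hints.append(line.rstrip("\n"))
    return hints
```

Calling `reference_hints("sample.snpeff.vcf.gz")` and reading the returned lines should show whether the header mentions GRCh37/hg19 or GRCh38/hg38.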

More details: to run the tests, I followed the steps below:

git clone https://github.com/arq5x/gemini.git
nohup cd gemini && bash master-test.sh > test_out.txt 2> test_log.txt &

I got the same error (line 85 in test_log.txt); the tests went past it but eventually failed. I have attached the test_out.txt and test_log.txt files.

test_out.txt test_log.txt

Thanks, Komal

brentp commented 8 years ago

Your file name is sample.snpeff.vcf.bgz; please rename to sample.snpeff.vcf.gz and let us know if the problem persists.

komalsrathi commented 8 years ago

Hi,

My file was already bgzipped. I tried the same command on:

  1. sample.snpeff.vcf
  2. sample.snpeff.vcf.gz (I gzipped it)
  3. renaming sample.snpeff.vcf.bgz to sample.snpeff.vcf.gz

All three give me the following error:

Traceback (most recent call last):
  File "/home/rathik/tools/bin/gemini", line 6, in <module>
    gemini.gemini_main.main()
  File "/home/rathik/data/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1185, in main
    args.func(parser, args)
  File "/home/rathik/data/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 198, in load_fn
    gemini_load.load(parser, args)
  File "/home/rathik/data/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 24, in load
    annos = annotations.get_anno_files(args)
  File "/home/rathik/data/anaconda/lib/python2.7/site-packages/gemini/annotations.py", line 17, in get_anno_files
    anno_dirname = config["annotation_dir"]
TypeError: 'NoneType' object has no attribute '__getitem__'

brentp commented 8 years ago

You'll need to bgzip the file, not gzip it. Can you show the full commands that you are running, along with the full output?
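
If it is unclear whether a `.gz` file came from bgzip or from plain gzip, the header can tell them apart. Both start with the gzip magic bytes, but a BGZF block sets the FEXTRA flag and carries an extra subfield tagged `BC` with a 2-byte payload. The sketch below checks only the first block's header; `is_bgzf` is a hypothetical helper, not a gemini or htslib function.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF (bgzip) block header."""
    with open(path, "rb") as fh:
        header = fh.read(12)
        if len(header) < 12:
            return False
        id1, id2, method, flags = header[0], header[1], header[2], header[3]
        # gzip magic, deflate method, and the FEXTRA bit (0x04) must all be present.
        if (id1, id2, method) != (0x1F, 0x8B, 8) or not flags & 0x04:
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = fh.read(xlen)

    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False
```

A file written by plain `gzip` normally has no extra field at all, so `is_bgzf` returns False for it even though the file name ends in `.gz`.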

komalsrathi commented 8 years ago

Thank you. These are my commands:

snpEff -Xmx32g -Xms16g -Djava.io.tmpdir=/users/rathik/scratch -c snpEff.GRCh38.config -ud 10 -classic GRCh38.82 sample.vcf > sample.snpeff.vcf
bgzip -c sample.snpeff.vcf > sample.snpeff.vcf.gz
tabix -p vcf sample.snpeff.vcf.gz
gemini load --cores 3 -t snpEff -v sample.snpeff.vcf.gz -p samples.ped sample.gemini.db

However, no output files are created and I am still getting the following error:

Traceback (most recent call last):
  File "/home/rathik/tools/bin/gemini", line 6, in <module>
    gemini.gemini_main.main()
  File "/home/rathik/data/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1185, in main
    args.func(parser, args)
  File "/home/rathik/data/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 198, in load_fn
    gemini_load.load(parser, args)
  File "/home/rathik/data/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 24, in load
    annos = annotations.get_anno_files(args)
  File "/home/rathik/data/anaconda/lib/python2.7/site-packages/gemini/annotations.py", line 17, in get_anno_files
    anno_dirname = config["annotation_dir"]
TypeError: 'NoneType' object has no attribute '__getitem__'
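
A note on reading this traceback: the failing line subscripts `config`, and this error message means `config` was `None` at that point, i.e. no configuration had been loaded. The thread does not show why it was missing. The sketch below only reproduces the failure mode and shows a guard that fails with a clearer message; `get_annotation_dir` is a hypothetical helper, not gemini's code. Under Python 3 the same mistake reads `'NoneType' object is not subscriptable`.

```python
def get_annotation_dir(config):
    """Look up 'annotation_dir', failing clearly when no config was loaded."""
    if config is None:
        raise RuntimeError(
            "no configuration loaded; cannot look up 'annotation_dir'"
        )
    return config["annotation_dir"]


if __name__ == "__main__":
    config = None  # stands in for a configuration that could not be loaded
    try:
        config["annotation_dir"]  # the same kind of lookup as in the traceback
    except TypeError as exc:
        print("unguarded lookup:", exc)
    try:
        get_annotation_dir(config)
    except RuntimeError as exc:
        print("guarded lookup:", exc)
```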

I would again like to reiterate that gemini_data contains hg19-specific annotations, whereas my reference is hg38.

brentp commented 8 years ago

If you are trying to use gemini on a genome other than hg19, you'll need to follow the example here, including setting up your own annotations: http://quinlanlab.org/blog/2016/05/02/gemini-2-progress.html

komalsrathi commented 8 years ago

I missed a very important part while reading the documentation. Thank you for your prompt responses. I will close this issue for now.

All that is required is that the researcher collect the annotation files relevant to the species and build of interest and create a vcfanno configuration file that dictates the exact annotation files and attributes that are desired.
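
To make that last point concrete: a vcfanno configuration file is TOML, with one `[[annotation]]` table per annotation file listing the `fields` to pull, the `ops` to apply, and the `names` to write into the annotated VCF. The sketch below generates one such stanza. The file name and field names are placeholders, not files shipped with gemini, and `vcfanno_annotation` is a hypothetical helper.

```python
import json


def vcfanno_annotation(file, fields, ops, names):
    """Render one [[annotation]] stanza of a vcfanno TOML config.

    fields, ops and names are parallel lists: one op and one output
    name per source field.
    """
    if not (len(fields) == len(ops) == len(names)):
        raise ValueError("fields, ops and names must have the same length")
    lines = [
        "[[annotation]]",
        # json.dumps produces double-quoted strings and arrays that are valid TOML here.
        "file = " + json.dumps(file),
        "fields = " + json.dumps(fields),
        "ops = " + json.dumps(ops),
        "names = " + json.dumps(names),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder annotation file and INFO field, for illustration only.
    print(vcfanno_annotation("my_annotations.GRCh38.vcf.gz", ["AF"], ["self"], ["my_af"]))
```

The annotation files themselves still have to be collected for the build in question (GRCh38 here) and bgzipped and tabix-indexed, like the input VCF.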