arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

Problem during gemini load #917

Closed CarlosGAH closed 5 years ago

CarlosGAH commented 5 years ago

Hi everybody. First of all, gemini is a great tool. I am having a problem loading a whole-genome VCF from a trio (father, mother and son). I am using gemini devel version 0.30. The command is:

gemini load -v filtered_normalized_annotated.vcf -p P_Trio.ped -t snpEff --cores 3 Trio_1.db

Everything goes smoothly until this error arises:

pid 8503: 239999 variants processed.
pid 8506: 239999 variants processed.
pid 8509: 249999 variants processed.
Traceback (most recent call last):
  File "/usr/local/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1249, in main
    args.func(parser, args)
  File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 311, in loadchunk_fn
    gemini_load_chunk.load(parser, args)
  File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 918, in load
    gemini_loader.populate_from_vcf()
  File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 223, in populate_from_vcf
    (variant, variant_impacts) = self._prepare_variation(var, anno_keys)
  File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 407, in _prepare_variation
    clinvar_info = annotations.get_clinvar_info(var)
  File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/annotations.py", line 648, in get_clinvar_info
    clinvar.clinvar_sig = info_map['CLNSIG'].lower()
KeyError: 'CLNSIG'
pid 8506: 249999 variants processed.
pid 8509: 259999 variants processed.
pid 8506: 259999 variants processed.

It looks like the chunk processed by pid 8503 fails completely and is never resumed. I have the latest versions of the gemini annotation databases (update --dataonly). I have loaded VCFs from other trios (exome instead of genome) with the same setup, and this problem did not appear. Could it be a problem within the VCF (a problematic SNP?) or a bug in the devel version of gemini? I am completely lost.
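Editor's note: the traceback above ends in a plain dictionary lookup, `info_map['CLNSIG']`, so the `KeyError` indicates that some ClinVar record matched to a variant carries no `CLNSIG` key in its INFO field. The sketch below reproduces that situation and shows one tolerant way to read the key. Only the name `info_map` comes from the traceback; `parse_info`, `clinvar_sig`, and the example INFO strings are hypothetical, and this is not necessarily how gemini parses ClinVar or how the bug was eventually fixed.

```python
def parse_info(info_field):
    """Parse a VCF INFO string such as 'A=1;B=2;FLAG' into a dict."""
    info_map = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue  # skip empty entries and the VCF "missing" marker
        key, _, value = entry.partition("=")
        info_map[key] = value  # flags without '=' get an empty string
    return info_map


def clinvar_sig(info_field):
    """Return the lowercased CLNSIG value, or None when the key is absent."""
    info_map = parse_info(info_field)
    sig = info_map.get("CLNSIG")  # .get() avoids the KeyError seen above
    return sig.lower() if sig is not None else None


# Made-up INFO strings: one with CLNSIG, one without.
print(clinvar_sig("ALLELEID=1;CLNSIG=Pathogenic"))
print(clinvar_sig("ALLELEID=2;CLNDN=not_provided"))
```

With a direct `info_map['CLNSIG']` lookup the second record would raise, which matches the symptom that only some chunks of a large VCF fail.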

CarlosGAH commented 5 years ago

I forgot to mention that this error does not interrupt the loading, but at the end the process fails with: ValueError: Processing failed on GEMINI chunk load

brentp commented 5 years ago

thanks for reporting. can you report the output of ls clin* in your gemini data directory that contains all the vcfs and bed annotation files?

CarlosGAH commented 5 years ago

Here it is:

/usr/local/share/gemini/gemini_data$ ls clin*
clinvar_20170130.tidy.vcf.gz
clinvar_20170130.tidy.vcf.gz.tbi
clinvar_20190102.tidy.vcf.gz
clinvar_20190102.tidy.vcf.gz.tbi

brentp commented 5 years ago

would you try manually removing clinvar_20170130.tidy.vcf.gz and clinvar_20170130.tidy.vcf.gz.tbi to make sure gemini is not getting an older version?

CarlosGAH commented 5 years ago

OK, I will try that and report back.

CarlosGAH commented 5 years ago

The same problem appears. I have removed the files as you said, but the problem persists. It does not arise with smaller VCFs (for example, trios from exome sequencing), but with these big VCFs (trios from whole-genome sequencing) it does. If I select a fraction of this big VCF (1/10, for example), the problem does not appear.

viktorstr commented 5 years ago

Dear Brent, I have the same problem: many CLNSIG errors during VCF loading:

Traceback (most recent call last):
  File "/home/viktor/gemini/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/home/viktor/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1249, in main
    args.func(parser, args)
  File "/home/viktor/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 311, in loadchunk_fn
    gemini_load_chunk.load(parser, args)
  File "/home/viktor/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 918, in load
    gemini_loader.populate_from_vcf()
  File "/home/viktor/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 223, in populate_from_vcf
    (variant, variant_impacts) = self._prepare_variation(var, anno_keys)
  File "/home/viktor/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 407, in _prepare_variation
    clinvar_info = annotations.get_clinvar_info(var)
  File "/home/viktor/gemini/anaconda/lib/python2.7/site-packages/gemini/annotations.py", line 648, in get_clinvar_info
    clinvar.clinvar_sig = info_map['CLNSIG'].lower()
KeyError: 'CLNSIG'

and finally it crashes at the end:

Traceback (most recent call last):
  File "/home/viktor/gemini/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/home/viktor/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1249, in main
    args.func(parser, args)
  File "/home/viktor/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 204, in load_fn
    gemini_load.load(parser, args)
  File "/home/viktor/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 49, in load
    load_multicore(args)
  File "/home/viktor/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 93, in load_multicore
    chunks = load_chunks_multicore(grabix_file, args)
  File "/home/viktor/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 264, in load_chunks_multicore
    wait_until_finished(procs)
  File "/home/viktor/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 359, in wait_until_finished
    raise ValueError("Processing failed on GEMINI chunk load")
ValueError: Processing failed on GEMINI chunk load

Any ideas? Thanks
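Editor's note: the final traceback is raised in the parent process from `wait_until_finished(procs)`, which would explain why the other workers keep reporting progress after one chunk hits the `KeyError` and the load only fails at the very end. The sketch below illustrates that general pattern, assuming each chunk is loaded by a separate child process and a failed chunk shows up as a non-zero exit status. The function name and error message are taken from the traceback; the body and `run_workers` are assumptions for illustration, not gemini's source.

```python
import subprocess
import sys


def wait_until_finished(procs):
    """Wait for every worker; raise if any exited with a non-zero status."""
    codes = [proc.wait() for proc in procs]  # blocks until all workers end
    if any(code != 0 for code in codes):
        raise ValueError("Processing failed on GEMINI chunk load")
    return codes


def run_workers(commands):
    """Start one child process per command, then wait for all of them."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return wait_until_finished(procs)


if __name__ == "__main__":
    ok = [sys.executable, "-c", "pass"]
    bad = [sys.executable, "-c", "import sys; sys.exit(1)"]
    try:
        run_workers([ok, bad, ok])  # the healthy workers still run to the end
    except ValueError as err:
        print(err)
```

Under this pattern a single bad record is enough to fail the whole load, even though every other chunk finished normally.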

CarlosGAH commented 5 years ago

I have tried the same big VCF that gave me problems with gemini 20.1 (a clean installation from gemini_install.py on a new computer, updated with --dataonly plus the CADD and GERP scores), and it gave no problems at all. The problem is that the databases are a bit old. Is there a way to update only the databases (for example ClinVar and dbSNP)?

brentp commented 5 years ago

can you get me the portion of the vcf that will recreate the error? I know you said on a small subset you do not see it, but you should be able to find 1 chunk of the file that gives the error. then I can debug and fix this problem for anyone who might encounter it.
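Editor's note: one way to find a small reproducing portion is to cut the VCF into consecutive slices that each keep the full header, then load the slices one at a time with the same `gemini load` command until one fails. A minimal sketch follows; `split_vcf`, `open_vcf`, the output naming scheme, and the chunk size are all hypothetical choices, and it assumes the usual VCF layout in which every `#` header line precedes the first record.

```python
import gzip


def open_vcf(path, mode="rt"):
    """Open a plain or gzip-compressed VCF in text mode."""
    return gzip.open(path, mode) if path.endswith(".gz") else open(path, mode)


def split_vcf(path, records_per_chunk, prefix="chunk"):
    """Write consecutive slices of a VCF, each starting with the full header.

    Returns the list of chunk file paths in input order.
    """
    header, paths = [], []
    out = None
    count = 0
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("#"):
                header.append(line)  # meta lines and the #CHROM line
                continue
            if count % records_per_chunk == 0:
                if out is not None:
                    out.close()
                chunk_path = "%s_%03d.vcf" % (prefix, len(paths))
                out = open(chunk_path, "w")
                out.writelines(header)  # every slice is a valid VCF
                paths.append(chunk_path)
            out.write(line)
            count += 1
    if out is not None:
        out.close()
    return paths
```

Once a failing slice is found, it can be split again with a smaller `records_per_chunk` to narrow the problem down to a handful of records.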

viktorstr commented 5 years ago

Here is a link to download a test VCF. I hope it helps.

http://www.uschovna.cz/en/zasilka/JT42PY9VDAMSXGTN-SGU/?set_lang=en

Best regards,
Viktor

CarlosGAH commented 5 years ago

Here is the portion of the VCF that gave the error: gemini_problem.vcf.gz

brentp commented 5 years ago

thank you very much for the test case. I have a fix for this that will be pushed shortly.

brentp commented 5 years ago

this is now fixed in master and will be a part of the release.