arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

Recommendations for working with VCF 4.2 #884

Closed (stephenwilliams22 closed this issue 6 years ago)

stephenwilliams22 commented 6 years ago

Hi Aaron and Brent, on the GEMINI main page you state "GEMINI is very strict about adherence to VCF format 4.1." However, with the recent update to GATK4, the default, and unmodifiable, output of HaplotypeCaller is VCF 4.2. Do you all have recommendations for working with VCF 4.2? This is causing me a ton of trouble right now.

Find my error output below. I have 2TB of disc space available, so I think that this must have to do with the second error, "IOError: /dev/stdin if not valid bcf or vcf".

Any help would be greatly appreciated!

insert error trying 1 at a time:
sqlalchemy.OperationalError: (sqlite3.OperationalError) database or disk is full [SQL: u'INSERT INTO variants (chrom, start, "end", vcf_id, variant_id, anno_id, ref, alt, qual, filter, type, sub_type, gts, gt_types, gt_phases, gt_depths, gt_ref_depths, gt_alt_depths, gt_alt_freqs, gt_quals, gt_copy_numbers, gt_phred_ll_homref, gt_phred_ll_het, gt_phred_ll_homalt, call_rate, max_aaf_all, in_dbsnp, rs_ids, sv_cipos_start_left, sv_cipos_end_left, sv_cipos_start_right, sv_cipos_end_right, sv_length, sv_is_precise, sv_tool, sv_evidence_type, sv_event_id, sv_mate_id, sv_strand, in_omim, clinvar_sig, clinvar_disease_name, clinvar_dbsource, clinvar_dbsource_id, clinvar_origin, clinvar_dsdb, clinvar_dsdbid, clinvar_disease_acc, clinvar_in_locus_spec_db, clinvar_on_diag_assay, clinvar_causal_allele, clinvar_gene_phenotype, geno2mp_hpo_ct, pfam_domain, cyto_band, rmsk, in_cpg_island, in_segdup, is_conserved, gerp_bp_score, gerp_element_pval, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, hwe, inbreeding_coeff, pi, recomb_rate, gene, transcript, is_exonic, is_coding, is_splicing, is_lof, exon, codon_change, aa_change, aa_length, biotype, impact, impact_so, impact_severity, polyphen_pred, polyphen_score, sift_pred, sift_score, anc_allele, rms_bq, cigar, depth, strand_bias, rms_map_qual, in_hom_run, num_mapq_zero, num_alleles, num_reads_w_dels, haplotype_score, qual_depth, allele_count, allele_bal, in_hm2, in_hm3, is_somatic, somatic_score, in_esp, aaf_esp_ea, aaf_esp_aa, aaf_esp_all, exome_chip, in_1kg, aaf_1kg_amr, aaf_1kg_eas, aaf_1kg_sas, aaf_1kg_afr, aaf_1kg_eur, aaf_1kg_all, grc, gms_illumina, gms_solid, gms_iontorrent, in_cse, encode_tfbs, "encode_dnaseI_cell_count", "encode_dnaseI_cell_list", encode_consensus_gm12878, encode_consensus_h1hesc, encode_consensus_helas3, encode_consensus_hepg2, encode_consensus_huvec, encode_consensus_k562, vista_enhancers, cosmic_ids, info, cadd_raw, cadd_scaled, fitcons, in_exac, aaf_exac_all, aaf_adj_exac_all, aaf_adj_exac_afr, 
aaf_adj_exac_amr, aaf_adj_exac_eas, aaf_adj_exac_fin, aaf_adj_exac_nfe, aaf_adj_exac_oth, aaf_adj_exac_sas, exac_num_het, exac_num_hom_alt, exac_num_chroms, aaf_gnomad_all, aaf_gnomad_afr, aaf_gnomad_amr, aaf_gnomad_asj, aaf_gnomad_eas, aaf_gnomad_fin, aaf_gnomad_nfe, aaf_gnomad_oth, aaf_gnomad_sas, gnomad_num_het, gnomad_num_hom_alt, gnomad_num_chroms, vep_canonical, vep_ccds) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters: (u'chr20', 8731727, 8731728, None, 440751, 1, u'G', u'A', 132.02999877929688, None, 'snp', 'ts', <read-only buffer for 0x117febf0, size -1, offset 0 at 0x7efcfa8974b0>, <read-only buffer for 0x1b856e50, size -1, offset 0 at 0x7efcfa8970b0>, <read-only buffer for 0x108dd000, size -1, offset 0 at 0x7efcfa897170>, <read-only buffer for 0x1692c7c0, size -1, offset 0 at 0x7efcfa897270>, <read-only buffer for 0x12ec0910, size -1, offset 0 at 0x7efcfa8972b0>, <read-only buffer for 0x1b200e50, size -1, offset 0 at 0x7efcfa8973b0>, <read-only buffer for 0x116a6590, size -1, offset 0 at 0x7efcfa8978b0>, <read-only buffer for 0x1a5f65f0, size -1, offset 0 at 0x7efcfa897930>, <read-only buffer for 0x12e0fab0, size -1, offset 0 at 0x7efcfa897970>, <read-only buffer for 0x12e0fb30, size -1, offset 0 at 0x7efcfa8979f0>, <read-only buffer for 0x22e3e2a0, size -1, offset 0 at 0x7efcfa897a30>, <read-only buffer for 0xe4741e0, size -1, offset 0 at 0x7efcfa897ab0>, 1.0, 0.6905, 1, 'rs6118300', None, None, None, None, None, 1, None, None, None, None, None, 0, None, None, None, None, None, None, None, 
None, 0, 0, None, u'early_infantile_epileptic_encephalopathy\\x2c_autosomal_recessive|early_infantile_epileptic_encephalopathy_12', -1, None, 'chr20p12.3', 'SINE_Alu_AluY', 0, 0, 0, 0.49300000071525574, None, 0, 0, 1, 0, 1.0, 1.0, None, 0.0, 0.163384, u'PLCB1', u'ENST00000338037', 0, 0, 0, 0, '', '', '', u'', u'protein_coding', u'intron_variant', u'intron_variant', 'LOW', u'', None, u'', None, None, None, None, 4, None, 50.060001373291016, None, None, 2, None, None, 33.0099983215332, 2, None, None, None, None, None, 0, -1.0, -1.0, -1.0, 0, 1, 0.5634, 0.6905, 0.4213, 0.3585, 0.4115, 0.476637, None, None, None, None, 0, None, None, None, 'R', 'R', 'R', 'T', 'CTCF', 'R', None, None, <read-only buffer for 0x12e0faf0, size -1, offset 0 at 0x7efcfa897af0>, 0.46, 7.14, 0.055872, 0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1, -1, -1, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1, -1, -1, u'YES', u'CCDS13102.1')]
Traceback (most recent call last):
  File "/mnt/home/stephen/Apps/gemini_tools/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1244, in main
    args.func(parser, args)
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 311, in loadchunk_fn
    gemini_load_chunk.load(parser, args)
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 910, in load
    gemini_loader = GeminiLoader(args)
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 100, in __init__
    self.vcf_reader = self._get_vcf_reader()
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 284, in _get_vcf_reader
    return vcf.VCFReader(self.args.vcf)
  File "cyvcf2/cyvcf2.pyx", line 183, in cyvcf2.cyvcf2.VCF.__init__ (cyvcf2/cyvcf2.c:7093)
IOError: /dev/stdin if not valid bcf or vcf
insert error trying 1 at a time:
sqlalchemy.OperationalError: (sqlite3.OperationalError) database or disk is full [SQL: u'INSERT INTO variants (chrom, start, "end", vcf_id, variant_id, anno_id, ref, alt, qual, filter, type, sub_type, gts, gt_types, gt_phases, gt_depths, gt_ref_depths, gt_alt_depths, gt_alt_freqs, gt_quals, gt_copy_numbers, gt_phred_ll_homref, gt_phred_ll_het, gt_phred_ll_homalt, call_rate, max_aaf_all, in_dbsnp, rs_ids, sv_cipos_start_left, sv_cipos_end_left, sv_cipos_start_right, sv_cipos_end_right, sv_length, sv_is_precise, sv_tool, sv_evidence_type, sv_event_id, sv_mate_id, sv_strand, in_omim, clinvar_sig, clinvar_disease_name, clinvar_dbsource, clinvar_dbsource_id, clinvar_origin, clinvar_dsdb, clinvar_dsdbid, clinvar_disease_acc, clinvar_in_locus_spec_db, clinvar_on_diag_assay, clinvar_causal_allele, clinvar_gene_phenotype, geno2mp_hpo_ct, pfam_domain, cyto_band, rmsk, in_cpg_island, in_segdup, is_conserved, gerp_bp_score, gerp_element_pval, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, hwe, inbreeding_coeff, pi, recomb_rate, gene, transcript, is_exonic, is_coding, is_splicing, is_lof, exon, codon_change, aa_change, aa_length, biotype, impact, impact_so, impact_severity, polyphen_pred, polyphen_score, sift_pred, sift_score, anc_allele, rms_bq, cigar, depth, strand_bias, rms_map_qual, in_hom_run, num_mapq_zero, num_alleles, num_reads_w_dels, haplotype_score, qual_depth, allele_count, allele_bal, in_hm2, in_hm3, is_somatic, somatic_score, in_esp, aaf_esp_ea, aaf_esp_aa, aaf_esp_all, exome_chip, in_1kg, aaf_1kg_amr, aaf_1kg_eas, aaf_1kg_sas, aaf_1kg_afr, aaf_1kg_eur, aaf_1kg_all, grc, gms_illumina, gms_solid, gms_iontorrent, in_cse, encode_tfbs, "encode_dnaseI_cell_count", "encode_dnaseI_cell_list", encode_consensus_gm12878, encode_consensus_h1hesc, encode_consensus_helas3, encode_consensus_hepg2, encode_consensus_huvec, encode_consensus_k562, vista_enhancers, cosmic_ids, info, cadd_raw, cadd_scaled, fitcons, in_exac, aaf_exac_all, aaf_adj_exac_all, aaf_adj_exac_afr, 
aaf_adj_exac_amr, aaf_adj_exac_eas, aaf_adj_exac_fin, aaf_adj_exac_nfe, aaf_adj_exac_oth, aaf_adj_exac_sas, exac_num_het, exac_num_hom_alt, exac_num_chroms, aaf_gnomad_all, aaf_gnomad_afr, aaf_gnomad_amr, aaf_gnomad_asj, aaf_gnomad_eas, aaf_gnomad_fin, aaf_gnomad_nfe, aaf_gnomad_oth, aaf_gnomad_sas, gnomad_num_het, gnomad_num_hom_alt, gnomad_num_chroms, vep_canonical, vep_ccds) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters: (u'chr6', 38840600, 38840601, None, 646126, 1, u'G', u'A', 757.77001953125, None, 'snp', 'ts', <read-only buffer for 0xe1ddb60, size -1, offset 0 at 0x7fb076937370>, <read-only buffer for 0x185d0340, size -1, offset 0 at 0x7fb0769373b0>, <read-only buffer for 0x17c38cc0, size -1, offset 0 at 0x7fb076937270>, <read-only buffer for 0x1bd15130, size -1, offset 0 at 0x7fb0769372f0>, <read-only buffer for 0xe0bd140, size -1, offset 0 at 0x7fb0769373f0>, <read-only buffer for 0xe4d3350, size -1, offset 0 at 0x7fb076937430>, <read-only buffer for 0xe4d3410, size -1, offset 0 at 0x7fb0769374b0>, <read-only buffer for 0xdadf0e0, size -1, offset 0 at 0x7fb076937570>, <read-only buffer for 0xdadf1a0, size -1, offset 0 at 0x7fb0769375b0>, <read-only buffer for 0x190fbcb0, size -1, offset 0 at 0x7fb076937530>, <read-only buffer for 0x1b97ce30, size -1, offset 0 at 0x7fb076937230>, <read-only buffer for 0x1b97cef0, size -1, offset 0 at 0x7fb0769372b0>, 1.0, 0.3705158264947245, 1, 'rs2235719', None, None, None, None, None, 1, None, None, None, None, None, 0, None, None, None, None, None, None, 
None, None, 0, 0, None, u'kartagener_syndrome|malignant_tumor_of_prostate|primary_ciliary_dyskinesia', -1, None, 'chr6p21.2', None, 0, 0, 0, 1.0, 1.55057e-124, 0, 1, 0, 0, 0.5, 0.31731050767247415, -1.0, 1.0, 0.050053, u'DNAH8', u'ENST00000327475', 0, 0, 0, 0, '', '', '', u'', u'protein_coding', u'intron_variant', u'intron_variant', 'LOW', u'', None, u'', None, None, None, None, 52, None, 60.0, None, None, 2, None, None, 15.15999984741211, 1, None, None, None, None, None, 1, 0.23982321470109327, 0.05449591280653951, 0.17704968466389787, 0, 1, 0.2176, 0.3512, 0.3415, 0.0227, 0.1968, 0.213059, None, None, None, None, 0, None, None, None, 'R', 'unknown', 'R', 'R', 'R', 'T', None, 'COSN167761', <read-only buffer for 0x190fbc70, size -1, offset 0 at 0x7fb076937470>, -0.67, 0.08, 0.070013, 1, 0.236, 0.23674711821553918, 0.05113522202129797, 0.23621227887617066, 0.3634755953784485, 0.2016055740684641, 0.2311461568615626, 0.21777777777777776, 0.3218586546648826, 21074, 3696, 120238, 0.238878, 0.04639852786540484, 0.23937036366925998, 0.18038366336633663, 0.3705158264947245, 0.1977625405990617, 0.23670073256760424, 0.2086264346538319, 0.32288783445367075, 0, 7515, 243622, u'', u'')]
Traceback (most recent call last):
  File "/mnt/home/stephen/Apps/gemini_tools/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1244, in main
    args.func(parser, args)
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 311, in loadchunk_fn
    gemini_load_chunk.load(parser, args)
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 910, in load
    gemini_loader = GeminiLoader(args)
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 100, in __init__
    self.vcf_reader = self._get_vcf_reader()
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 284, in _get_vcf_reader
    return vcf.VCFReader(self.args.vcf)
  File "cyvcf2/cyvcf2.pyx", line 183, in cyvcf2.cyvcf2.VCF.__init__ (cyvcf2/cyvcf2.c:7093)
IOError: /dev/stdin if not valid bcf or vcf
Traceback (most recent call last):
  File "/mnt/home/stephen/Apps/gemini_tools/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1244, in main
    args.func(parser, args)
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 204, in load_fn
    gemini_load.load(parser, args)
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 49, in load
    load_multicore(args)
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 93, in load_multicore
    chunks = load_chunks_multicore(grabix_file, args)
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 264, in load_chunks_multicore
    wait_until_finished(procs)
  File "/mnt/home/stephen/Apps/gemini_data/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 359, in wait_until_finished
    raise ValueError("Processing failed on GEMINI chunk load")
ValueError: Processing failed on GEMINI chunk load

arq5x commented 6 years ago

Hi @stephenwilliams22 - I note that the sqlalchemy error is "database or disk is full": are you sure this isn't your problem?

stephenwilliams22 commented 6 years ago

Thanks for the response, Aaron. My disc definitely isn't full (2TB free) and I'm using --passonly to limit the number of variants. I have run this exact sample using freebayes (VCF 4.1) and gemini worked fine. When I switched to GATK (VCF 4.2), everything blew up.

Here's my exact script to load the gemini db:

gemini load -v my.VEP.vcf \
    -t VEP \
    --cores 20 \
    --skip-gene-tables \
    --passonly \
    vep.hg19.db

arq5x commented 6 years ago

I think your tempdir is full. Try setting --tempdir to the same path that my.VEP.vcf is written to.
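
Editor's note: a "database or disk is full" error can appear even when the main volume has plenty of room, if intermediate files land on a smaller temporary filesystem. The sketch below is a minimal way to compare free space in the system temp directory with the working directory before loading. It assumes intermediate files go to the directory that Python's `tempfile.gettempdir()` reports, which the thread does not confirm; the helper names `free_gib` and `report_free_space` are hypothetical.

```python
import shutil
import tempfile


def free_gib(path):
    """Return the free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024 ** 3


def report_free_space(paths):
    """Map each path to its free space in GiB."""
    return {path: free_gib(path) for path in paths}


if __name__ == "__main__":
    # tempfile.gettempdir() honours TMPDIR, TEMP and TMP if they are set.
    # "." stands in for the directory that holds the VCF (an assumption).
    for path, gib in report_free_space([tempfile.gettempdir(), "."]).items():
        print(f"{path}: {gib:.1f} GiB free")
```

If the temp directory reports far less free space than the data directory, pointing --tempdir at the larger volume, as suggested above, is the natural next step.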

stephenwilliams22 commented 6 years ago

Thanks Aaron. This seems to have gotten me over the first hump. Now I seem to have a new error. I recently upgraded to 0.20.1 using conda and am getting this error when trying to load the db:

Traceback (most recent call last):
  File "/mnt/home/stephen/miniconda2/envs/gemini_env/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/mnt/home/stephen/miniconda2/envs/gemini_env/lib/python2.7/site-packages/gemini/gemini_main.py", line 1248, in main
    args.func(parser, args)
  File "/mnt/home/stephen/miniconda2/envs/gemini_env/lib/python2.7/site-packages/gemini/gemini_main.py", line 204, in load_fn
    gemini_load.load(parser, args)
  File "/mnt/home/stephen/miniconda2/envs/gemini_env/lib/python2.7/site-packages/gemini/gemini_load.py", line 23, in load
    annos = annotations.get_anno_files(args)
  File "/mnt/home/stephen/miniconda2/envs/gemini_env/lib/python2.7/site-packages/gemini/annotations.py", line 22, in get_anno_files
    anno_dirname = config["annotation_dir"]
KeyError: 'annotation_dir'

stephenwilliams22 commented 6 years ago

Looks like a fresh gemini install may have cured what ails me. That being said, do you have any suggestions on things to look out for with VCF 4.2?

arq5x commented 6 years ago

To be honest, I am not aware of any issues with 4.2. Have you seen any so far?
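
Editor's note: when comparing runs from different callers, it can help to confirm which VCF version each file actually declares. The VCF specification requires `##fileformat=VCFvX.Y` as the first line of the file, so a minimal check only needs to read that line. The function name `vcf_fileformat` is hypothetical and this is a sketch, not part of GEMINI.

```python
import gzip
import re


def vcf_fileformat(path):
    """Return the version from the ##fileformat line (e.g. "4.2"), or None."""
    # Plain-text and gzip/bgzip-compressed VCFs are both handled.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        first_line = handle.readline().strip()
    match = re.match(r"##fileformat=VCFv(\d+\.\d+)$", first_line)
    return match.group(1) if match else None
```

Calling `vcf_fileformat("my.VEP.vcf")` on the file from this thread would be expected to report the GATK4 default described above, while `None` indicates a missing or malformed header line.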

stephenwilliams22 commented 6 years ago

I'm doing some comparisons now but looks okay at first glance. I'm going to close this issue. Thanks Aaron!

noprobllama1010 commented 1 year ago

> I'm doing some comparisons now but looks okay at first glance. I'm going to close this issue. Thanks Aaron!

Hello! Did you notice any differences, or loss of information?
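
Editor's note: the thread does not record the outcome of that comparison. For anyone wanting to run a similar check, the sketch below is one possible site-level comparison of two VCF files, keyed on (chrom, pos, ref, alt). It does no left-alignment or normalization, so indels represented differently by two callers will show up as differences, and any differences reflect the callers as much as the VCF version. The function names are hypothetical.

```python
import gzip


def variant_keys(path):
    """Collect (chrom, pos, ref, alt) for every data line of a VCF."""
    opener = gzip.open if str(path).endswith(".gz") else open
    keys = set()
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            # Multi-allelic sites are split so each ALT allele gets a key.
            for allele in alt.split(","):
                keys.add((chrom, int(pos), ref, allele))
    return keys


def compare_vcfs(path_a, path_b):
    """Return the variant keys unique to each file and those they share."""
    a, b = variant_keys(path_a), variant_keys(path_b)
    return {"only_a": a - b, "only_b": b - a, "shared": a & b}
```

The sizes of `only_a`, `only_b` and `shared` give a quick first impression; checking for lost INFO or FORMAT fields would need a separate comparison of the header definitions.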