arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

sqlite3 8-bit bytestring error while loading vcf into gemini #904

Open mbootwalla opened 6 years ago

mbootwalla commented 6 years ago

Hello Gemini developers,

I am running into the following error when trying to load a vcf into gemini:

insert error trying 1 at a time:
Traceback (most recent call last):
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1248, in main
    args.func(parser, args)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 311, in loadchunk_fn
    gemini_load_chunk.load(parser, args)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 914, in load
    gemini_loader.populate_from_vcf()
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 244, in populate_from_vcf
    database.insert_variation(self.c, self.metadata, self.var_buffer)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/gemini/database.py", line 452, in insert_variation
    trans.execute(stmt, b)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 945, in execute
    return meth(self, multiparams, params)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
    context)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1402, in _handle_dbapi_exception
    exc_info
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.ProgrammingError: (sqlite3.ProgrammingError) You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
[SQL: u'INSERT INTO variants (chrom, start, "end", vcf_id, variant_id, anno_id, ref, alt, qual, filter, type, sub_type, gts, gt_types, gt_phases, gt_depths, gt_ref_depths, gt_alt_depths, gt_alt_freqs, gt_quals, gt_copy_numbers, gt_phred_ll_homref, gt_phred_ll_het, gt_phred_ll_homalt, call_rate, max_aaf_all, in_dbsnp, rs_ids, sv_cipos_start_left, sv_cipos_end_left, sv_cipos_start_right, sv_cipos_end_right, sv_length, sv_is_precise, sv_tool, sv_evidence_type, sv_event_id, sv_mate_id, sv_strand, in_omim, clinvar_sig, clinvar_disease_name, clinvar_dbsource, clinvar_dbsource_id, clinvar_origin, clinvar_dsdb, clinvar_dsdbid, clinvar_disease_acc, clinvar_in_locus_spec_db, clinvar_on_diag_assay, clinvar_causal_allele, clinvar_gene_phenotype, geno2mp_hpo_ct, pfam_domain, cyto_band, rmsk, in_cpg_island, in_segdup, is_conserved, gerp_bp_score, gerp_element_pval, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, hwe, inbreeding_coeff, pi, recomb_rate, gene, transcript, is_exonic, is_coding, is_splicing, is_lof, exon, codon_change, aa_change, aa_length, biotype, impact, impact_so, impact_severity, polyphen_pred, polyphen_score, sift_pred, sift_score, anc_allele, rms_bq, cigar, depth, strand_bias, rms_map_qual, in_hom_run, num_mapq_zero, num_alleles, num_reads_w_dels, haplotype_score, qual_depth, allele_count, allele_bal, in_hm2, in_hm3, is_somatic, somatic_score, in_esp, aaf_esp_ea,
aaf_esp_aa, aaf_esp_all, exome_chip, in_1kg, aaf_1kg_amr, aaf_1kg_eas, aaf_1kg_sas, aaf_1kg_afr, aaf_1kg_eur, aaf_1kg_all, grc, gms_illumina, gms_solid, gms_iontorrent, in_cse, encode_tfbs, "encode_dnaseI_cell_count", "encode_dnaseI_cell_list", encode_consensus_gm12878, encode_consensus_h1hesc, encode_consensus_helas3, encode_consensus_hepg2, encode_consensus_huvec, encode_consensus_k562, vista_enhancers, cosmic_ids, info, cadd_raw, cadd_scaled, fitcons, in_exac, aaf_exac_all, aaf_adj_exac_all, aaf_adj_exac_afr, aaf_adj_exac_amr, aaf_adj_exac_eas, aaf_adj_exac_fin, aaf_adj_exac_nfe, aaf_adj_exac_oth, aaf_adj_exac_sas, exac_num_het, exac_num_hom_alt, exac_num_chroms, aaf_gnomad_all, aaf_gnomad_afr, aaf_gnomad_amr, aaf_gnomad_asj, aaf_gnomad_eas, aaf_gnomad_fin, aaf_gnomad_nfe, aaf_gnomad_oth, aaf_gnomad_sas, gnomad_num_het, gnomad_num_hom_alt, gnomad_num_chroms, vep_canonical, vep_ccds) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters: (u'chr9', 135781072, 135781077, u'rs118203595', 55004, 1, u'GCTTT', u'G', 0.0, None, 'indel', 'del', <read-only buffer for 0xcff3060, size -1, offset 0 at 0x2ab47f13bef0>, <read-only buffer for 0xcff3120, size -1, offset 0 at 0x2ab47f13bf30>, <read-only buffer for 0xcff31e0, size -1, offset 0 at 0x2ab47f13bf70>, <read-only buffer for 0xcff32a0, size -1, offset 0 at 0x2ab47f13bfb0>, <read-only buffer for 0xcff3360, size -1, offset 0 at 0x2ab47f1e9030>, <read-only buffer for 0xcff3420, size -1, offset 0 at 0x2ab47f1e9070>, <read-only buffer for 0xcff34e0, size -1, offset 0 at 
0x2ab47f1e90b0>, <read-only buffer for 0xcff35a0, size -1, offset 0 at 0x2ab47f1e90f0>, <read-only buffer for 0xcff3660, size -1, offset 0 at 0x2ab47f1e9130>, <read-only buffer for 0xcff98b0, size -1, offset 0 at 0x2ab47f1e9170>, <read-only buffer for 0xcff9970, size -1, offset 0 at 0x2ab47f1e91b0>, <read-only buffer for 0xcff9a30, size -1, offset 0 at 0x2ab47f1e91f0>, 1.0, -1.0, 1, 'rs118203595', None, None, None, None, None, 1, None, None, None, None, None, 0, 'not-provided,pathogenic', 'Tuberous_sclerosis_1|Tuberous_sclerosis_syndrome|not_provided', 'OMIM_Allelic_Variant|Tuberous_sclerosis_database_(TSC1)|Tuberous_sclerosis_database_(TSC1)', '605284.0001|TSC1_00116|TSC1_00116\xc2\xa0', 'germline', 'MedGen:OMIM|MedGen:Orphanet:SNOMED_CT|MedGen', 'C1854465:191100|C0041341:ORPHA805:7199000|CN221809', 'RCV000005403.3|RCV000042099.2|RCV000189868.2', 1, 0, 'G', u'adenoma_sebaceum|autism_spectrum_disorders|cardiac_rhabdomyoma|cortical_dysplasia|cortical_tubers|focal_cortical_dysplasia_of_taylor|focal_cortical_ ... (104 characters truncated) ... 
nant_tumor_of_urinary_bladder|multiple_renal_cysts|renal_cortical_cysts|renal_insufficiency|seizures|tuberous_sclerosis_1|tuberous_sclerosis_syndrome', -1, 'Hamartin', 'chr9q34.13', None, 0, 0, 1, None, 1.54009e-218, 1, 0, 0, 0, 0.0, 1.0, None, 0.0, 0.070327, u'TSC1', u'ENST00000298552', 1, 1, 0, 1, u'15/23', u'AAAGca/ca', u'KA/X', u'630-631/1164', u'protein_coding', u'frameshift_variant', u'frameshift_variant', 'HIGH', u'', None, u'', None, None, None, u'1M4D', None, None, 60.0, None, None, None, None, None, None, None, None, None, None, None, None, 0, -1.0, -1.0, -1.0, 0, 0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, None, None, None, None, 0, None, None, None, 'T', 'T', 'T', 'T', 'T', 'T', None, None, <read-only buffer for 0xcff36a0, size -1, offset 0 at 0x2ab47f1e9230>, None, None, 0.706548, 0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1, -1, -1, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1, -1, -1, u'YES', u'CCDS6956.1')]

Here's the offending line from the vcf file:

9 135781073 rs118203595 GCTTT G 0 PASS CIGAR=1M4D;RU=CTTT;REFREP=2;IDREP=1;MQ=60;clinvar=1|pathogenic,1|not_provided,1|pathogenic;cosmic=1|COSM5010413;CSQT=1|TSC1|ENST00000298552.3|frameshift_variant,1|TSC1|NM_000368.4|frameshift_variant;CSQ=frameshift_variant|AAAGca/ca|KA/X|ENSG00000165699|TSC1|ENST00000298552|15/23|||630-631/1164|protein_coding|YES|CCDS6956.1,frameshift_variant|AAAGca/ca|KA/X|ENSG00000165699|TSC1|ENST00000440111|13/21|||630-631/1164|protein_coding||CCDS6956.1,downstream_gene_variant|||ENSG00000165699|TSC1|ENST00000493467|||||retained_intron||,frameshift_variant|AAAGca/ca|KA/X|ENSG00000165699|TSC1|ENST00000545250|12/20|||579-580/1113|protein_coding||CCDS55350.1,frameshift_variant|AAAGca/ca|KA/X|7248|TSC1|NM_000368.4|15/23|||630-631/1164|protein_coding||,frameshift_variant|AAAGca/ca|KA/X|7248|TSC1|NM_001162426.1|15/23|||629-630/1163|protein_coding||,frameshift_variant|AAAGca/ca|KA/X|7248|TSC1|NM_001162427.1|14/22|||579-580/1113|protein_coding||,frameshift_variant|AAAGca/ca|KA/X|7248|TSC1|XM_005272211.1|15/23|||630-631/1164|protein_coding|YES|,frameshift_variant|AAAGca/ca|KA/X|7248|TSC1|XM_005272212.1|14/22|||630-631/1164|protein_coding|| GT:GQ:GQX:DPI:AD:ADF:ADR:FT:PL 0/0:112:112:41:40,0:24,0:16,0:PASS:0,115,690

I can't figure out what in the vcf file is preventing it from being loaded into gemini. I googled the bytestring error, and it seems to have something to do with compression when storing values as a blob. Here's a relevant post on Stack Overflow.

I would really appreciate any help or insight into this issue and how I can fix my vcf so that it is compatible with gemini.

Thanks,

Moiz
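
One detail in the parameter dump above may be relevant: the value `'605284.0001|TSC1_00116|TSC1_00116\xc2\xa0'` ends in the bytes `\xc2\xa0`, which is the UTF-8 encoding of a non-breaking space (U+00A0), and it is printed as a plain byte string rather than a `u'...'` string. Python 2's `sqlite3` module raises this "8-bit bytestrings" error when a byte-string parameter contains non-ASCII bytes. The value does not appear in the VCF line quoted above, so it may originate from an annotation source rather than from the VCF itself. Below is a minimal sketch of how such values could be located and decoded before an insert. `find_non_ascii` and `to_text` are hypothetical helper names, not part of gemini, and the tuple is a shortened stand-in for the real parameter list.

```python
def find_non_ascii(params):
    """Return (index, value) pairs for byte strings holding bytes above 0x7F."""
    hits = []
    for index, value in enumerate(params):
        # bytearray() yields integers on both Python 2 and Python 3.
        if isinstance(value, bytes) and any(b > 0x7F for b in bytearray(value)):
            hits.append((index, value))
    return hits


def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; leave every other type untouched."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Shortened, illustrative stand-in for the parameters shown in the traceback.
params = (
    b"germline",
    b"605284.0001|TSC1_00116|TSC1_00116\xc2\xa0",
    1,
    None,
    u"TSC1",
)

hits = find_non_ascii(params)
cleaned = tuple(to_text(value) for value in params)
```

Here `hits` points at the second parameter only, and `cleaned` carries it as text ending in U+00A0, which `sqlite3` accepts. Whether decoding, or stripping the trailing non-breaking space, is the right fix inside gemini is an assumption that would need checking against the loader code.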

brentp commented 6 years ago

Hi, can you verify that you have the latest version of gemini?

mbootwalla commented 6 years ago

Hi Brent,

The version of gemini that I am using is 0.20.1.

Thanks,

Moiz

brentp commented 6 years ago

This should be fixed in master. Could you give it a try?

mbootwalla commented 6 years ago

Hi Brent,

I upgraded to the latest devel version of gemini, 0.20.2-dev (both gemini and the associated data), but I still get the same error at the same variant as above. Additionally, I see the following warning pop up multiple times for the same set of consequences:

WARNING: unknown severity for 'frameshift_variant&start_lost&start_retained_variant|Atg/tg|M/X|388697|HRNR|NM_001009931.2|2/3|||1/2850|protein_coding|YES|'. using LOW for [u'frameshift_variant', u'start_lost', u'start_retained_variant']

Let me know if you need any further information to help debug this.

Thanks,

Moiz
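
For readers unfamiliar with the warning's format: the quoted string is one pipe-delimited VEP CSQ entry, and its first field joins several consequence terms with `&`. The list at the end of the warning shows those same terms split apart. A minimal sketch of that split, assuming the consequence is the first pipe-delimited field as it appears in the warning; `split_consequences` is a hypothetical name and this is not gemini's own parsing code:

```python
def split_consequences(csq_entry):
    """Split the first field of a pipe-delimited CSQ entry on '&'."""
    consequence_field = csq_entry.split("|")[0]
    return consequence_field.split("&")


# The CSQ entry quoted in the warning.
entry = (
    "frameshift_variant&start_lost&start_retained_variant"
    "|Atg/tg|M/X|388697|HRNR|NM_001009931.2|2/3|||1/2850|protein_coding|YES|"
)

terms = split_consequences(entry)
```

The warning suggests that no severity is known for this particular combination of terms, so gemini falls back to LOW for the entry.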

mbootwalla commented 6 years ago

Hi @brentp. Any updates regarding the above issue?