arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

sqlite3 8-bit bytestring error while loading vcf into gemini #904

Open mbootwalla opened 6 years ago

mbootwalla commented 6 years ago

Hello Gemini developers,

I am running into the following error when trying to load a vcf into gemini:

insert error trying 1 at a time:
Traceback (most recent call last):
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1248, in main
    args.func(parser, args)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 311, in loadchunk_fn
    gemini_load_chunk.load(parser, args)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 914, in load
    gemini_loader.populate_from_vcf()
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 244, in populate_from_vcf
    database.insert_variation(self.c, self.metadata, self.var_buffer)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/gemini/database.py", line 452, in insert_variation
    trans.execute(stmt, b)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 945, in execute
    return meth(self, multiparams, params)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
    context)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1402, in _handle_dbapi_exception
    exc_info
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/gpfs/fs1/data/bcbio_data_1.0.4/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.ProgrammingError: (sqlite3.ProgrammingError) You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
[SQL: u'INSERT INTO variants (chrom, start, "end", vcf_id, variant_id, anno_id, ref, alt, qual, filter, type, sub_type, gts, gt_types, gt_phases, gt_depths, gt_ref_depths, gt_alt_depths, gt_alt_freqs, gt_quals, gt_copy_numbers, gt_phred_ll_homref, gt_phred_ll_het, gt_phred_ll_homalt, call_rate, max_aaf_all, in_dbsnp, rs_ids, sv_cipos_start_left, sv_cipos_end_left, sv_cipos_start_right, sv_cipos_end_right, sv_length, sv_is_precise, sv_tool, sv_evidence_type, sv_event_id, sv_mate_id, sv_strand, in_omim, clinvar_sig, clinvar_disease_name, clinvar_dbsource, clinvar_dbsource_id, clinvar_origin, clinvar_dsdb, clinvar_dsdbid, clinvar_disease_acc, clinvar_in_locus_spec_db, clinvar_on_diag_assay, clinvar_causal_allele, clinvar_gene_phenotype, geno2mp_hpo_ct, pfam_domain, cyto_band, rmsk, in_cpg_island, in_segdup, is_conserved, gerp_bp_score, gerp_element_pval, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, hwe, inbreeding_coeff, pi, recomb_rate, gene, transcript, is_exonic, is_coding, is_splicing, is_lof, exon, codon_change, aa_change, aa_length, biotype, impact, impact_so, impact_severity, polyphen_pred, polyphen_score, sift_pred, sift_score, anc_allele, rms_bq, cigar, depth, strand_bias, rms_map_qual, in_hom_run, num_mapq_zero, num_alleles, num_reads_w_dels, haplotype_score, qual_depth, allele_count, allele_bal, in_hm2, in_hm3, is_somatic, somatic_score, in_esp, aaf_esp_ea, 
aaf_esp_aa, aaf_esp_all, exome_chip, in_1kg, aaf_1kg_amr, aaf_1kg_eas, aaf_1kg_sas, aaf_1kg_afr, aaf_1kg_eur, aaf_1kg_all, grc, gms_illumina, gms_solid, gms_iontorrent, in_cse, encode_tfbs, "encode_dnaseI_cell_count", "encode_dnaseI_cell_list", encode_consensus_gm12878, encode_consensus_h1hesc, encode_consensus_helas3, encode_consensus_hepg2, encode_consensus_huvec, encode_consensus_k562, vista_enhancers, cosmic_ids, info, cadd_raw, cadd_scaled, fitcons, in_exac, aaf_exac_all, aaf_adj_exac_all, aaf_adj_exac_afr, aaf_adj_exac_amr, aaf_adj_exac_eas, aaf_adj_exac_fin, aaf_adj_exac_nfe, aaf_adj_exac_oth, aaf_adj_exac_sas, exac_num_het, exac_num_hom_alt, exac_num_chroms, aaf_gnomad_all, aaf_gnomad_afr, aaf_gnomad_amr, aaf_gnomad_asj, aaf_gnomad_eas, aaf_gnomad_fin, aaf_gnomad_nfe, aaf_gnomad_oth, aaf_gnomad_sas, gnomad_num_het, gnomad_num_hom_alt, gnomad_num_chroms, vep_canonical, vep_ccdsparameters: (u'chr9', 135781072, 135781077, u'rs118203595', 55004, 1, u'GCTTT', u'G', 0.0, None, 'indel', 'del', <read-only buffer for 0xcff3060, size -1, offset 0 at 0x2ab47f13bef0>, <read-only buffer for 0xcff3120, size -1, offset 0 at 0x2ab47f13bf30>, <read-only buffer for 0xcff31e0, size -1, offset 0 at 0x2ab47f13bf70>, <read-only buffer for 0xcff32a0, size -1, offset 0 at 0x2ab47f13bfb0>, <read-only buffer for 0xcff3360, size -1, offset 0 at 0x2ab47f1e9030>, <read-only buffer for 0xcff3420, size -1, offset 0 at 0x2ab47f1e9070>, <read-only buffer for 0xcff34e0, size -1, offset 0 at 0x2ab47f1e90b0>, <read-only buffer for 0xcff35a0, size -1, offset 0 at 0x2ab47f1e90f0>, <read-only buffer for 0xcff3660, size -1, offset 0 at 0x2ab47f1e9130>, <read-only buffer for 0xcff98b0, size -1, offset 0 at 0x2ab47f1e9170>, <read-only buffer for 0xcff9970, size -1, offset 0 at 0x2ab47f1e91b0>, <read-only buffer for 0xcff9a30, size -1, offset 0 at 0x2ab47f1e91f0>, 1.0, -1.0, 1, 'rs118203595', None, None, None, None, None, 1, None, None, None, None, None, 0, 'not-provided,pathogenic', 
'Tuberous_sclerosis_1|Tuberous_sclerosis_syndrome|not_provided', 'OMIM_Allelic_Variant|Tuberous_sclerosis_database_(TSC1)|Tuberous_sclerosis_database_(TSC1)', '605284.0001|TSC1_00116|TSC1_00116\xc2\xa0', 'germline', 'MedGen:OMIM|MedGen:Orphanet:SNOMED_CT|MedGen', 'C1854465:191100|C0041341:ORPHA805:7199000|CN221809', 'RCV000005403.3|RCV000042099.2|RCV000189868.2', 1, 0, 'G', u'adenoma_sebaceum|autism_spectrum_disorders|cardiac_rhabdomyoma|cortical_dysplasia|cortical_tubers|focal_cortical_dysplasia_of_taylor|focal_cortical_ ... (104 characters truncated) ... nant_tumor_of_urinary_bladder|multiple_renal_cysts|renal_cortical_cysts|renal_insufficiency|seizures|tuberous_sclerosis_1|tuberous_sclerosis_syndrome', -1, 'Hamartin', 'chr9q34.13', None, 0, 0, 1, None, 1.54009e-218, 1, 0, 0, 0, 0.0, 1.0, None, 0.0, 0.070327, u'TSC1', u'ENST00000298552', 1, 1, 0, 1, u'15/23', u'AAAGca/ca', u'KA/X', u'630-631/1164', u'protein_coding', u'frameshift_variant', u'frameshift_variant', 'HIGH', u'', None, u'', None, None, None, u'1M4D', None, None, 60.0, None, None, None, None, None, None, None, None, None, None, None, None, 0, -1.0, -1.0, -1.0, 0, 0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, None, None, None, None, 0, None, None, None, 'T', 'T', 'T', 'T', 'T', 'T', None, None, <read-only buffer for 0xcff36a0, size -1, offset 0 at 0x2ab47f1e9230>, None, None, 0.706548, 0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1, -1, -1, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1, -1, -1, u'YES', u'CCDS6956.1')]
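
Note on the dump above: the parameter list includes the byte string '605284.0001|TSC1_00116|TSC1_00116\xc2\xa0', which by position appears to line up with the clinvar_dbsource_id column. The trailing bytes 0xC2 0xA0 are the UTF-8 encoding of U+00A0 (a non-breaking space), and a non-ASCII byte string of this kind is what the sqlite3 message objects to under Python 2. Below is a minimal sketch of the "switch to Unicode strings" advice from the error message, written for Python 3 purely as an illustration. The to_text helper and the demo table are hypothetical names and are not part of gemini.

```python
import sqlite3

# Value copied from the parameter dump above: a byte string ending in
# 0xC2 0xA0, the UTF-8 encoding of U+00A0 (non-breaking space).
raw = b"605284.0001|TSC1_00116|TSC1_00116\xc2\xa0"


def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; return any other value unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    return value


# Hypothetical one-column table, used only to show the round trip.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (clinvar_dbsource_id TEXT)")
conn.execute("INSERT INTO demo VALUES (?)", (to_text(raw),))
stored = conn.execute("SELECT clinvar_dbsource_id FROM demo").fetchone()[0]
conn.close()

print(repr(stored))
```

Decoding before the insert means the driver receives text rather than raw bytes. Whether gemini should decode, strip, or reject such values is a separate decision that this sketch does not settle.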

Here's the offending line from the vcf file:

9 135781073 rs118203595 GCTTT G 0 PASS CIGAR=1M4D;RU=CTTT;REFREP=2;IDREP=1;MQ=60;clinvar=1|pathogenic,1|not_provided,1|pathogenic;cosmic=1|COSM5010413;CSQT=1|TSC1|ENST00000298552.3|frameshift_variant,1|TSC1|NM_000368.4|frameshift_variant;CSQ=frameshift_variant|AAAGca/ca|KA/X|ENSG00000165699|TSC1|ENST00000298552|15/23|||630-631/1164|protein_coding|YES|CCDS6956.1,frameshift_variant|AAAGca/ca|KA/X|ENSG00000165699|TSC1|ENST00000440111|13/21|||630-631/1164|protein_coding||CCDS6956.1,downstream_gene_variant|||ENSG00000165699|TSC1|ENST00000493467|||||retained_intron||,frameshift_variant|AAAGca/ca|KA/X|ENSG00000165699|TSC1|ENST00000545250|12/20|||579-580/1113|protein_coding||CCDS55350.1,frameshift_variant|AAAGca/ca|KA/X|7248|TSC1|NM_000368.4|15/23|||630-631/1164|protein_coding||,frameshift_variant|AAAGca/ca|KA/X|7248|TSC1|NM_001162426.1|15/23|||629-630/1163|protein_coding||,frameshift_variant|AAAGca/ca|KA/X|7248|TSC1|NM_001162427.1|14/22|||579-580/1113|protein_coding||,frameshift_variant|AAAGca/ca|KA/X|7248|TSC1|XM_005272211.1|15/23|||630-631/1164|protein_coding|YES|,frameshift_variant|AAAGca/ca|KA/X|7248|TSC1|XM_005272212.1|14/22|||630-631/1164|protein_coding|| GT:GQ:GQX:DPI:AD:ADF:ADR:FT:PL 0/0:112:112:41:40,0:24,0:16,0:PASS:0,115,690

I can't figure out what in the vcf file is preventing it from being loaded into gemini. I googled the bytestring error and it seems to have something to do with compression when storing values as a blob. Here's a relevant post on Stack Overflow.
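
One way to narrow this kind of problem down is to look for bytes outside the ASCII range in the input or annotation files, since the value '\xc2\xa0' in the parameter dump is such a sequence. The following is a minimal diagnostic sketch, assuming an uncompressed text file. The find_non_ascii function is a hypothetical helper and the demo file is a throwaway, not real data.

```python
import os
import tempfile


def find_non_ascii(path):
    """Yield (line_number, byte_offset, byte_value) for every byte >= 0x80."""
    with open(path, "rb") as handle:
        for line_number, line in enumerate(handle, start=1):
            for offset, byte in enumerate(line):
                if byte >= 0x80:
                    yield line_number, offset, byte


# Throwaway demo file whose second line ends in the bytes 0xC2 0xA0.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"plain ascii line\nTSC1_00116\xc2\xa0\n")
    demo_path = tmp.name

hits = list(find_non_ascii(demo_path))
os.remove(demo_path)

for line_number, offset, byte in hits:
    print("line %d, offset %d: 0x%02X" % (line_number, offset, byte))
```

For a gzip-compressed file, opening it with gzip.open(path, "rb") in place of open should behave the same way. Note that the VCF line quoted above shows no such byte, so the value may come from an annotation source rather than from the VCF itself.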

I would really appreciate any help or insight into this issue and how I can fix my vcf so that it is compatible with gemini.

Thanks,

Moiz

brentp commented 6 years ago

Hi, can you verify that you have the latest version of gemini?

mbootwalla commented 6 years ago

Hi Brent,

The version of gemini that I am using is 0.20.1.

Thanks,

Moiz

brentp commented 6 years ago

this should be fixed in master, could you give it a try?

mbootwalla commented 6 years ago

Hi Brent,

I upgraded to the latest devel version of gemini, 0.20.2-dev (both gemini and the associated data), but I still get the same error at the same variant. Additionally, I see the following warning pop up multiple times for the same set of consequences:

WARNING: unknown severity for 'frameshift_variant&start_lost&start_retained_variant|Atg/tg|M/X|388697|HRNR|NM_001009931.2|2/3|||1/2850|protein_coding|YES|'. using LOW for [u'frameshift_variant', u'start_lost', u'start_retained_variant']
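
The warning shows a consequence field made of several terms joined by '&', with LOW used as the fallback. As a rough illustration of one alternative, the sketch below splits the field and takes the most severe term. The severity table is an assumption made up for this example and is not gemini's actual mapping; the only value taken from this thread is HIGH for frameshift_variant, which appears in the parameter dump above.

```python
# Ranking used to compare severity labels (illustrative only).
SEVERITY_RANK = {"LOW": 0, "MED": 1, "HIGH": 2}

# Hypothetical lookup table; not taken from gemini's source.
EXAMPLE_SEVERITY = {
    "frameshift_variant": "HIGH",      # matches the 'HIGH' seen in the dump above
    "start_lost": "HIGH",              # assumption for this sketch
    "start_retained_variant": "LOW",   # assumption for this sketch
}


def most_severe(consequence_field, default="LOW"):
    """Split an '&'-joined consequence field and return the highest severity."""
    terms = consequence_field.split("&")
    severities = [EXAMPLE_SEVERITY.get(term, default) for term in terms]
    return max(severities, key=SEVERITY_RANK.__getitem__)


combined = "frameshift_variant&start_lost&start_retained_variant"
print(most_severe(combined))
```

Terms missing from the table fall back to the default, which mirrors the "using LOW" behaviour in the warning.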

Let me know if you need any further information to help debug this.

Thanks,

Moiz

mbootwalla commented 6 years ago

Hi @brentp. Any updates regarding the above issue?