arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

Excessive warning messages on extra VEP column loading in 0.16.3 #513

Open ddkinnamon opened 9 years ago

ddkinnamon commented 9 years ago

Hi,

I have noticed a couple of issues with excessive warnings when loading extra VEP columns in GEMINI 0.16.3. I see that the routine calls the annotate tool to load extra VEP columns. This would be fine, except the annotate tool generates a warning message for each record and column whenever a missing value is encountered:

WARNING: vep_hgvs_offset is missing from INFO field in /tmp/pbstmp.4382548/extra.four.hundred.exomes.pp.ann.vcf.chunk0.db.vcf.gz for at least one record.
WARNING: vep_motif_score_change is missing from INFO field in /tmp/pbstmp.4382548/extra.four.hundred.exomes.pp.ann.vcf.chunk0.db.vcf.gz for at least one record.
WARNING: vep_hgvsc is missing from INFO field in /tmp/pbstmp.4382548/extra.four.hundred.exomes.pp.ann.vcf.chunk0.db.vcf.gz for at least one record.
WARNING: vep_pubmed is missing from INFO field in /tmp/pbstmp.4382548/extra.four.hundred.exomes.pp.ann.vcf.chunk0.db.vcf.gz for at least one record.
WARNING: vep_trembl is missing from INFO field in /tmp/pbstmp.4382548/extra.four.hundred.exomes.pp.ann.vcf.chunk0.db.vcf.gz for at least one record.
WARNING: vep_uniparc is missing from INFO field in /tmp/pbstmp.4382548/extra.four.hundred.exomes.pp.ann.vcf.chunk0.db.vcf.gz for at least one record.
WARNING: vep_distance is missing from INFO field in /tmp/pbstmp.4382548/extra.four.hundred.exomes.pp.ann.vcf.chunk0.db.vcf.gz for at least one record.
WARNING: vep_intron is missing from INFO field in /tmp/pbstmp.4382548/extra.four.hundred.exomes.pp.ann.vcf.chunk0.db.vcf.gz for at least one record.
WARNING: vep_ensp is missing from INFO field in /tmp/pbstmp.4382548/extra.four.hundred.exomes.pp.ann.vcf.chunk0.db.vcf.gz for at least one record.
WARNING: vep_cds_position is missing from INFO field in /tmp/pbstmp.4382548/extra.four.hundred.exomes.pp.ann.vcf.chunk0.db.vcf.gz for at least one record.
WARNING: vep_hgvsp is missing from INFO field in /tmp/pbstmp.4382548/extra.four.hundred.exomes.pp.ann.vcf.chunk0.db.vcf.gz for at least one record.
WARNING: vep_ccds is missing from INFO field in /tmp/pbstmp.4382548/extra.four.hundred.exomes.pp.ann.vcf.chunk0.db.vcf.gz for at least one record.

Although the warning is well-intentioned and useful for annotation files, its placement in the code leads to millions of lines of these messages on a multicore database load with hundreds of samples. I can parse the output to remove these warnings, but it might be better to move the warning message code so that it is output only once for a given column/database. In fact, it might also be useful to suppress it entirely when loading extra VEP fields, which are likely to have missing values. The only time I would want a warning is if the program found absolutely nothing in a particular VEP field.
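
The once-per-column behaviour suggested above could look roughly like the sketch below. It is only an illustration: `WarnOnce` and its `missing` method are hypothetical names, not part of GEMINI, and the sketch assumes the loader can call a helper each time it finds a record without a given INFO key.

```python
import sys


class WarnOnce:
    """Hypothetical helper: emit each (source, column) warning a single time."""

    def __init__(self, stream=sys.stderr):
        self._seen = set()      # (source, column) pairs already reported
        self._stream = stream

    def missing(self, column, source):
        """Report a missing INFO key; return True only if a warning was written."""
        key = (source, column)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._stream.write(
            "WARNING: %s is missing from INFO field in %s "
            "for at least one record.\n" % (column, source)
        )
        return True
```

With a helper like this, a chunk with thousands of records lacking `vep_pubmed` would produce one line for that column and file instead of one line per record.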

There is also another type of warning message that appears and may have to do with the merging of database chunks into a single database:

WARNING: Column "(vep_canonical)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_ccds)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_cdna_position)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_cds_position)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_clin_sig)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_distance)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_domains)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_ensp)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_existing_variation)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_feature_type)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_hgnc_id)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_hgvsc)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_hgvs_offset)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_hgvsp)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_high_inf_pos)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_impact)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_intron)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_motif_name)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_motif_pos)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_motif_score_change)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_pubmed)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_somatic)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_strand)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_swissprot)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_symbol_source)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_trembl)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_uniparc)" already exists in variants table. Overwriting values.
WARNING: Column "(vep_variant_class)" already exists in variants table. Overwriting values.

The data seem to load just fine, and I suspect that the only values being overwritten are NULLs. Nonetheless, it's a bit disconcerting to have these warnings in my logs. Please advise.

Thanks, Dan

brentp commented 9 years ago

agreed. we'll get this into the next release.

ddkinnamon commented 9 years ago

Am I correct in assuming that these are safe to ignore, then?

brentp commented 9 years ago

Hi Dan, the first set are safe to ignore. It's just that some variants don't have, e.g. 'vep_pubmed' defined and it was overzealously reporting all cases where that happened.

For the latter set, like WARNING: Column "(vep_canonical)" already exists in variants table. Overwriting values., you only see those once, correct? On a quick look, it seems that should only occur when you're overwriting an existing database.

ddkinnamon commented 9 years ago

Each WARNING: Column "(vep_*)" already exists in variants table. Overwriting values. appears 5 times when I am running with 6 cores. These warnings appear for the first time before the set of "missing from INFO field" warnings for chunk 1 and then once before these warnings for each subsequent chunk.
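
For anyone who wants to confirm counts like these from their own logs, here is a minimal sketch that tallies the "already exists" warnings per column. The function name `count_overwrite_warnings` is hypothetical, and the regular expression assumes the message format shown in this thread.

```python
import re
from collections import Counter

# Matches lines such as:
# WARNING: Column "(vep_ccds)" already exists in variants table. Overwriting values.
OVERWRITE = re.compile(
    r'^WARNING: Column "\((\w+)\)" already exists in variants table'
)


def count_overwrite_warnings(lines):
    """Return a Counter mapping column name to number of overwrite warnings."""
    counts = Counter()
    for line in lines:
        match = OVERWRITE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file (any iterable of lines) to the function gives the per-column totals, which can then be compared against the number of chunks in the load.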