KarchinLab / open-cravat

A modular annotation tool for genomic variants

Numeric values in VCF file are not parsed properly #109

Closed: bogdanovvp closed this issue 1 year ago

bogdanovvp commented 2 years ago

Uploading VCFs to OpenCRAVAT seems to result in incorrect parsing of the numeric values (they are likely parsed as strings), which hinders filtering.

The header of the VCF file is attached:

fileformat=VCFv4.2
FILTER=
FILTER=
FILTER=
FILTER=
INFO=
FORMAT=
FORMAT=
FORMAT=
FORMAT=
FORMAT=
FORMAT=
FORMAT=
FORMAT=

bogdanovvp commented 2 years ago

Upd: this seems to be an issue in the .sqlite generation. Changing the following column types within the sqlite file:

variant / vcfinfo__phred: "text" -> "real"
variant / vcfinfo__alt_reads: "text" -> "integer"
variant / vcfinfo__tot_reads: "text" -> "integer"
variant / vcfinfo__af: "text" -> "real"

and correcting the "type" values in the respective dictionaries in the "variant_header" table fixes the issue. The respective change should be implemented in the generating code.
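
To illustrate why a "text" column breaks numeric filtering, here is a minimal sketch using Python's sqlite3 module on a toy in-memory table. Only the table name variant and the column name vcfinfo__alt_reads come from this thread; the rows, the threshold, and the helper names are made up for illustration, and the real OpenCRAVAT result database has many more columns.

```python
import sqlite3

def build_toy_db():
    # Toy stand-in for the result database: the read count is stored as TEXT,
    # as described in this issue. The three rows are invented for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE variant (vcfinfo__alt_reads TEXT)")
    conn.executemany(
        "INSERT INTO variant VALUES (?)", [("9",), ("15",), ("100",)]
    )
    return conn

def reads_above(conn, threshold, cast=False):
    # With cast=False the TEXT column is compared as a string;
    # with cast=True the value is converted to an integer first.
    column = (
        "CAST(vcfinfo__alt_reads AS INTEGER)" if cast else "vcfinfo__alt_reads"
    )
    sql = f"SELECT vcfinfo__alt_reads FROM variant WHERE {column} > {int(threshold)}"
    return [row[0] for row in conn.execute(sql)]

conn = build_toy_db()
print(reads_above(conn, 20))             # ['9']   string comparison: '9' > '20'
print(reads_above(conn, 20, cast=True))  # ['100'] numeric comparison
```

In this toy case the string comparison keeps the wrong row and drops the right one, which matches the kind of filtering problem reported above.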

bogdanovvp commented 2 years ago

Upd2: this recent pull request generally solves the issue https://github.com/KarchinLab/open-cravat-modules-karchinlab/pull/11

kmoad commented 1 year ago

Hi bogdanovvp. Thanks a lot for the digging here, and the PR.

Unfortunately, the changes won't work for some jobs. For variants found in more than one sample, those columns are semicolon-delimited lists, and have to be strings. We are currently planning work on better sample/cohort filtering.

For example, consider a variant in two samples: s1, and s2. The base__sample_id column will be s1;s2, and vcfinfo__alt_reads will be something like 15;28.
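
As a rough sketch of how such a multi-sample value could be unpacked on the client side, the hypothetical helper below splits the two strings and pairs them up. It assumes the values appear in the same order as the sample IDs, which this thread does not state explicitly; split_per_sample is not part of OpenCRAVAT.

```python
def split_per_sample(sample_ids, values, cast=int):
    # Split both semicolon-delimited strings and pair them positionally.
    # Assumption: the i-th value belongs to the i-th sample ID.
    ids = sample_ids.split(";")
    vals = values.split(";")
    if len(ids) != len(vals):
        raise ValueError("sample IDs and values have different lengths")
    return {sample_id: cast(value) for sample_id, value in zip(ids, vals)}

per_sample = split_per_sample("s1;s2", "15;28")
print(per_sample)  # {'s1': 15, 's2': 28}
```

For a column such as vcfinfo__af, passing cast=float would give real-valued results instead of integers.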

If you look into the sample table, the column values are better. base__alt_reads is integer, base__tot_reads is integer, and base__af is real. If it's possible for you to query the db directly, you could try that. Or, if you know there's only one sample, the change in your PR works great. But it won't work as a general fix.
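
A minimal sketch of querying the sample table directly might look like the following. It runs against a toy in-memory table so that it is self-contained. The columns base__alt_reads, base__tot_reads and base__af are the ones named above; base__uid and base__sample_id in this table, the rows, and the function name are assumptions made for illustration.

```python
import sqlite3

# Toy stand-in for the sample table; in practice you would connect to the
# job's result .sqlite file instead of ":memory:".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample ("
    "base__uid INTEGER, base__sample_id TEXT, "
    "base__alt_reads INTEGER, base__tot_reads INTEGER, base__af REAL)"
)
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?, ?, ?)",
    [
        (1, "s1", 15, 40, 0.375),
        (1, "s2", 28, 50, 0.56),
        (2, "s1", 3, 60, 0.05),
    ],
)

def well_supported(conn, min_alt_reads, min_af):
    # Numeric comparisons behave as expected here because the columns
    # are typed integer/real rather than text.
    sql = (
        "SELECT base__uid, base__sample_id FROM sample "
        "WHERE base__alt_reads >= ? AND base__af >= ? "
        "ORDER BY base__uid, base__sample_id"
    )
    return conn.execute(sql, (min_alt_reads, min_af)).fetchall()

print(well_supported(conn, 10, 0.5))  # [(1, 's2')]
```

Because each row of this table holds one sample's values, no semicolon splitting is needed before filtering.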

We're working on better filtering, and are gathering use-cases. If you're willing to discuss more, I'm interested to know what you're trying to use these columns for.

kmoad commented 1 year ago

This is fixed for single-sample vcfs here https://github.com/KarchinLab/open-cravat/issues/149