sci-kai closed 1 year ago
I wanted to write out the issues properly and could add a "fix" myself, but wanted your response first :)
Hi @feiloo. Thanks for all the helpful comments! I improved my Python script and rewrote most of it to improve separation of concerns and readability. I tried to address most of your comments. Can you have a look at it again?
Hi, the following features are added in this PR:

- Added `format_field` and `info_fields`, and renamed `extraction_field` to `annotation_fields`. Now you can give the keys for the VCF FORMAT and INFO column fields to extract them as well, while the CSQ annotation is still extracted from the INFO field separately.

I also want to suggest the following defaults:

- `allele_fraction: 'FORMAT_AD'`. Most VCF files have these two FORMAT fields.
- `annotation_fields: 'all'`. Extracts all annotation fields by default.
- `format_fields: GT, AD[0], AD[1], DP`. Gives the genotype, the read counts for the REF (`[0]`) and ALT (`[1]`) alleles, and the depth (coverage).
- `info_fields: null`.

We could add a feature to extract all FORMAT and INFO fields, but this may be more complicated than for the CSQ fields and is also not necessary, since most of the other information is very complex and not necessarily needed by the interpreter. If it is needed, they can add those fields manually through these parameters.
I tested the code manually with my test dataset and several combinations of the parameters.