Add allele fraction and coverage

sci-kai commented 1 year ago

Hi, the following features are added in this PR:

TSV conversion now has the additional parameters format_field and info_fields, and extraction_field has been renamed to annotation_fields. You can now give keys from the VCF FORMAT and INFO columns to extract those fields as well, while the CSQ annotation is still extracted from the INFO field separately.
Reshaped the formatting of the vembrane fields, which is now done in a Python script.
Another new parameter, allele_fraction, is added with several options to extract or calculate allele fractions:
- FORMAT_AD: calculate from FORMAT column by dividing AD with DP fields
- FORMAT_AF: extracting directly from AF field in FORMAT column
- mutect2: same as FORMAT_AF, additionally extracts DP from the INFO field (which may be a bit higher due to filtering of uninformative reads).
- freebayes: same as FORMAT_AD
- strelka: similar to FORMAT_AD, but additionally dividing AD by DPI, a field with the coverage for InDels
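
To make the options above concrete, here is a minimal Python sketch of how they could map to a calculation. It is not the code from this PR: the function name `allele_fraction`, the dict-based record layout, and returning the coverage alongside the fraction are assumptions for illustration. The strelka branch reflects one reading of the description (use DPI as the depth when it is present, otherwise DP).

```python
from typing import Optional, Tuple


def allele_fraction(mode: str, fmt: dict, info: Optional[dict] = None) -> Tuple[Optional[float], Optional[int]]:
    """Return (allele_fraction, coverage) for one sample of one VCF record.

    `fmt` holds the sample's FORMAT values and `info` the record's INFO values,
    both as plain dicts (hypothetical layout, not the PR's data model).
    """
    info = info or {}

    def alt_reads() -> Optional[int]:
        # AD is expected as [REF count, ALT count]; only the first ALT is used.
        ad = fmt.get("AD")
        return ad[1] if ad is not None and len(ad) > 1 else None

    def ratio(reads: Optional[int], depth: Optional[int]) -> Optional[float]:
        # Missing values or zero depth give no allele fraction.
        if reads is None or not depth:
            return None
        return reads / depth

    if mode in ("FORMAT_AD", "freebayes"):
        depth = fmt.get("DP")
        return ratio(alt_reads(), depth), depth
    if mode == "FORMAT_AF":
        return fmt.get("AF"), fmt.get("DP")
    if mode == "mutect2":
        # AF comes from FORMAT, coverage from the INFO DP field.
        return fmt.get("AF"), info.get("DP")
    if mode == "strelka":
        # Assumption: prefer the InDel depth DPI when available, else DP.
        depth = fmt.get("DPI")
        if depth is None:
            depth = fmt.get("DP")
        return ratio(alt_reads(), depth), depth
    raise ValueError(f"unknown allele_fraction mode: {mode!r}")
```

For example, `allele_fraction("FORMAT_AD", {"AD": [30, 10], "DP": 40})` divides 10 ALT reads by a depth of 40.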

Also I want to suggest the following defaults:

allele_fraction: 'FORMAT_AD'. Most VCF files have the two FORMAT fields it needs (AD and DP).
annotation_fields: 'all'. Extract all annotation fields by default
format_fields: GT, AD[0], AD[1], DP. Gives the genotype, the read counts for the REF [0] and ALT [1] alleles, and the depth (coverage)
info_fields: null.
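
As an illustration of how entries such as `AD[0]` in the suggested format_fields default could be interpreted, here is a small Python sketch. The helpers `parse_field_spec` and `extract_fields` are hypothetical names and are not taken from the PR's script; the sketch assumes each entry is a key with an optional integer index in square brackets.

```python
import re
from typing import Optional, Tuple

# A key such as "DP", optionally followed by an index such as "[1]".
_FIELD_SPEC = re.compile(r"\s*(\w+)(?:\[(\d+)\])?\s*")


def parse_field_spec(spec: str) -> Tuple[str, Optional[int]]:
    """Split 'AD[1]' into ('AD', 1) and 'GT' into ('GT', None)."""
    match = _FIELD_SPEC.fullmatch(spec)
    if match is None:
        raise ValueError(f"cannot parse field spec: {spec!r}")
    key, index = match.groups()
    return key, int(index) if index is not None else None


def extract_fields(values: dict, specs: list) -> dict:
    """Look up each spec in a dict of FORMAT (or INFO) values.

    Missing keys or out-of-range indices yield None instead of raising.
    """
    result = {}
    for spec in specs:
        key, index = parse_field_spec(spec)
        value = values.get(key)
        if value is not None and index is not None:
            value = value[index] if index < len(value) else None
        result[spec.strip()] = value
    return result
```

With the suggested defaults, `extract_fields(sample, ["GT", "AD[0]", "AD[1]", "DP"])` would return one column per entry, keyed by the spec as written.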

We could add a feature to extract all FORMAT and INFO fields, but this may be more complicated than for the CSQ fields and is also not necessary, as most of the other information is very complex and not necessarily needed by the interpreter. If it is needed, they can add the fields manually through these parameters.

I tested the code manually with my test dataset and several combinations of the parameters.

feiloo commented 1 year ago

I wanted to write out the issues properly and could add a "fix" myself, but wanted your response first :)

sci-kai commented 1 year ago

Hi @feiloo. Thanks for all the helpful comments! I improved my Python script and rewrote most of it to increase the separation of concerns and readability. I tried to address most of your comments. Can you have a look at it again?

cio-abcd / variantinterpretation

Add allele fraction and coverage #19