genome-nexus / genome-nexus-annotation-pipeline

Library and tool for annotating MAF files using Genome Nexus Webserver API
MIT License

Allow not normalizing of bases in the MAF with command line option #232

Closed inodb closed 1 year ago

inodb commented 1 year ago

This is related to this issue:

https://github.com/mskcc/vcf2maf/issues/279

Occasionally you might want to keep the ref/alt allele bases as they are, because they give you more information about the surrounding bases. There are three options for the base normalization:

  1. Strip off only the very first matching base from ref+alt (first)
  2. Strip off all matching starting bases from ref+alt (all) -- this is the current behavior
  3. Don't do any harmonization, but still store all the appropriate fields as if the variant were normalized (none)
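
To make the three modes concrete, here is a minimal sketch in Python. The function name `strip_matching_bases` and its signature are hypothetical and are not taken from either tool. It assumes `pos` is the 1-based position of the first base of `ref` (VCF-style) and simply shifts it by the number of stripped bases; the MAF Start_Position/End_Position conventions for insertions are not modeled, and matching ending bases are left alone.

```python
def strip_matching_bases(pos, ref, alt, mode="all"):
    """Hypothetical helper: strip shared leading bases from ref/alt.

    pos is the 1-based position of the first base of ref.
    Returns (pos, ref, alt); an allele that becomes empty is written
    as "-", the MAF placeholder for an empty allele.
    """
    if mode not in ("first", "all", "none"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "none":
        # Keep the alleles exactly as given.
        return pos, ref, alt
    # Upper bound on how many leading bases may be removed.
    limit = 1 if mode == "first" else min(len(ref), len(alt))
    n = 0
    while n < limit and n < len(ref) and n < len(alt) and ref[n] == alt[n]:
        n += 1
    ref, alt = ref[n:], alt[n:]
    return pos + n, ref or "-", alt or "-"


# An insertion of T written with two extra bases of context:
print(strip_matching_bases(100, "GCA", "GCAT", "first"))  # (101, 'CA', 'CAT')
print(strip_matching_bases(100, "GCA", "GCAT", "all"))    # (103, '-', 'T')
print(strip_matching_bases(100, "GCA", "GCAT", "none"))   # (100, 'GCA', 'GCAT')
```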

This could be relevant for both annotation-tools and genome-nexus-annotation-pipeline. The former does the vcf2maf conversion, but the latter also harmonizes bases (the API returns a harmonized version of chrom/pos/ref/alt). We should probably add options to both tools for this, so the annotation pipeline could have an option like this:

--strip-matching-bases {first,all,none}

And the annotation-tools could have something like:

--strip-matching-bases {first,all}

For annotation-tools it probably doesn't make sense to have the "none" option, since you are starting from the VCF file, which by definition lists the additional base in ref and alt for indels.
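
As a rough illustration of the option shape only, the proposed flag could be declared like this with Python's `argparse`. This is not the actual CLI of either tool; `build_parser` and its `allow_none` parameter are made up for the example. The default is "all" because that is the current behavior.

```python
import argparse


def build_parser(allow_none=True):
    """Hypothetical parser factory for the proposed flag.

    allow_none=True  -> choices as proposed for the annotation pipeline
    allow_none=False -> choices as proposed for annotation-tools
    """
    choices = ["first", "all"] + (["none"] if allow_none else [])
    parser = argparse.ArgumentParser(
        description="Sketch of the proposed base-stripping option"
    )
    parser.add_argument(
        "--strip-matching-bases",
        choices=choices,
        default="all",  # "all" is the current behavior
        help="how many matching leading bases to strip from ref/alt",
    )
    return parser


args = build_parser().parse_args(["--strip-matching-bases", "none"])
print(args.strip_matching_bases)  # none
```

With `allow_none=False`, passing `none` is rejected by argparse with a usage error, which matches the idea that the VCF-based tool only offers `first` and `all`.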

Note that the issue with using "first" is that if you run the MAF through multiple times, it will change every time until all matching bases are stripped off. This is not a big deal if you start from the source VCF, which is how it works for most internal pipelines at MSK, but it can be an issue when you use the MAF as the source-of-truth file. Some way to capture immutable genomic locations was implemented previously but never merged, so it might be good to revisit that. Another option is to add a feature like that to the conversion script from VCF to MAF, i.e. add the original VCF fields to the resulting MAF to make sure you don't lose the source of truth. Then whenever you re-annotate, you use the source-of-truth fields rather than the potentially harmonized fields.
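
A small sketch of why "first" is not stable across repeated runs, and how preserved source fields would avoid that. Everything here is illustrative: the helper names and the `vcf_pos`/`vcf_ref`/`vcf_alt` keys are hypothetical, not existing MAF columns, and an emptied allele is kept as an empty string for simplicity.

```python
def strip_first(pos, ref, alt):
    """Strip at most one shared leading base (the "first" mode)."""
    if ref and alt and ref[0] == alt[0]:
        return pos + 1, ref[1:], alt[1:]
    return pos, ref, alt


def passes_until_stable(pos, ref, alt):
    """Re-apply strip_first until nothing changes; return every state seen."""
    states = [(pos, ref, alt)]
    while True:
        nxt = strip_first(*states[-1])
        if nxt == states[-1]:
            return states
        states.append(nxt)


def reannotate(record):
    """Normalize from the preserved source fields, never from pos/ref/alt."""
    pos, ref, alt = strip_first(
        record["vcf_pos"], record["vcf_ref"], record["vcf_alt"]
    )
    return {**record, "pos": pos, "ref": ref, "alt": alt}


# Feeding the output back in changes the record on every pass:
for state in passes_until_stable(100, "GCA", "GCAT"):
    print(state)
# (100, 'GCA', 'GCAT')
# (101, 'CA', 'CAT')
# (102, 'A', 'AT')
# (103, '', 'T')

# Re-annotating from the preserved fields gives the same result every time:
record = {"vcf_pos": 100, "vcf_ref": "GCA", "vcf_alt": "GCAT"}
once = reannotate(record)
twice = reannotate(once)
print(once == twice)  # True
```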

Note: need to figure out what to do with matching ending bases