Open davmlaw opened 5 months ago
I raised an issue for <DUP>
but in the meantime - perhaps we should just convert symbolic variants into explicit before normalization?
This has been fixed in bcftools, I think we can replace vt with bcftools in vcf_preprocess now:
http://samtools.github.io/bcftools/howtos/install.html
Should be able to look at file history or commented out bits to add it back
Working in branch feature/issue_1014_bcftools_normalize
TODO:
--check-ref=w
dies due to contigs not being mentioned - need to handle it eg test data "17:7676584 G>AA"~REF_MISMATCH NC_000017.10 7676584 G C
I should also write some tests for all of the different SV norm/unique fixes done and to just make sure they stay fixed and to they make it through an import pipeline
Need to modify ModifiedImportedVariant for bcftools
At the moment the modified imported variant is attached to "preprocess VCF" which doesn't have tool set. Will move it to "normalize" then we can determine whether it was done against VT or BCFTools going forward
@davmlaw Will this be turned off in the next deploy, if we're disabling BCF tools? And what is prod currently doing?
@EmmaTudini - there are 2 separate things here:
VT was the first tool to introduce normalization but it does not support symbolic variants at all, has a number of open issues, and hasn't made a release since 2018
BCFTools is an extremely popular package (samtools is universally used in next gen sequencing) and now supports liftover. They have quickly added support for symbolic variants etc and fixed bugs when I raised them.
So, we are switching to BCFTools for normalization - there is no turning this on/off it's the only one we will use going foward, it should be the same as VT though (except for working correctly for Structural Variant Normalization - ie this issue)
For liftover, we can enable/disable using "BCFTools +liftover" (a plugin) as our fallback liftover method via settings. It's currently turned off in the settings, plan is to enable it in prod one after next release
@davmlaw for my understanding, can you show me in relation to something like https://test.shariant.org.au/classification/hgvs_resolution_tool?genome_build=GRCh38&hgvs=NM_033071.3%28SYNE1%29%3Ac.21711G%3EA where BCFTools comes into it? As opposed to what pyhgvs or BioCommons does?
I see in the liftover pipeline there's a step normalize
that uses bcf tools, and a step vcf_clean_and_filter
that also uses it.
And previously a similar step was being performed by VT?
So does pyhgvs or BioCommons resolve to a variant coordinate, then bcf tools resolves that variant coordinate to its more normalised consistent format (and in many cases would leave it alone if it's already in its best representation)?
BCFTools is doing what VT did - normalize VCF files (left aligned)
HGVS standard requires normalization (3' - ie "stranded downstream") which is done in its own libraries (pyHGVS and Biocommons)
I don't think we need to fix this as no prod environment would have ingested symbolic variants yet, but I found one in vg test that got moved after normalization:
NC_000021.9:g.43095483_43104346del
4966530 - existing one
5117588 - g.HGVS is NC_000021.9:g.5001137_5010000del records being "normalized by bcftools NC_000021.9|43095482|A|<DEL>"
Discussed in Shariant meeting - to test bcftools normalization difference vs existing ones with VT we should dump out variants from DB, run through bcftools and look for changes. This has been split off issue for testing: SACGF/variantgrid_private#3658
Other than the above, testing for this involves:
In Shariant testing, we found this:
https://github.com/samtools/bcftools/issues/2216
Will be able to work around it by always writing an END for symbolic variants
I switched from using END to SVLEN on 02/21/2024 in SACGF/variantgrid#991
I tested whether it was the same in VEP
I also used test files in bcftools normalization which were generated with END from before this change (it took a long time from me raising the issue with bcftools for it to make it into VG)
Just need to put END back
Rework to add END to write_vcf_from_tuples
This method has grown to be really unwieldy by tacking stuff on - it takes 2 different types of tuples. All callers to this use VariantCoordinate instead of tuples EXCEPT create liftover pipelines (which passes tuples_have_id_field=True)
We should change it to just take VariantCoordinate, and maybe a parallel list of IDs (or maybe add an ID to VariantCoordinate?)
But - need to change all the weird different kinds of tuples in the liftover code to even try this:
Allele.get_liftover_tuple
snpdb.liftover._get_build_liftover_tuples
OK, have done refactoring to go full in on VariantCoordinate and get rid of weird mixed tuples
Working in brach feature/issue_1014_refactor_variant_tuple_add_end
Need to test a whole bunch of stuff (see commit)
ok, fixed going foward
Still need to find/fix historical ones. It may be a real pain to fix these, may be better to just delete them manually as it has only been on test/dev systems. Something like:
from snpdb.models import Variant
from annotation.models import AnnotationRangeLock
bad_norm = Variant.objects.filter(modifiedimportedvariant__isnull=False, locus__position=1, svlen__isnull=False)
for v in bad_norm:
AnnotationRangeLock.release_variant(v)
print(v.delete())
Will probably need something extra to handle imported alleles etc
@davmlaw Have commented on this issue https://github.com/SACGF/variantgrid_private/issues/3645#issuecomment-2208109928 - think the issues I've found might be related to this change
@davmlaw In case you haven't seen, I've raised another comment on https://github.com/SACGF/variantgrid_private/issues/3645
I found a bug due to my incorrect calculation of Variant.END (needed by bcftools) - done in SACGF/variantgrid_private#3676
During/after that have been testing bcftools normalization pretty thoroughly and seems to all work fine
Re-opening this issue, to help with my testing
VT doesn't handle symbolic alts, perhaps we should switch to using bcftools
I raised this as a problem and they only fixed it for DELs
https://github.com/samtools/bcftools/issues/1919
Find out the version affected, we should probably also raise an issue for DUPs