SACGF / variantgrid

VariantGrid public repo
Other
23 stars 2 forks source link

Switch VT -> BCFtools for normalization - Structural Variant normalization #1014

Open davmlaw opened 5 months ago

davmlaw commented 5 months ago

VT doesn't handle symbolic alts, perhaps we should switch to using bcftools

I raised this as a problem and they only fixed it for DELs

https://github.com/samtools/bcftools/issues/1919

Find out the version affected, we should probably also raise an issue for DUPs

davmlaw commented 5 months ago

I raised an issue for <DUP> but in the meantime - perhaps we should just convert symbolic variants into explicit before normalization?

davmlaw commented 4 months ago

BCFTools added support for del/dup recently

It seems like both tools don't look for SVLEN when handling symbolic variants:

BCFTools

VT

So I have diisabled uniq for now

davmlaw commented 3 months ago

This has been fixed in bcftools, I think we can replace vt with bcftools in vcf_preprocess now:

http://samtools.github.io/bcftools/howtos/install.html

Should be able to look at file history or commented out bits to add it back

davmlaw commented 3 months ago

Working in branch feature/issue_1014_bcftools_normalize

TODO:

REF_MISMATCH    NC_000017.10    7676584 G   C
davmlaw commented 3 months ago

I should also write some tests for all of the different SV norm/unique fixes done and to just make sure they stay fixed and to they make it through an import pipeline

davmlaw commented 3 months ago

Need to modify ModifiedImportedVariant for bcftools

At the moment the modified imported variant is attached to "preprocess VCF" which doesn't have tool set. Will move it to "normalize" then we can determine whether it was done against VT or BCFTools going forward

EmmaTudini commented 3 months ago

@davmlaw Will this be turned off in the next deploy, if we're disabling BCF tools? And what is prod currently doing?

davmlaw commented 3 months ago

@EmmaTudini - there are 2 separate things here:

VT was the first tool to introduce normalization but it does not support symbolic variants at all, has a number of open issues, and hasn't made a release since 2018

BCFTools is an extremely popular package (samtools is universally used in next gen sequencing) and now supports liftover. They have quickly added support for symbolic variants etc and fixed bugs when I raised them.

So, we are switching to BCFTools for normalization - there is no turning this on/off it's the only one we will use going foward, it should be the same as VT though (except for working correctly for Structural Variant Normalization - ie this issue)

For liftover, we can enable/disable using "BCFTools +liftover" (a plugin) as our fallback liftover method via settings. It's currently turned off in the settings, plan is to enable it in prod one after next release

TheMadBug commented 2 months ago

@davmlaw for my understanding, can you show me in relation to something like https://test.shariant.org.au/classification/hgvs_resolution_tool?genome_build=GRCh38&hgvs=NM_033071.3%28SYNE1%29%3Ac.21711G%3EA where BCFTools comes into it? As opposed to what pyhgvs or BioCommons does?

I see in the liftover pipeline there's a step normalize that uses bcf tools, and a step vcf_clean_and_filter that also uses it.

And previously a similar step was being performed by VT?

So does pyhgvs or BioCommons resolve to a variant coordinate, then bcf tools resolves that variant coordinate to its more normalised consistent format (and in many cases would leave it alone if it's already in its best representation)?

davmlaw commented 2 months ago

BCFTools is doing what VT did - normalize VCF files (left aligned)

HGVS standard requires normalization (3' - ie "stranded downstream") which is done in its own libraries (pyHGVS and Biocommons)

davmlaw commented 2 months ago

I don't think we need to fix this as no prod environment would have ingested symbolic variants yet, but I found one in vg test that got moved after normalization:

NC_000021.9:g.43095483_43104346del

4966530 - existing one
5117588 - g.HGVS is NC_000021.9:g.5001137_5010000del records being "normalized by bcftools NC_000021.9|43095482|A|<DEL>"
davmlaw commented 2 months ago

Discussed in Shariant meeting - to test bcftools normalization difference vs existing ones with VT we should dump out variants from DB, run through bcftools and look for changes. This has been split off issue for testing: SACGF/variantgrid_private#3658

Other than the above, testing for this involves:

davmlaw commented 2 months ago

In Shariant testing, we found this:

https://github.com/samtools/bcftools/issues/2216

Will be able to work around it by always writing an END for symbolic variants

davmlaw commented 2 months ago

I switched from using END to SVLEN on 02/21/2024 in SACGF/variantgrid#991

I tested whether it was the same in VEP

I also used test files in bcftools normalization which were generated with END from before this change (it took a long time from me raising the issue with bcftools for it to make it into VG)

Just need to put END back

davmlaw commented 2 months ago

Rework to add END to write_vcf_from_tuples

This method has grown to be really unwieldy by tacking stuff on - it takes 2 different types of tuples. All callers to this use VariantCoordinate instead of tuples EXCEPT create liftover pipelines (which passes tuples_have_id_field=True)

We should change it to just take VariantCoordinate, and maybe a parallel list of IDs (or maybe add an ID to VariantCoordinate?)

But - need to change all the weird different kinds of tuples in the liftover code to even try this:

Allele.get_liftover_tuple
snpdb.liftover._get_build_liftover_tuples
davmlaw commented 2 months ago

OK, have done refactoring to go full in on VariantCoordinate and get rid of weird mixed tuples

Working in brach feature/issue_1014_refactor_variant_tuple_add_end

Need to test a whole bunch of stuff (see commit)

davmlaw commented 2 months ago

ok, fixed going foward

Still need to find/fix historical ones. It may be a real pain to fix these, may be better to just delete them manually as it has only been on test/dev systems. Something like:

from snpdb.models import Variant
from annotation.models import AnnotationRangeLock

bad_norm = Variant.objects.filter(modifiedimportedvariant__isnull=False, locus__position=1, svlen__isnull=False)
for v in bad_norm:
    AnnotationRangeLock.release_variant(v)
    print(v.delete())

Will probably need something extra to handle imported alleles etc

EmmaTudini commented 2 months ago

@davmlaw Have commented on this issue https://github.com/SACGF/variantgrid_private/issues/3645#issuecomment-2208109928 - think the issues I've found might be related to this change

EmmaTudini commented 2 months ago

@davmlaw In case you haven't seen, I've raised another comment on https://github.com/SACGF/variantgrid_private/issues/3645

davmlaw commented 1 month ago

I found a bug due to my incorrect calculation of Variant.END (needed by bcftools) - done in SACGF/variantgrid_private#3676

During/after that have been testing bcftools normalization pretty thoroughly and seems to all work fine

EmmaTudini commented 3 weeks ago

Re-opening this issue, to help with my testing