SACGF / variantgrid


BCFTools liftover - replace NCBI remap which is retired #817

Open davmlaw opened 1 year ago

davmlaw commented 1 year ago

https://ncbiinsights.ncbi.nlm.nih.gov/2023/05/04/remap-tool-to-retire/

davmlaw commented 1 year ago

Here are the numbers:

Shariant prod - NCBI linked 119 of 40620 alleles (0.293 %)
SA Path - NCBI linked 743 of 181911 alleles (0.408 %)

from snpdb.models import VariantAllele, AlleleConversionTool

# note field is 'conversion_tool' in older code
va_ncbi = VariantAllele.objects.filter(allele_linking_tool=AlleleConversionTool.NCBI_REMAP)
va = VariantAllele.objects.all()
perc = va_ncbi.count() * 100 / va.count()
print(f"NCBI linked {va_ncbi.count()} of {va.count()} alleles ({perc} %)")
davmlaw commented 1 year ago

Saving the ~0.5% of alleles that fail ClinGen Allele Registry is worth it, especially as that has a 2k size limit and we're going to be using CNVs

So, the first thing is to evaluate the options. There is the UCSC one, and also:

SNPLift: Fast and accurate conversion of genetic variant coordinates across genome assemblies

Then we need to make NCBI liftover only work historically, and switch in the replacement.

davmlaw commented 10 months ago

To dump out our variants that used NCBI (for testing), use the following:

import pandas as pd

from socket import gethostname
from snpdb.models import GenomeBuild, VariantAllele, AlleleConversionTool
from genes.hgvs import HGVSMatcher

def get_ncbi_from_build(src_genome_build, dest_genome_build):
    src_matcher = HGVSMatcher(src_genome_build)
    dest_matcher = HGVSMatcher(dest_genome_build)
    src_ghgvs = f"{src_genome_build.name}_ghgvs"
    dest_ghgvs = f"{dest_genome_build.name}_ghgvs"

    liftover_from_build = []
    # note: this field was named 'conversion_tool' in older code
    va_ncbi_qs = VariantAllele.objects.filter(allele_linking_tool=AlleleConversionTool.NCBI_REMAP,
                                              genome_build=dest_genome_build)
    for va in va_ncbi_qs:
        v_src = va.allele.variant_for_build(src_genome_build)
        v_dest = va.allele.variant_for_build(dest_genome_build)
        # If this fails, need to use ._asdict().copy()
        data = v_src.coordinate.__dict__.copy()
        data["allele"] = va.allele_id
        data[src_ghgvs] = src_matcher.variant_to_g_hgvs(v_src)
        data[dest_ghgvs] = dest_matcher.variant_to_g_hgvs(v_dest)
        liftover_from_build.append(data)
    return liftover_from_build

def write_csv(src_genome_build, dest_genome_build):
    liftover_from_build = get_ncbi_from_build(src_genome_build, dest_genome_build)
    df = pd.DataFrame.from_records(liftover_from_build)
    filename = f"/tmp/{gethostname()}_ncbi_liftover_from_{src_genome_build}_to_{dest_genome_build}.csv"
    df.to_csv(filename)
    print(f"Wrote '{filename}'")

write_csv(GenomeBuild.grch37(), GenomeBuild.grch38())
write_csv(GenomeBuild.grch38(), GenomeBuild.grch37())
davmlaw commented 8 months ago

@YasirKusay if you could:

Within a few days I will provide you with CSVs of the NCBI liftover results (see comment above)

YasirKusay commented 8 months ago

It was (relatively) OK to set up SNPLift.


davmlaw commented 8 months ago

Looks like the service is now gone; we should change the settings in all environments.

davmlaw commented 8 months ago

I emailed you a list of NCBI liftovers from Shariant and SA Path test prod (vg3upgrade)

So, you should convert these into VCF files. The coordinates are in the first genome build (ie GRCh37 in "GRCh37_to_GRCh38"), so you need to make 4 files (2 × GRCh37, 2 × GRCh38)

The VCF has 2 mandatory header lines: the ##fileformat line and the #CHROM/POS/etc line that shows the column names - see https://davetang.github.io/learning_vcf_file/img/vcf_format.png - heaps of info online

So, make a GRCh37 VCF file from the CSV file columns (chrom/start/ref/alt), and add an "INFO" field eg:

##INFO=<ID=EXPECTED_HGVS,Number=1,Type=String,Description="NCBI g.HGVS to compare with snplift results">

VCF files need to be sorted in chrom/pos order - you can probably do this in Pandas pretty easily as you read the CSV

Then put the GRCh38 g.HGVS into this INFO field for each record of the VCF file

Then run the VCF file through SNPLift, converting from GRCh37 to GRCh38. It should come out the other side with GRCh38 coordinates (chrom/pos/ref/alt) but keep the INFO fields it was given.

Then write a program that goes through the VCF files and compares each VCF chrom/pos/ref/alt with the chrom/pos/ref/alt in the g.HGVS from the INFO field. Remember you may need to adjust by 1 to handle VCF/HGVS conversion.

Get some stats, eg what percentage match, how many SNPLift can't do, etc (I should go back and find out what % NCBI couldn't do)

Also grab a handful of examples where they are different or SNPLift fails, and see if you can spot any patterns. We can then dig into them and see which one was right (could it have been NCBI?)
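
For reference, a minimal sketch of the CSV → sorted VCF step. The column names (chrom/position/ref/alt, GRCh38_ghgvs) and the csv_to_vcf helper are assumptions based on the export script above - adjust to match the actual CSV:

import pandas as pd

def csv_to_vcf(csv_path, vcf_path, hgvs_column="GRCh38_ghgvs"):
    df = pd.read_csv(csv_path)
    # VCF must be sorted by chrom/pos (a lexicographic chrom sort is fine for a test file)
    df = df.sort_values(["chrom", "position"])
    with open(vcf_path, "w") as f:
        f.write("##fileformat=VCFv4.2\n")
        f.write('##INFO=<ID=EXPECTED_HGVS,Number=1,Type=String,Description="NCBI g.HGVS to compare with snplift results">\n')
        f.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
        for row in df.itertuples():
            # carry the NCBI result along so we can compare after liftover
            info = f"EXPECTED_HGVS={getattr(row, hgvs_column)}"
            f.write(f"{row.chrom}\t{row.position}\t.\t{row.ref}\t{row.alt}\t.\t.\t{info}\n")

csv_to_vcf("shariant_ncbi_liftover_from_GRCh37_to_GRCh38.csv", "/tmp/liftover_src_grch37.vcf")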

davmlaw commented 8 months ago

So yeah, to compare the SNPLift results with NCBI, we need to compare the newly produced lifted-over VCF coords with the g.HGVS from the INFO field (this is what NCBI liftover made)

So you can produce a g.HGVS from the VCF, then compare that - you can use this library if you want to get used to it - https://github.com/biocommons/hgvs - or annotate the VCF with VEP (online or offline) to produce g.HGVS

HGVS can be a pain to set up, and you can't use their remote UTA on the SA Path network - maybe try cdot - https://github.com/SACGF/cdot

Or maybe just pull it apart with a regex and add/subtract 1 - you can probably skip anything more complicated than SNVs (ie A>C)
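
A minimal sketch of that regex route, assuming simple substitution g.HGVS strings like NC_000001.11:g.12345A>C (matches_vcf is a hypothetical helper, not existing VG code):

import re
from typing import Optional

# matches simple substitution g.HGVS only, eg "NC_000001.11:g.12345A>C"
GHGVS_SNV = re.compile(r"(?P<contig>NC_[0-9]+\.[0-9]+):g\.(?P<pos>[0-9]+)(?P<ref>[ACGT])>(?P<alt>[ACGT])")

def matches_vcf(ghgvs: str, chrom: str, pos: int, ref: str, alt: str) -> Optional[bool]:
    """True/False if comparable, None if not a simple SNV (skip those)."""
    m = GHGVS_SNV.fullmatch(ghgvs)
    if m is None:
        return None  # indel/dup/etc - handle separately or skip
    # For SNVs, HGVS g. and VCF positions are both 1-based, so they line up;
    # the +/-1 adjustments only matter for indels (VCF anchors on the previous base)
    return (m["contig"] == chrom and int(m["pos"]) == pos
            and m["ref"] == ref and m["alt"] == alt)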

YasirKusay commented 8 months ago

I am up to the "comparison" phase, but would just like to make a few notes about some observations/challenges from using SNPLift.

YasirKusay commented 8 months ago

I have been trying to do some comparisons now but I have encountered a few roadblocks:

Since I am mostly interested in comparing the VCF outputs, I have chosen to rewrite the SNPLift code to include every item regardless of score (and delete duplicate entries). However, I feel this is still something we should talk about next time we meet.

YasirKusay commented 8 months ago

I did encounter an additional issue when working with SNPLift. SNPLift will translate some chromosomes (NC_0000XX) to something else (NW/NT).

YasirKusay commented 8 months ago

I will just comment the results here:

The skipped ones are because they were translated to a non-NC sequence.

Overall it had quite a high error rate.

davmlaw commented 8 months ago

The versioning may be quite difficult to manage. The chromosome name in the vcf files must perfectly match its corresponding sequence inside the "old" genome file

Requiring contig names match exactly is normal - I think of it as a safety feature to make sure you have the right genome.

In VG we use contig names internally; I just exported them as chromosome names because I probably wasn't thinking about it. But when we export we'll do it using contig names, so that's fine.

I wrote up some docs on this here:

https://github.com/SACGF/variantgrid/wiki/VCF-chromosome---contig-names

You should be able to convert them back via instructions on that page
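
For illustration, a minimal sketch of that renaming (the mapping dict here is illustrative - in VG the contig table would be the source of truth):

# rename VCF chromosome names back to RefSeq contig accessions (GRCh38 shown)
GRCH38_CHROM_TO_CONTIG = {
    "chr1": "NC_000001.11",
    "chr2": "NC_000002.12",
    # ... one entry per chromosome ...
    "chrX": "NC_000023.11",
    "chrY": "NC_000024.10",
    "chrM": "NC_012920.1",
}

def rename_vcf_chroms(in_path: str, out_path: str) -> None:
    with open(in_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            if line.startswith("#"):
                f_out.write(line)  # pass headers through unchanged
                continue
            chrom, rest = line.split("\t", 1)
            f_out.write(GRCH38_CHROM_TO_CONTIG.get(chrom, chrom) + "\t" + rest)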

Split into NUM_CPU files

Can you just force it to use 1 file / 1 cpu?

alignment score of its translation is bad (<0.5)

Maybe you could raise an issue on their GitHub and suggest a command line flag to disable this, or make a patch that handles files with <20 entries? To me it should be looking within a certain region, ie at genomic distance, rather than at VCF file entries, as it could be eg chr20 next to chrX - the variants on 1 contig have nothing to do with those on other contigs where mapping is concerned

I did encounter an additional issue when working with SNPLift. SNPLift will translate some chromosomes (NC_0000XX) to something else (NW/NT).

We generally ignore these other contigs as it drastically increases the complexity of things - just regard this as a failure I think

YasirKusay commented 8 months ago

If you regard the translation from NC to something else as a failure, then this program has an extremely high error rate. In the second file alone, these translations affected > 50% of entries.

davmlaw commented 8 months ago

Maybe you could try stripping the fasta files of all of the patch/alt contigs then re-doing the tests? You can probably make a Python script that does something like this (thanks ChatGPT):

import gzip
from Bio import SeqIO

input_file = 'orig.fa.gz'
output_file = 'main_contigs_only.fa.gz'

# both files are gzipped, so open them in text mode for SeqIO
with gzip.open(input_file, 'rt') as f_in, gzip.open(output_file, 'wt') as f_out:
    for record in SeqIO.parse(f_in, 'fasta'):
        if record.id.startswith("NC_"):  # I think this is what we want...
            SeqIO.write(record, f_out, 'fasta')

That would force it to only use those as a destination - hopefully only drop the score a tiny bit

YasirKusay commented 8 months ago

That's actually not a bad idea. I will do that. Hopefully the score goes up this time.

davmlaw commented 8 months ago

As the software is in the pre-print stage, I think they would be pretty keen for feedback - the preprint's corresponding author is davoud.torkamaneh.1@ulaval.ca

But yeah raising an issue then making a pull request for command line options etc would be great

YasirKusay commented 8 months ago

Just got the updated results here:

Overall it had quite a high error rate.

davmlaw commented 8 months ago

It's hard to judge whether it has a high error rate, as this data set is of cases where ClinGen Allele Registry failed - so presumably it's tricky stuff. If we used this set to validate ClinGen Allele Registry, it would get a 100% failure rate, even though it is usually sub-1%.

Maybe it would be worth creating a good benchmarking set as well as this difficult one? Perhaps take a 1k subset of ClinVar where the ClinGen allele ID is available in both 37/38 VCF files - that way we have a way of checking whether it was right.

There are some other liftover tools it might be worth evaluating - see this Biostars post ie:

davmlaw commented 8 months ago

For the benchmark set - I have created subsamples of VCFs before using https://github.com/alexpreynolds/sample

Something like this (untested code, copy/pasted lines from my bash history):

# Get header
zgrep "^#" clinvar_20230430.GRCh38.vcf > clinvar_20230430.GRCh38.10k_sample.vcf
./sample --sample-size=10000 clinvar_20230430.GRCh38.vcf --preserve-order >> clinvar_20230430.GRCh38.10k_sample.vcf
bgzip clinvar_20230430.GRCh38.10k_sample.vcf
YasirKusay commented 8 months ago

Should I also try a database aside from UTA?

davmlaw commented 8 months ago

You can use cdot if you want, but if UTA is set up that should work, as you're only using UTA for HGVS conversion of genomic variants.

ClinVar should have everything in VCF. So to get the liftover truth set: subsample one build, read the ClinVar allele IDs into a set, then go through the other build's VCF and subset out the records whose allele IDs match the ones in your set (rough sketch below).
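
A rough sketch of that subsetting, using plain text parsing (the GRCh37 filename and helper names are assumptions):

import gzip

def vcf_ids(path: str) -> set:
    ids = set()
    with gzip.open(path, "rt") as f:
        for line in f:
            if not line.startswith("#"):
                ids.add(line.split("\t")[2])  # ID column (ClinVar variation ID)
    return ids

def subset_by_id(in_path: str, out_path: str, keep_ids: set) -> None:
    with gzip.open(in_path, "rt") as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            # keep all headers, plus records whose ID is in the truth set
            if line.startswith("#") or line.split("\t")[2] in keep_ids:
                f_out.write(line)

keep = vcf_ids("clinvar_20230430.GRCh38.10k_sample.vcf.gz")
subset_by_id("clinvar_20230430.GRCh37.vcf.gz", "/tmp/clinvar_37_truth_set.vcf", keep)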

YasirKusay commented 7 months ago

Just gonna push out a few new stats. I extracted 1000 variants at random from ClinVar for the GRCh37 build. I also extracted the variants with the same IDs from GRCh38.

Here are the tests for SNPLift, CrossMap and Picard on this new example:

I will soon also test your examples on CrossMap and Picard. That should give me a good indication of which might be the best to use.

YasirKusay commented 7 months ago

(IN PROCESS OF WRITING THIS)

In my opinion, CrossMap is easier to use; Picard had quite stringent format requirements that made using it quite time-consuming/difficult.

YasirKusay commented 7 months ago

shariant_ncbi_liftover_from_GRCh37_to_GRCh38.vcf:

| Program | Failed to Map | Mismatches | Exceptions | Total Errors |
| --- | --- | --- | --- | --- |
| CROSSMAP | 1 | 2 | 9 | 12 |
| PICARD | 2 (1 removed manually) | 0 | 1 | 3 |
| SNPLIFT | 1 | 5 | 6 | 12 |

shariant_ncbi_liftover_from_GRCh38_to_GRCh37.vcf:

| Program | Failed to Map | Mismatches | Exceptions | Total Errors |
| --- | --- | --- | --- | --- |
| CROSSMAP | 3 | 6 | 1 | 11 |
| PICARD | 4 | 2 | 0 | 6 |
| SNPLIFT | 1 | 6 | 0 | 7 |

vg3upgrade_ncbi_liftover_from_GRCh37_to_GRCh38.vcf:

| Program | Failed to Map | Mismatches | Exceptions | Total Errors |
| --- | --- | --- | --- | --- |
| CROSSMAP | 7 | 17 | 1 | 25 |
| PICARD | FAILED | FAILED | FAILED | Failed to Map |
| SNPLIFT | 12 | 33 | 1 | 46 |

vg3upgrade_ncbi_liftover_from_GRCh38_to_GRCh37.vcf:

| Program | Failed to Map | Mismatches | Exceptions | Total Errors |
| --- | --- | --- | --- | --- |
| CROSSMAP | 25 | 50 | 14 | 89 |
| PICARD | 28 | 29 | 45 | 102 |
| SNPLIFT | 27 | 106 | 14 | 147 |

I would say that it's tied between CrossMap and Picard. However, Picard is more difficult to use. I encountered the following issues when using it:

- If the variant entry contains a '=' anywhere (e.g. in "ALT", "INFO", etc), it will fail. This is because it thinks that there is a key (from INFO) being assigned a value.
- I think it also expects that if you have a value inside INFO, it must have an associated key (and that key must be defined in the header). For testing purposes, I removed the lines that contain a '=' from the VCF input.
- I also believe (from some observations) that even minor issues cause the entire program to fail. In my opinion it is quite frustrating to use, and it is often difficult to diagnose when an issue arises.

Also, here is another major issue I encountered when using Picard and CrossMap. They require a chain file, hg38ToHg19.over.chain.gz (or hg19ToHg38.over.chain.gz), to work. These are available from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/ and http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/ respectively.

The issue with these files is that they contain the chromosome indexes only, and they might be difficult to translate due to the structure of the file. To get around this temporarily, I modified the input VCFs and input genome files to use chromosome indexes, translating the VCF chromosome names back after performing the liftover.

This is not an issue with SNPLift; however, instead of using flags, it uses a shell file where you assign the input/output names, etc (these get exported to the environment). I guess we can just skip this step by assigning our own environment variables (or writing a command line flag interface ourselves).

I recommend that we investigate crossmap more.

EmmaTudini commented 7 months ago

@davmlaw @YasirKusay thanks for looking into this. @davmlaw As we're now in November, I just wanted to confirm that we won't see any issues on the Shariant end when NCBI does retire? Particularly thinking about alleles that are unable to be lifted over using ClinGen Allele Registry and therefore try to go down the NCBI liftover route?

davmlaw commented 7 months ago

We have worked out what tool to use, but need to integrate it, do Shariant testing, then deploy.

We have disabled even attempting NCBI liftover, so we won't get errors. But we will probably get 0.5% failing to lift over due to ClinGen Allele Registry failures.

They will sit there with liftover failures until the deploy with the new tool comes through.

EmmaTudini commented 5 months ago

Is there an ETA on this?

I've been looking at the imported alleles and saw that there have been two new alleles missing a genome build (against NM transcripts) since NCBI Remap was retired. That is compared to a total of three such alleles (on NM transcripts) throughout the whole of Shariant. It may be that NCBI liftover was saving more alleles than we anticipated.

davmlaw commented 5 months ago

I hope to get this done within 2-3 weeks

davmlaw commented 5 months ago

bcftools liftover...

"SNVs and indels and it drops less genetic variants than other tools commonly used for the same task such as Picard/LiftoverVcf and CrossMap"

https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae038/7585532?login=true

Have documented installing this

https://github.com/SACGF/variantgrid/wiki/Install-bcftools-liftover

Running VG test: /data/annotation/liftover

davmlaw commented 4 months ago

Another data point - after importing SA Path variant tags - number that failed clingen was 1k out of 179k = 0.55%

davmlaw commented 2 months ago

Initial testing failed - couldn't click "create variant" to make it. This was because that only worked with ClinGen. Fixed.

Testing on VG test, I created a super long variant:

NC_000008.11:g.117822562_117852562dup

This has SVLEN of 30001 and ClinGen fails with ClinGenError: VariationTooLong

But I can now click create variant/liftover. Checking the annotated c.HGVS, it comes out the same:

GRCh37 = NM_000127.3(EXT1):c.963-15361_1320dup
GRCh38 = NM_000127.3(EXT1):c.963-15361_1320dup

TheMadBug commented 1 month ago

@davmlaw is there an equivalent setting to LIFTOVER_NCBI_REMAP_ENABLED for this that can be turned off until full testing is complete? (After the deploy of the somatic data I'd like us to spend a deploy with no new functionality, just using biocommons and this)

davmlaw commented 1 month ago

It's off by default, need to enable via:

LIFTOVER_BCFTOOLS_ENABLED = True

TheMadBug commented 1 month ago

Cheers, currently it's enabled for Shariant Test but not Prod. I'm happy to leave it that way for this deploy, and then do the full testing with the goal of enabling it in prod at the next deploy.

EmmaTudini commented 3 weeks ago

@davmlaw If a variant is lifted over via BCFTools, it doesn't populate g.HGVS. See - https://test.shariant.org.au/variantopedia/view_allele/28939

davmlaw commented 2 weeks ago

g.HGVS not being populated was due to VEP failing to annotate any new variants, not just BCFTools ones

EmmaTudini commented 2 weeks ago

Note from issue https://app.zenhub.com/workspaces/everything-space-5bb3158c4b5806bc2beae448/issues/gh/sacgf/variantgrid/1014

There are 2 separate things here:

1. VCF normalization - to make sure variants in inconsistent input representations end up stored as 1 consistent format in the database (left align, smallest representation etc - sketched below). Prod currently uses "VT" for this.
2. Liftover - conversion between genome builds. Prod currently uses ClinGen Allele Registry (and just converts MT directly as it's the same contig on both builds).

VT was the first tool to introduce normalization, but it does not support symbolic variants at all, has a number of open issues, and hasn't made a release since 2018.

BCFTools is an extremely popular package (samtools is universally used in next gen sequencing) and now supports liftover. They have quickly added support for symbolic variants etc and fixed bugs when I raised them.

So, we are switching to BCFTools for normalization - there is no turning this on/off; it's the only one we will use going forward. It should be the same as VT though (except for working correctly for structural variant normalization - ie this issue)
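
For illustration, a toy sketch of what the normalization step does (vt-style trimming and left-aligning, per Tan et al. 2015; get_base is an assumed reference-genome lookup, and real tools handle many more cases):

def normalize(chrom, pos, ref, alt, get_base):
    # Repeat: right-trim shared trailing bases; if an allele empties out,
    # extend both alleles left with the reference base before them
    changed = True
    while changed:
        changed = False
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        if not ref or not alt:
            pos -= 1
            base = get_base(chrom, pos)
            ref, alt = base + ref, base + alt
            changed = True
    # Left-trim shared leading bases while both alleles keep >= 1 base
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt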

For liftover, we can enable/disable "BCFTools +liftover" (a plugin) as our fallback liftover method via settings. It's currently turned off in the settings; the plan is to enable it in prod one release after the next.

davmlaw commented 3 days ago

VG testing complete; Shariant testing spun off into a new issue