SACGF / variantgrid


BCFTools liftover - replace NCBI remap which is retired #817

Open davmlaw opened 1 year ago

davmlaw commented 1 year ago

https://ncbiinsights.ncbi.nlm.nih.gov/2023/05/04/remap-tool-to-retire/

davmlaw commented 1 year ago

Here are the numbers:

Shariant prod - NCBI linked 119 of 40620 alleles (0.293 %)
SA Path - NCBI linked 743 of 181911 alleles (0.408 %)

from snpdb.models import VariantAllele, AlleleConversionTool

# note field is 'conversion_tool' in older code
va_ncbi = VariantAllele.objects.filter(allele_linking_tool=AlleleConversionTool.NCBI_REMAP)
va = VariantAllele.objects.all()
perc = va_ncbi.count() * 100 / va.count()
print(f"NCBI linked {va_ncbi.count()} of {va.count()} alleles ({perc} %)")
davmlaw commented 1 year ago

Saving the ~0.5% of alleles that fail ClinGen Allele Registry is worth it, especially as that has a 2k size limit and we're going to be using CNVs

So, the first thing is to evaluate the options. There is the UCSC one, and also:

SNPLift: Fast and accurate conversion of genetic variant coordinates across genome assemblies

Then we need to make NCBI liftover only work historically, and switch in the replacement.

davmlaw commented 10 months ago

To dump out our variants that used NCBI (for testing), use the following:

import pandas as pd

from socket import gethostname
from snpdb.models import GenomeBuild, VariantAllele, AlleleConversionTool
from genes.hgvs import HGVSMatcher

def get_ncbi_from_build(src_genome_build, dest_genome_build):
    src_matcher = HGVSMatcher(src_genome_build)
    dest_matcher = HGVSMatcher(dest_genome_build)
    src_ghgvs = f"{src_genome_build.name}_ghgvs"
    dest_ghgvs = f"{dest_genome_build.name}_ghgvs"

    liftover_from_build = []
    # note: this field was named 'conversion_tool' in older code
    va_ncbi_qs = VariantAllele.objects.filter(allele_linking_tool=AlleleConversionTool.NCBI_REMAP,
                                              genome_build=dest_genome_build)
    for va in va_ncbi_qs:
        v_src = va.allele.variant_for_build(src_genome_build)
        v_dest = va.allele.variant_for_build(dest_genome_build)
        # If this fails, need to use ._asdict().copy()
        data = v_src.coordinate.__dict__.copy()
        data["allele"] = va.allele_id
        data[src_ghgvs] = src_matcher.variant_to_g_hgvs(v_src)
        data[dest_ghgvs] = dest_matcher.variant_to_g_hgvs(v_dest)
        liftover_from_build.append(data)
    return liftover_from_build

def write_csv(src_genome_build, dest_genome_build):
    liftover_from_build = get_ncbi_from_build(src_genome_build, dest_genome_build)
    df = pd.DataFrame.from_records(liftover_from_build)
    filename = f"/tmp/{gethostname()}_ncbi_liftover_from_{src_genome_build}_to_{dest_genome_build}.csv"
    df.to_csv(filename)
    print(f"Wrote '{filename}'")

write_csv(GenomeBuild.grch37(), GenomeBuild.grch38())
write_csv(GenomeBuild.grch38(), GenomeBuild.grch37())
davmlaw commented 8 months ago

@YasirKusay if you could:

Within a few days I will provide you with CSVs of the NCBI liftover results (see comment above)

YasirKusay commented 8 months ago

It was (relatively) OK to set up SNPLift.


davmlaw commented 8 months ago

Looks like the service is now gone; we should change the settings in all environments.

davmlaw commented 8 months ago

I emailed you a list of NCBI liftovers from Shariant and SA Path test prod (vg3upgrade)

So, you should convert these into VCF files. The coordinates are in the first genome build (ie GRCh37 in "GRCh37_to_GRCh38"), so you need to make 4 files (2 × GRCh37, 2 × GRCh38)

The VCF has 2 mandatory header lines: the ##fileformat line and the #CHROM/POS/etc line that shows the column names - see https://davetang.github.io/learning_vcf_file/img/vcf_format.png - heaps of info online

So, make a GRCh37 VCF file from the CSV file columns (chrom/start/ref/alt), and add an "INFO" field eg:

##INFO=<ID=EXPECTED_HGVS,Number=1,Type=String,Description="NCBI g.HGVS to compare with snplift results">

VCF files need to be sorted in chrom/pos order - you can probably do this in Pandas pretty easily as you read the CSV

Then put the GRCh38 g.HGVS into this INFO field for each record of the VCF file

Then run the VCF file through SNPLift, converting from GRCh37 to GRCh38. It should come out the other side with GRCh38 coordinates (chrom/pos/ref/alt) but keep the INFO fields it was given.

Then write a program that goes through the VCF files and compares each VCF chrom/pos/ref/alt with the chrom/pos/ref/alt in the g.HGVS from the INFO field. Remember you may need to adjust by 1 to handle VCF/HGVS conversion.

Get some stats, eg what percentage match, how many SNPLift can't do, etc (I should go back and find out what % NCBI couldn't do)

Also grab a handful of examples where they are different or SNPLift fails, and see if you can spot any patterns. We can then dig into them and see which one was right (could it have been NCBI?)
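
For reference, a minimal sketch of the CSV → sorted VCF step. The column names (chrom/position/ref/alt, GRCh38_ghgvs) and the csv_to_vcf helper are assumptions based on the export script above - adjust to match the actual CSV:

import pandas as pd

def csv_to_vcf(csv_path, vcf_path, hgvs_column="GRCh38_ghgvs"):
    df = pd.read_csv(csv_path)
    # VCF must be sorted by chrom/pos (a lexicographic chrom sort is fine for a test file)
    df = df.sort_values(["chrom", "position"])
    with open(vcf_path, "w") as f:
        f.write("##fileformat=VCFv4.2\n")
        f.write('##INFO=<ID=EXPECTED_HGVS,Number=1,Type=String,Description="NCBI g.HGVS to compare with snplift results">\n')
        f.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
        for row in df.itertuples():
            # carry the NCBI result along so we can compare after liftover
            info = f"EXPECTED_HGVS={getattr(row, hgvs_column)}"
            f.write(f"{row.chrom}\t{row.position}\t.\t{row.ref}\t{row.alt}\t.\t.\t{info}\n")

csv_to_vcf("shariant_ncbi_liftover_from_GRCh37_to_GRCh38.csv", "/tmp/liftover_src_grch37.vcf")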

davmlaw commented 8 months ago

So yeah, to compare the SNPLift results with NCBI, we need to compare the newly produced lifted-over VCF coords with the g.HGVS from the INFO field (this is what NCBI liftover made)

So you can produce a g.HGVS from the VCF, then compare that - you can use this library if you want to get used to it - https://github.com/biocommons/hgvs - or annotate the VCF with VEP (online or offline) to produce g.HGVS

HGVS can be a pain to set up, and you can't use their remote UTA on the SA Path network - maybe try cdot - https://github.com/SACGF/cdot

Or maybe just pull it apart with a regex and add/subtract 1 - you can probably skip anything more complicated than SNVs (ie A>C)
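
A minimal sketch of that regex route, assuming simple substitution g.HGVS strings like NC_000001.11:g.12345A>C (matches_vcf is a hypothetical helper, not existing VG code):

import re
from typing import Optional

# matches simple substitution g.HGVS only, eg "NC_000001.11:g.12345A>C"
GHGVS_SNV = re.compile(r"(?P<contig>NC_[0-9]+\.[0-9]+):g\.(?P<pos>[0-9]+)(?P<ref>[ACGT])>(?P<alt>[ACGT])")

def matches_vcf(ghgvs: str, chrom: str, pos: int, ref: str, alt: str) -> Optional[bool]:
    """True/False if comparable, None if not a simple SNV (skip those)."""
    m = GHGVS_SNV.fullmatch(ghgvs)
    if m is None:
        return None  # indel/dup/etc - handle separately or skip
    # For SNVs, HGVS g. and VCF positions are both 1-based, so they line up;
    # the +/-1 adjustments only matter for indels (VCF anchors on the previous base)
    return (m["contig"] == chrom and int(m["pos"]) == pos
            and m["ref"] == ref and m["alt"] == alt)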

YasirKusay commented 8 months ago

I am up to the "comparison" phase, but would just like to make a few notes about some observations/challenges from using SNPLift.

YasirKusay commented 8 months ago

I have been trying to do some comparisons now but I have encountered a few roadblocks:

Since I am mostly interested in comparing the VCF outputs, I have chosen to rewrite the SNPLift code to include every item regardless of score (and delete duplicate entries). However, I feel this is still something we should talk about next time we meet.

YasirKusay commented 8 months ago

I did encounter an additional issue when working with SNPLift. SNPLift will translate some chromosomes (NC_0000XX) to something else (NW/NT).

YasirKusay commented 8 months ago

I will just comment the results here:

The skipped ones are because they were translated to a non-NC sequence.

Overall it had quite a high error rate.

davmlaw commented 8 months ago

The versioning may be quite difficult to manage. The chromosome name in the vcf files must perfectly match its corresponding sequence inside the "old" genome file

Requiring contig names match exactly is normal - I think of it as a safety feature to make sure you have the right genome.

In VG we use contig names internally; I just exported them as chromosome names because I probably wasn't thinking about it. But when we export we'll do it using contig names, so that's fine.

I wrote up some docs on this here:

https://github.com/SACGF/variantgrid/wiki/VCF-chromosome---contig-names

You should be able to convert them back via instructions on that page
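
For illustration, a minimal sketch of that renaming (the mapping dict here is illustrative - in VG the contig table would be the source of truth):

# rename VCF chromosome names back to RefSeq contig accessions (GRCh38 shown)
GRCH38_CHROM_TO_CONTIG = {
    "chr1": "NC_000001.11",
    "chr2": "NC_000002.12",
    # ... one entry per chromosome ...
    "chrX": "NC_000023.11",
    "chrY": "NC_000024.10",
    "chrM": "NC_012920.1",
}

def rename_vcf_chroms(in_path: str, out_path: str) -> None:
    with open(in_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            if line.startswith("#"):
                f_out.write(line)  # pass headers through unchanged
                continue
            chrom, rest = line.split("\t", 1)
            f_out.write(GRCH38_CHROM_TO_CONTIG.get(chrom, chrom) + "\t" + rest)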

Split into NUM_CPU files

Can you just force it to use 1 file / 1 cpu?

alignment score of its translation is bad (<0.5)

Maybe you could raise an issue on their GitHub and suggest a command line flag to disable this, or make a patch that handles files with <20 entries? To me it should be looking within a certain region, ie at genomic distance, rather than at VCF file entries, as it could be eg chr20 next to chrX - the variants on 1 contig have nothing to do with those on other contigs where mapping is concerned

I did encounter an additional issue when working with SNPLift. SNPLift will translate some chromosomes (NC_0000XX) to something else (NW/NT).

We generally ignore these other contigs as it drastically increases the complexity of things - just regard this as a failure I think

YasirKusay commented 8 months ago

If you regard the translation from NC to something else as a failure, then this program has an extremely high error rate. In the second file alone, these translations affected > 50% of entries.

davmlaw commented 8 months ago

Maybe you could try stripping the fasta files of all of the patch/alt contigs then re-doing the tests? You can probably make a Python script that does something like this (thanks ChatGPT):

import gzip
from Bio import SeqIO

input_file = 'orig.fa.gz'
output_file = 'main_contigs_only.fa.gz'

# both files are gzipped, so open them in text mode for SeqIO
with gzip.open(input_file, 'rt') as f_in, gzip.open(output_file, 'wt') as f_out:
    for record in SeqIO.parse(f_in, 'fasta'):
        if record.id.startswith("NC_"):  # I think this is what we want...
            SeqIO.write(record, f_out, 'fasta')

That would force it to only use those as a destination - hopefully only drop the score a tiny bit

YasirKusay commented 8 months ago

That's actually not a bad idea. I will do that. Hopefully the score goes up this time.

davmlaw commented 8 months ago

As the software is in the pre-print stage, I think they would be pretty keen for feedback - the preprint's corresponding author is davoud.torkamaneh.1@ulaval.ca

But yeah raising an issue then making a pull request for command line options etc would be great

YasirKusay commented 8 months ago

Just got the updated results here:

Overall it had quite a high error rate.

davmlaw commented 8 months ago

It's hard to judge whether it has a high error rate, as this data set is of cases where ClinGen Allele Registry failed - so presumably it's tricky stuff. If we used this set to validate ClinGen Allele Registry, it would get a 100% failure rate, even though it is usually sub-1%.

Maybe it would be worth creating a good benchmarking set as well as this difficult one? Perhaps take a 1k subset of ClinVar where the ClinGen allele ID is available in both 37/38 VCF files - that way we have a way of checking whether it was right.

There are some other liftover tools it might be worth evaluating - see this Biostars post ie:

davmlaw commented 8 months ago

For the benchmark set - I have created subsamples of VCFs before using https://github.com/alexpreynolds/sample

Something like this (untested code, copy/pasted lines from my bash history):

# Get header
zgrep "^#" clinvar_20230430.GRCh38.vcf > clinvar_20230430.GRCh38.10k_sample.vcf
./sample --sample-size=10000 clinvar_20230430.GRCh38.vcf --preserve-order >> clinvar_20230430.GRCh38.10k_sample.vcf
bgzip clinvar_20230430.GRCh38.10k_sample.vcf
YasirKusay commented 8 months ago

Should I also try a database aside from UTA?

davmlaw commented 8 months ago

You can use cdot if you want, but if UTA is set up that should work, as you're only using UTA for HGVS conversion of genomic variants.

ClinVar should have everything in VCF. So to get the liftover truth set: subsample one build, read the ClinVar allele IDs into a set, then go through the other build's VCF and subset out the records whose allele IDs match the ones in your set (rough sketch below).
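
A rough sketch of that subsetting, using plain text parsing (the GRCh37 filename and helper names are assumptions):

import gzip

def vcf_ids(path: str) -> set:
    ids = set()
    with gzip.open(path, "rt") as f:
        for line in f:
            if not line.startswith("#"):
                ids.add(line.split("\t")[2])  # ID column (ClinVar variation ID)
    return ids

def subset_by_id(in_path: str, out_path: str, keep_ids: set) -> None:
    with gzip.open(in_path, "rt") as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            # keep all headers, plus records whose ID is in the truth set
            if line.startswith("#") or line.split("\t")[2] in keep_ids:
                f_out.write(line)

keep = vcf_ids("clinvar_20230430.GRCh38.10k_sample.vcf.gz")
subset_by_id("clinvar_20230430.GRCh37.vcf.gz", "/tmp/clinvar_37_truth_set.vcf", keep)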

YasirKusay commented 7 months ago

Just gonna push out a few new stats. I extracted 1000 variants at random from ClinVar for the GRCh37 build. I also extracted the variants with the same IDs from GRCh38.

Here are the tests for SNPLift, CrossMap and Picard on this new example:

I will soon also test your examples on CrossMap and Picard. That should give me a good indication of which might be the best to use.

YasirKusay commented 7 months ago

(IN PROCESS OF WRITING THIS)

In my opinion, CrossMap is easier to use; Picard had quite stringent format requirements that made using it quite time-consuming/difficult.

YasirKusay commented 7 months ago

shariant_ncbi_liftover_from_GRCh37_to_GRCh38.vcf:

| Program | Failed to Map | Mismatches | Exceptions | Total Errors |
| --- | --- | --- | --- | --- |
| CROSSMAP | 1 | 2 | 9 | 12 |
| PICARD | 2 (1 removed manually) | 0 | 1 | 3 |
| SNPLIFT | 1 | 5 | 6 | 12 |

shariant_ncbi_liftover_from_GRCh38_to_GRCh37.vcf:

| Program | Failed to Map | Mismatches | Exceptions | Total Errors |
| --- | --- | --- | --- | --- |
| CROSSMAP | 3 | 6 | 1 | 11 |
| PICARD | 4 | 2 | 0 | 6 |
| SNPLIFT | 1 | 6 | 0 | 7 |

vg3upgrade_ncbi_liftover_from_GRCh37_to_GRCh38.vcf:

| Program | Failed to Map | Mismatches | Exceptions | Total Errors |
| --- | --- | --- | --- | --- |
| CROSSMAP | 7 | 17 | 1 | 25 |
| PICARD | FAILED | FAILED | FAILED | Failed to Map |
| SNPLIFT | 12 | 33 | 1 | 46 |

vg3upgrade_ncbi_liftover_from_GRCh38_to_GRCh37.vcf:

| Program | Failed to Map | Mismatches | Exceptions | Total Errors |
| --- | --- | --- | --- | --- |
| CROSSMAP | 25 | 50 | 14 | 89 |
| PICARD | 28 | 29 | 45 | 102 |
| SNPLIFT | 27 | 106 | 14 | 147 |

I would say that it's tied between CrossMap and Picard. However, Picard is more difficult to use. I encountered the following issues when using it:

- If the variant entry contains a '=' anywhere (e.g. in "ALT", "INFO", etc), it will fail. This is because it thinks that there is a key (from INFO) being assigned a value.
- I think it also expects that if you have a value inside INFO, it must have an associated key (and that key must be defined in the header). For testing purposes, I removed the lines that contain a '=' from the VCF input.
- I also believe (from some observations) that even minor issues cause the entire program to fail. In my opinion it is quite frustrating to use, and it is often difficult to diagnose when an issue arises.

Also, here is another major issue I encountered when using Picard and CrossMap. They require a chain file, hg38ToHg19.over.chain.gz (or hg19ToHg38.over.chain.gz), to work. These are available from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/ and http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/ respectively.

The issue with these files is that they contain the chromosome indexes only, and they might be difficult to translate due to the structure of the file. To get around this temporarily, I modified the input VCFs and input genome files to use chromosome indexes, translating the VCF chromosome names back after performing the liftover.

This is not an issue with SNPLift; however, instead of using flags, it uses a shell file where you assign the input/output names, etc (these get exported to the environment). I guess we can just skip this step by assigning our own environment variables (or writing a command line flag interface ourselves).

I recommend that we investigate crossmap more.

EmmaTudini commented 7 months ago

@davmlaw @YasirKusay thanks for looking into this. @davmlaw As we're now in November, I just wanted to confirm that we won't see any issues on the Shariant end when NCBI does retire? Particularly thinking about alleles that are unable to be lifted over using ClinGen Allele Registry and therefore try to go down the NCBI liftover route?

davmlaw commented 7 months ago

We have worked out what tool to use, but need to integrate it, do Shariant testing, then deploy.

We have disabled even attempting NCBI liftover, so we won't get errors. But we will probably get 0.5% failing to lift over due to ClinGen Allele Registry failures.

They will sit there with liftover failures until the deploy with the new tool comes through.

EmmaTudini commented 5 months ago

Is there an ETA on this?

I've been looking at the imported alleles and saw that there have been two new alleles missing a genome build (against NM transcripts) since NCBI Remap was retired. That is compared to a total of three such alleles (on NM transcripts) throughout the whole of Shariant. It may be that NCBI liftover was saving more alleles than we anticipated.

davmlaw commented 5 months ago

I hope to get this done within 2-3 weeks

davmlaw commented 5 months ago

bcftools liftover...

"SNVs and indels and it drops less genetic variants than other tools commonly used for the same task such as Picard/LiftoverVcf and CrossMap"

https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae038/7585532?login=true

Have documented installing this

https://github.com/SACGF/variantgrid/wiki/Install-bcftools-liftover

Running VG test: /data/annotation/liftover

davmlaw commented 4 months ago

Another data point - after importing SA Path variant tags - number that failed clingen was 1k out of 179k = 0.55%

davmlaw commented 2 months ago

Initial testing failed - couldn't click "create variant" to make it. This was because that only worked with ClinGen. Fixed.

Testing on VG test, I created a super long variant:

NC_000008.11:g.117822562_117852562dup

This has SVLEN of 30001 and ClinGen fails with ClinGenError: VariationTooLong

But I can now click create variant/liftover. Checking the annotated c.HGVS, it comes out the same:

GRCh37 = NM_000127.3(EXT1):c.963-15361_1320dup
GRCh38 = NM_000127.3(EXT1):c.963-15361_1320dup

TheMadBug commented 1 month ago

@davmlaw is there an equivalent setting to LIFTOVER_NCBI_REMAP_ENABLED for this that can be turned off until full testing is complete? (After the deploy of the somatic data I'd like us to spend a deploy with no new functionality, just using biocommons and this)

davmlaw commented 1 month ago

It's off by default, need to enable via:

LIFTOVER_BCFTOOLS_ENABLED = True

TheMadBug commented 1 month ago

Cheers, currently it's enabled for Shariant Test but not Prod. I'm happy to leave it that way for this deploy, and then do the full testing with the goal of enabling it in prod at the next deploy.

EmmaTudini commented 3 weeks ago

@davmlaw If a variant is lifted over via BCFTools, it doesn't populate g.HGVS. See - https://test.shariant.org.au/variantopedia/view_allele/28939

davmlaw commented 2 weeks ago

g.HGVS not being populated was due to VEP failing to annotate any new variants, not just BCFTools ones

EmmaTudini commented 2 weeks ago

Note from issue https://app.zenhub.com/workspaces/everything-space-5bb3158c4b5806bc2beae448/issues/gh/sacgf/variantgrid/1014

There are 2 separate things here:

1. VCF normalization - to make sure variants in inconsistent input representations end up stored as 1 consistent format in the database (left align, smallest representation etc - sketched below). Prod currently uses "VT" for this.
2. Liftover - conversion between genome builds. Prod currently uses ClinGen Allele Registry (and just converts MT directly as it's the same contig on both builds).

VT was the first tool to introduce normalization, but it does not support symbolic variants at all, has a number of open issues, and hasn't made a release since 2018.

BCFTools is an extremely popular package (samtools is universally used in next gen sequencing) and now supports liftover. They have quickly added support for symbolic variants etc and fixed bugs when I raised them.

So, we are switching to BCFTools for normalization - there is no turning this on/off; it's the only one we will use going forward. It should be the same as VT though (except for working correctly for structural variant normalization - ie this issue)
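
For illustration, a toy sketch of what the normalization step does (vt-style trimming and left-aligning, per Tan et al. 2015; get_base is an assumed reference-genome lookup, and real tools handle many more cases):

def normalize(chrom, pos, ref, alt, get_base):
    # Repeat: right-trim shared trailing bases; if an allele empties out,
    # extend both alleles left with the reference base before them
    changed = True
    while changed:
        changed = False
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        if not ref or not alt:
            pos -= 1
            base = get_base(chrom, pos)
            ref, alt = base + ref, base + alt
            changed = True
    # Left-trim shared leading bases while both alleles keep >= 1 base
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt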

For liftover, we can enable/disable "BCFTools +liftover" (a plugin) as our fallback liftover method via settings. It's currently turned off in the settings; the plan is to enable it in prod one release after the next.

davmlaw commented 3 days ago

VG testing complete; Shariant testing spun off into a new issue