ga4gh / vrs-hackathons

Project tracking for GA4GH Variation Representation Specification hackathons
Apache License 2.0

From VCF to VRS #6

Open andreasprlic opened 2 years ago

andreasprlic commented 2 years ago

Submitter Name

Andreas Prlić

Submitter Affiliation

Invitae

Submitter Github Handle

andreasprlic

Additional Submitter Details

No response

Which event day would the project be offered?

Project Details

From VCF to VRS

Build a tool that quickly associates all alleles in a VCF file with their vr-spec representation (hashes?).

Input: VCF, output: a file with a mapping of the alleles to VRS allele hashes.

One problem: creating the VRS representation is slow-ish, so we will want to cache the results.

Implementation sketch:

1) Write a tool that scans a given VCF file.
2) Look up the VCF keys (assembly name, chromosome accession, start, stop, ref, alt) for each allele in a cache. If an allele is missing, add its VRS representation to the cache (SQLite? AnyVar?).
3) Write out a file of VCF alleles mapped to VRS hashes. Details of the representation could be looked up in the cache, or the verbosity of the output file could be configured via parameters.
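The three steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the VCF is plain text (not bgzipped), the CHROM column is used as-is rather than translated to a chromosome accession, the cache is a single SQLite table, and `placeholder_vrs_id` is a hypothetical stand-in for the slow VRS computation (it is not the VRS digest algorithm). All function and table names are made up for this sketch.

```python
import hashlib
import sqlite3
from typing import Callable, Iterator, TextIO, Tuple

# (assembly, chrom, start, stop, ref, alt)
AlleleKey = Tuple[str, str, int, int, str, str]


def iter_allele_keys(vcf: TextIO, assembly: str) -> Iterator[AlleleKey]:
    """Step 1: scan a plain-text VCF and yield one key per ALT allele."""
    for line in vcf:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        start = int(pos)             # VCF POS is 1-based
        stop = start + len(ref) - 1  # inclusive end of the REF allele
        for alt in alts.split(","):
            if alt != ".":           # "." means no ALT allele
                yield (assembly, chrom, start, stop, ref, alt)


def placeholder_vrs_id(key: AlleleKey) -> str:
    """HYPOTHETICAL stand-in for the expensive step; NOT a real VRS digest."""
    text = "|".join(map(str, key)).encode()
    return "placeholder:" + hashlib.sha512(text).hexdigest()[:32]


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the SQLite cache keyed on the VCF allele fields."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS allele_cache ("
        "assembly TEXT, chrom TEXT, pos_start INTEGER, pos_stop INTEGER, "
        "ref TEXT, alt TEXT, vrs_id TEXT, "
        "PRIMARY KEY (assembly, chrom, pos_start, pos_stop, ref, alt))"
    )
    return conn


def lookup_or_compute(
    conn: sqlite3.Connection,
    key: AlleleKey,
    compute: Callable[[AlleleKey], str] = placeholder_vrs_id,
) -> str:
    """Step 2: return the cached identifier, computing and storing it on a miss."""
    row = conn.execute(
        "SELECT vrs_id FROM allele_cache WHERE assembly=? AND chrom=? "
        "AND pos_start=? AND pos_stop=? AND ref=? AND alt=?",
        key,
    ).fetchone()
    if row is not None:
        return row[0]
    vrs_id = compute(key)
    conn.execute("INSERT INTO allele_cache VALUES (?,?,?,?,?,?,?)", (*key, vrs_id))
    return vrs_id


def map_vcf(vcf: TextIO, out: TextIO, conn: sqlite3.Connection, assembly: str) -> int:
    """Step 3: write one tab-separated line per allele; return the allele count."""
    count = 0
    for key in iter_allele_keys(vcf, assembly):
        out.write("\t".join(map(str, key)) + "\t" + lookup_or_compute(conn, key) + "\n")
        count += 1
    conn.commit()
    return count
```

Passing a file path to `open_cache` instead of the default `":memory:"` would let the cache persist between runs, which is where the caching would pay off. The `compute` argument is the place where a real VRS translation call would be plugged in.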

Required Skills

Python

reece commented 2 years ago

Great idea.

How about these two outcomes: 1) a VCF digester, 2) use the digester for perf analysis of the entire stack?

wesleygoar commented 2 years ago

We now have a VCF annotation script in VRS-python. We can have users try out the script and then we could ask for improvement or next steps from the group.

ahwagner commented 2 years ago

Documentation issue: what does this work for, what does it not?

wesleygoar commented 2 years ago

I tested this using the Genome in a Bottle 4.2.1 benchmark VCF. It worked great except for the star alleles, which are currently left blank in the VRS representation.
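For readers unfamiliar with star alleles: in VCF, an ALT value of `*` marks an allele that is missing because of an overlapping upstream deletion, so there is no sequence to translate. A minimal sketch of the "leave it blank" behaviour might look like the following. It is not taken from the VRS-python script; `ids_for_alts` and the `to_vrs_id` callback are hypothetical names, and the sketch assumes one output slot per ALT allele so that positions stay aligned with the ALT column.

```python
from typing import Callable, List


def ids_for_alts(alt_field: str, to_vrs_id: Callable[[str], str]) -> List[str]:
    """Return one entry per ALT allele; star ('*') and missing ('.') alleles stay blank."""
    return [
        "" if alt in ("*", ".") else to_vrs_id(alt)
        for alt in alt_field.split(",")
    ]
```

Keeping the blank entry in place, rather than dropping it, means the n-th identifier still corresponds to the n-th ALT allele of the record.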

mbaudis commented 2 years ago

CNVs, e.g. from the 1000genomesDRAGEN set? We have converted this for Progenetix, with the CNVs accessible through our Beacon API (e.g. http://progenetix.org/beacon/biosamples/onekgbs-HG00142/g_variants).

IMO the VCF provided by Lajoie et al. would be a good example case to:

Admittedly I haven't explored the current state of the VRS-python tools, but I would love to get pointers on what is there and what is missing (CNV et al. related).

larrybabb commented 2 years ago

@wesleygoar can you put together a 1-3 minute pitch on this topic to give at the start of the day, to try to entice folks to join in? We are planning on a cutoff of 4 people minimum to officially work as a group on a topic. Every topic lead will pitch.