Open andreasprlic opened 2 years ago
Great idea.
How about these two outcomes: 1) a VCF digester, 2) use the digester for performance analysis of the entire stack?
We now have a VCF annotation script in VRS-python. We can have users try out the script and then we could ask for improvement or next steps from the group.
Documentation issue: what does this work for, what does it not?
I tested this using the Genome in a Bottle 4.2.1 benchmark VCF. It worked great except for the star alleles, for which the VRS representation is currently left blank.
CNVs, e.g. from the 1000genomesDRAGEN set? We have converted this for Progenetix, with the CNVs accessible through our Beacon API (e.g. http://progenetix.org/beacon/biosamples/onekgbs-HG00142/g_variants).
IMO the VCF provided by Lajoie et al. would be a good example case to:
Admittedly I haven't explored the current state of the VRS-python tools, but would love to get pointers on what is there and what is missing (CNV et al. related).
@wesleygoar can you put together a 1-3 minute pitch on this topic to give at the start of the day, to entice folks to join in? We are planning on a cutoff of 4 people minimum to officially work as a group on a topic. Every topic lead will pitch.
Submitter Name
Andreas Prlić
Submitter Affiliation
Invitae
Submitter Github Handle
andreasprlic
Additional Submitter Details
No response
Which event day would the project be offered?
Project Details
From VCF to VRS
Build a tool that quickly associates all alleles in a VCF file with their vr-spec representation (hashes?)
Input: VCF, output: a file with a mapping of the alleles to VRS allele hashes.
One problem: Performance of creating the VRS representation is slow-ish. We will want to cache that.
Implementation sketch:
1) Write a tool that, given a VCF file, can scan it.
2) Look up the VCF keys (assembly name, chromosome accession, start, stop, ref, alt) for each allele in a cache. If missing, add the VRS representation to the cache (SQLite? AnyVar?).
3) Write out a file of VCF alleles mapped to VRS hashes. (Details of the representation could be looked up in the cache, or the verbosity of the created output file could be configured via parameters.)
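The three steps could be sketched roughly as below. This is only an illustration under several assumptions: the cache is a single SQLite table, the CHROM column is used as-is as the chromosome identifier, star alleles are written with a blank identifier (as in the test report in the comments), and `placeholder_digest` is a hypothetical stand-in that hashes the lookup key. It does not produce a valid VRS identifier; a real tool would call the VRS-python translation at that point. All function and table names here are made up for the sketch.

```python
import base64
import csv
import hashlib
import sqlite3


def placeholder_digest(key):
    """Hypothetical stand-in for the VRS computation (NOT a valid VRS identifier)."""
    blob = "|".join(str(part) for part in key).encode("utf-8")
    # Truncated SHA-512, base64url-encoded; used here only to get a stable string.
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def open_cache(path=":memory:"):
    """Open (or create) the SQLite cache keyed on the VCF allele fields."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vrs_cache ("
        "assembly TEXT, chrom TEXT, start INTEGER, stop INTEGER, ref TEXT, alt TEXT, "
        "vrs_id TEXT, PRIMARY KEY (assembly, chrom, start, stop, ref, alt))"
    )
    return conn


def lookup_or_add(conn, key, compute=placeholder_digest):
    """Step 2: return the cached identifier, computing and storing it on a miss."""
    row = conn.execute(
        "SELECT vrs_id FROM vrs_cache WHERE assembly=? AND chrom=? AND start=? "
        "AND stop=? AND ref=? AND alt=?",
        key,
    ).fetchone()
    if row is not None:
        return row[0]
    vrs_id = compute(key)
    conn.execute("INSERT INTO vrs_cache VALUES (?,?,?,?,?,?,?)", (*key, vrs_id))
    return vrs_id


def scan_vcf(lines, assembly):
    """Step 1: yield one (assembly, chrom, start, stop, ref, alt) key per ALT allele."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the header line
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):
            # VCF POS is 1-based; stop is taken as the last reference base covered.
            yield (assembly, chrom, pos, pos + len(ref) - 1, ref, alt)


def map_vcf(lines, assembly, conn):
    """Map every allele to an identifier; star alleles get a blank identifier."""
    rows = []
    for key in scan_vcf(lines, assembly):
        vrs_id = "" if key[-1] == "*" else lookup_or_add(conn, key)
        rows.append((*key, vrs_id))
    conn.commit()
    return rows


def write_mapping(rows, handle):
    """Step 3: write the allele-to-identifier mapping as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["assembly", "chrom", "start", "stop", "ref", "alt", "vrs_id"])
    writer.writerows(rows)
```

Typical use would be `rows = map_vcf(open("sample.vcf"), "GRCh38", open_cache("cache.db"))` followed by `write_mapping(rows, open("mapping.tsv", "w"))`, where the file names are placeholders. Keeping the expensive computation behind `compute` means the cache logic can be timed on its own, which fits the performance-analysis idea raised in the comments.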
Required Skills
Python