genome-in-a-bottle / giab_data_indexes

This repository contains data indexes from NIST's Genome in a Bottle project.

Add ID column to SV BED file #10

Closed. SHuang-Broad closed this issue 3 years ago

SHuang-Broad commented 3 years ago

Hi,

I constantly make use of the GIAB SV callset and really appreciate the effort of curating all of these.

I do have one feature request:

The SV BED file currently contains only coordinates. It does not include the type of variant each interval is associated with, or the originating variant ID available from the VCF (in HG19). An IGV trick I use constantly is to pack information from the source VCF that I want to see at a glance into the ID (4th) column of the BED file, which IGV displays. That way one doesn't need to click on a VCF record just for a quick look.

I'd appreciate it if the VCF ID records are copied into the BED file.
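For illustration, the ID-column trick described above could be sketched roughly as follows. This is a minimal sketch, not a GIAB tool: the function name `vcf_to_named_bed` and the `ID|SVTYPE|SVLEN` label layout are hypothetical, and it assumes the VCF records carry `SVTYPE`, `SVLEN`, and `END` keys in the INFO column (falling back to the REF length when `END` is absent).

```python
import gzip


def vcf_to_named_bed(vcf_lines):
    """Yield BED4 lines (chrom, start, end, name) from VCF text lines.

    The name column packs the VCF ID plus SVTYPE and SVLEN so that IGV
    can show them without opening the VCF record. The INFO keys used
    here are assumptions about the input file.
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, var_id, ref = fields[0], int(fields[1]), fields[2], fields[3]

        # Parse INFO into a dict; flag-style keys map to an empty string.
        info = {}
        for item in fields[7].split(";"):
            key, _, value = item.partition("=")
            info[key] = value

        # VCF POS is 1-based; BED start is 0-based and the end is exclusive.
        start = pos - 1
        end = int(info["END"]) if info.get("END") else pos + len(ref) - 1
        end = max(end, start + 1)  # keep at least a 1 bp interval

        name = "|".join([var_id, info.get("SVTYPE", "NA"), info.get("SVLEN", "NA")])
        yield f"{chrom}\t{start}\t{end}\t{name}"


def convert_file(vcf_path, bed_path):
    """Convert a (optionally gzipped) VCF file to a BED4 file."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as src, open(bed_path, "w") as dst:
        for bed_line in vcf_to_named_bed(src):
            dst.write(bed_line + "\n")
```

Calling `convert_file` on a local copy of the SV VCF would write a four-column BED that can be loaded into IGV or lifted over like any other BED file. Note that intervals produced this way describe individual variants, which is not the same thing as a set of benchmark regions.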

Thank you! Steve

jzook commented 3 years ago

Thanks for your suggestion! We generally recommend using the Tier 1 VCF file for this information; the Tier 1 BED describes the regions in which we've made (almost) all the SV calls in the VCF. We don't have an easy way to add annotations to the Tier 2 BED since many of the variants are complex, but we are working towards new assembly-based benchmarks to describe these, including one focused on medically relevant genes for which we'll post a draft very soon. In the meantime, you could use a whole-genome hifiasm/dipcall VCF to get one estimate of the potential SV calls in HG002: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_HG002_medical_genes_benchmark_v0.01.00/GRCh37/hifiasm_v0.11

SHuang-Broad commented 3 years ago

Thanks Justin!

My assumption was that the BED file was generated from the VCF ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz. But that might not be true, based on your reply.

The reason I'm working with BED is that we typically work with GRCh38, so lifting over the BED is easy but not the VCF itself.

jzook commented 3 years ago

I apologize for the confusion. As you suspect, the Tier 1 BED has a very different meaning. NCBI has remapped the VCF to GRCh38 at https://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/vcf/nstd175.GRCh38.variant_call.vcf.gz, though there are likely some edge cases that did not remap optimally. You could use this in combination with the whole-genome hifiasm/dipcall VCF for GRCh38 to estimate potential SV calls in HG002: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_HG002_medical_genes_benchmark_v0.01.00/GRCh38/hifiasm_v0.11



SHuang-Broad commented 3 years ago

Thanks for the information Justin!