ClinGen / clincoded

This GCI/VCI 1.0 platform has now been retired, and replaced with our new 2.0 platform:
https://github.com/ClinGen/gene-and-variant-curation-tools/issues

ClinVar data integration #654

Closed wrightmw closed 7 years ago

wrightmw commented 8 years ago

ClinVar - NCBI database containing the relationships between human variations and phenotypes http://www.ncbi.nlm.nih.gov/clinvar/

Download sources:

Complete public data set (XML): ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/

Partial set of short variants (VCF): ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf/

Complete set of summaries about variants or genes (TSV): ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/

Complete set of disease names and gene-disease relationships (TSV): (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/)… NB: this specific set is updated daily

ClinVar data can be accessed via E-utilities (XML/JSON)

Required fields: IN XML Files:

ClinVar VariationID VariationID e.g. 55629

ClinVar Preferred Name VariationName e.g. NM_007294.3(BRCA1):c.5559C>A (p.Tyr1853Ter)

Clinical significance ClinicalSignificance

Description of variant Description e.g. pathogenic

GeneList:

Entrez Gene ID GeneID e.g. 672

HGNC Approved Gene Symbol Symbol e.g. BRCA1

HGNC gene name FullName e.g. breast cancer 1

HGNC ID HGNCID e.g. HGNC:1100

Strand strand e.g. -

OMIM ID OMIM e.g. 113705

Allele:

ClinVar Allele ID AlleleID e.g. 24984

Variant Type VariantType e.g. single nucleotide variant

Cytogenetic Location CytogeneticLocation e.g. Xq21.1

Genome assembly Assembly e.g. GRCh38

Genomic RefSeq Accession Accession e.g. NC_000023.11

Genomic start position start e.g. 78124992

Genomic stop position stop e.g. 78124992

Length of variant variantLength e.g. 1

Reference Allele referenceAllele e.g. C

Alternate Allele alternateAllele e.g. A

Amino acid change ProteinChange e.g. T352N

HGVSList:

HGVS Name Version e.g. NG_008862.1:g.25824C>A

XRefList:

OMIM ID XRef Type="Allelic Variant" DB="OMIM" e.g. 311800.0004

dbSNP ID XRef Type="rs" DB="dbSNP" e.g. 137852530

MolecularConsequenceList:

Molecular Consequence for a specific HGVS term HGVS and SOid and Function e.g. HGVS="NM_000291.3:c.1055C>A" SOid="SO:0001583" Function="missense variant"

ObservationList:

RCV Title and accession RCV Title e.g. "NM_000291.3(PGK1):c.1055C>A (p.Thr352Asn) AND Phosphoglycerate kinase electrophoretic variant PGK II">RCV000010623

Review Status ReviewStatus e.g. no assertion criteria provided

Date Clinical Significance Evaluated DateLastEvaluated e.g. 2012-04-12

PhenotypeList:

Phenotype Name Phenotype Name e.g. Phosphoglycerate kinase electrophoretic variant PGK II

MedGen ID DB=MedGen XRef ID, e.g. CN069394

Submitter SubmitterName e.g. OMIM

Submitter type ReviewStatus e.g. by single submitter, by lab, etc.

Submitter’s organization ID OrgID e.g. 3

Date last submitted DateLastSubmitted e.g. 2010-12-30

SCV accession ClinVarAccession Acc e.g. SCV000030849

SCV version number Version e.g. 1

CitationList:

PubMed ID ID Source=“PubMed” e.g. 5581984

Usage policy: Open source and open access

Update frequency/Release cycle: Updated first Thursday of the month

wrightmw commented 8 years ago

Sample of ClinVar XML output (for VariationID:9945; NM_000291.3(PGK1):c.1055C>A (p.Thr352Asn)): Sample_ClinVar_XML.docx

wrightmw commented 8 years ago

@kgliu0101 Can you please provide a list of all the field/column headers in the ClinVar XML output?

kgliu0101 commented 8 years ago

@wrightmw, the sample is in their old version. Actually, we are retrieving ClinVar variant data from the NCBI eutils API in gene curation. You can get the XML data for variation ID 9945 at https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&rettype=variation&id=9945 Please note that this data set is variant-based (VariationReport), and its fields/columns are totally different from the RCV-based one (ClinVarSet). Please have a look at the link above and let me know which one I should list.
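For reference, the variant-based request above can be sketched in Python. This is a minimal sketch, not the code used in the curation interface: the base URL and the `db`, `rettype` and `id` parameters are copied from the link in this comment and may have changed at NCBI since, and the function names `build_efetch_url` and `fetch_variation_xml` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL copied from the link in this thread; NCBI may have changed it since.
EFETCH_BASE = "https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(variation_id: int) -> str:
    """Build the variant-based (VariationReport) efetch URL for one VariationID."""
    # Parameter names and values are the ones visible in the thread's link.
    params = {"db": "clinvar", "rettype": "variation", "id": variation_id}
    return f"{EFETCH_BASE}?{urlencode(params)}"


def fetch_variation_xml(variation_id: int, timeout: float = 30.0) -> bytes:
    """Download the raw XML for one variation (makes a network call)."""
    with urlopen(build_efetch_url(variation_id), timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Only prints the URL; call fetch_variation_xml(9945) to download the XML.
    print(build_efetch_url(9945))
```

Keeping URL construction separate from the download makes it easy to check the request without touching the network.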

wrightmw commented 8 years ago

@kgliu0101 Can you please provide a list of both? i.e. two lists... one list of the headers/names that ClinVar uses for its data fields as supplied by the NCBI eutils API... and one list of the headers/names for the ClinVar data fields exported in their XML files.

kgliu0101 commented 8 years ago

@wrightmw, Yes I can. Where would you like me to write them, in your data integration sheet?

wrightmw commented 8 years ago

Can you please put them in this ticket in github?

kgliu0101 commented 8 years ago

There are a lot of fields, and I am not sure how to format them here directly. Maybe I will write them in an Excel sheet and attach it here.

wrightmw commented 8 years ago

ok...thanks

kgliu0101 commented 8 years ago

For variant-based xml ClinVarXML_field.xlsx

Working on the other now.

kgliu0101 commented 8 years ago

@wrightmw There are a lot of phenotype and clinical fields in the 2 XML files. Do you need them?

kgliu0101 commented 8 years ago

@wrightmw Have you had a chance to look at the sheet above? I'd like to know if it's what you want. Or should I point out where the items listed in your Data Integration CVI are located in the XML?

wrightmw commented 8 years ago

@kgliu0101 Yes thanks, I'm looking into these now. I'm comparing the different data returned by the JSON and XML for different examples. I don't need further input from you at the moment. I am currently writing the list of ClinVar fields that will be required for the VCI and once this is complete (later today) I will add it to this ticket in GitHub for you to see.

kgliu0101 commented 8 years ago

@wrightmw Thx. I will stop adding fields from the RCV-based XML. If you are looking at the JSON from NCBI eSummary for a variant, note that no HGVS terms are included. However, the XML from eutils does have them.

wrightmw commented 8 years ago

@kgliu0101 I went through the field differences between the XML and JSON, and I agree that the difference is largely in the HGVS terms, which don't seem to be available in the JSON format. However, when I look at your XML output in the file you provided above (ClinVarXML_field.xlsx) there are missing fields, e.g. I can see the ClinVar VariationID is not in your list but when I looked at the XML it was an available field (see the list of fields I retrieved from the XML format below). Why is VariationID missing from your XML file?

XML ClinVar fields formatted as 'our name'/'XML field name'/'example':

XML:

ClinVar VariationID VariationID e.g. 55629

ClinVar Preferred Name VariationName e.g. NM_007294.3(BRCA1):c.5559C>A (p.Tyr1853Ter)

Clinical significance ClinicalSignificance

Description of variant Description e.g. pathogenic

GeneList:

Entrez Gene ID GeneID e.g. 672

HGNC Approved Gene Symbol Symbol e.g. BRCA1

HGNC gene name FullName e.g. breast cancer 1

HGNC ID HGNCID e.g. HGNC:1100

Strand strand e.g. -

OMIM ID OMIM e.g. 113705

Allele:

ClinVar Allele ID AlleleID e.g. 24984

Variant Type VariantType e.g. single nucleotide variant

Cytogenetic Location CytogeneticLocation e.g. Xq21.1

Genome assembly Assembly e.g. GRCh38

Genomic RefSeq Accession Accession e.g. NC_000023.11

Genomic start position start e.g. 78124992

Genomic stop position stop e.g. 78124992

Length of variant variantLength e.g. 1

Reference Allele referenceAllele e.g. C

Alternate Allele alternateAllele e.g. A

Amino acid change ProteinChange e.g. T352N

HGVSList:

HGVS Name Version e.g. NG_008862.1:g.25824C>A

XRefList:

OMIM ID XRef Type="Allelic Variant" DB="OMIM" e.g. 311800.0004

dbSNP ID XRef Type="rs" DB="dbSNP" e.g. 137852530

MolecularConsequenceList:

Molecular Consequence for a specific HGVS term HGVS and SOid and Function e.g. HGVS="NM_000291.3:c.1055C>A" SOid="SO:0001583" Function="missense variant"

ObservationList:

RCV Title and accession RCV Title e.g. "NM_000291.3(PGK1):c.1055C>A (p.Thr352Asn) AND Phosphoglycerate kinase electrophoretic variant PGK II">RCV000010623

Review Status ReviewStatus e.g. no assertion criteria provided

Date Clinical Significance Evaluated DateLastEvaluated e.g. 2012-04-12

PhenotypeList:

Phenotype Name Phenotype Name e.g. Phosphoglycerate kinase electrophoretic variant PGK II

MedGen ID DB=MedGen XRef ID, e.g. CN069394

Submitter SubmitterName e.g. OMIM

Submitter type ReviewStatus e.g. by single submitter, by lab, etc.

Submitter’s organization ID OrgID e.g. 3

Date last submitted DateLastSubmitted e.g. 2010-12-30

SCV accession ClinVarAccession Acc e.g. SCV000030849

SCV version number Version e.g. 1

CitationList:

PubMed ID ID Source=“PubMed” e.g. 55819844
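A few of the fields listed above could be pulled out of the XML along these lines. This is only a sketch: the sample document is hypothetical and heavily simplified, its element and attribute names loosely follow the list above (with the allele attribute spelled `AlleleID`), and the nesting is an assumption rather than the real ClinVar schema. The function name `extract_fields` and the output keys are also made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML. Names loosely follow the field list above;
# the nesting is an assumption, not the actual ClinVar VariationReport layout.
SAMPLE_XML = """
<VariationReport VariationID="9945"
                 VariationName="NM_000291.3(PGK1):c.1055C&gt;A (p.Thr352Asn)">
  <Allele AlleleID="24984">
    <VariantType>single nucleotide variant</VariantType>
    <HGVSList>
      <HGVS>NG_008862.1:g.25824C&gt;A</HGVS>
    </HGVSList>
    <XRefList>
      <XRef Type="Allelic Variant" DB="OMIM" ID="311800.0004"/>
      <XRef Type="rs" DB="dbSNP" ID="137852530"/>
    </XRefList>
  </Allele>
</VariationReport>
"""


def extract_fields(xml_text: str) -> dict:
    """Collect a handful of variant fields from the simplified XML above."""
    root = ET.fromstring(xml_text.strip())
    allele = root.find("Allele")
    return {
        # Attributes on the root element
        "variation_id": root.get("VariationID"),
        "preferred_name": root.get("VariationName"),
        # Attribute and child text on the Allele element (if present)
        "allele_id": allele.get("AlleleID") if allele is not None else None,
        "variant_type": root.findtext("Allele/VariantType"),
        # Every HGVS expression anywhere in the document
        "hgvs": [element.text for element in root.iter("HGVS")],
        # Cross-references filtered by their DB attribute
        "dbsnp_ids": [
            xref.get("ID") for xref in root.iter("XRef") if xref.get("DB") == "dbSNP"
        ],
    }


if __name__ == "__main__":
    for key, value in extract_fields(SAMPLE_XML).items():
        print(f"{key}: {value}")
```

Note that several items in the list are XML attributes (e.g. `VariationID`, `Type`, `DB`) rather than element text, so a flat list of tag names alone would not surface them.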

kgliu0101 commented 8 years ago

@wrightmw Sorry, I didn't list the attributes of the XML tags. I did ask earlier whether I should point them out, but maybe I wasn't clear enough.

One question: ClinVar provides allele frequencies for some variants, e.g. https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&rettype=variation&id=42652 Should we retrieve them?

wrightmw commented 8 years ago

@kgliu0101 I would say no to retrieving the allele frequencies from ClinVar since they do not provide them for every entry. We should retrieve the allele frequencies either directly from ESP, 1KG and ExAC, or all in one go from an aggregator such as VEP.

selinad commented 8 years ago

Thanks for all the hard work going on! Sitting across from Steven, who wasn't certain how to answer - how hard would it be to work in later?

wrightmw commented 8 years ago

@selinad I hope you are enjoying DC! With respect to the VCI we were planning on using the NC genomic HGVS expression to bring in the allele frequency data from ESP, 1KG and ExAC for all entries... it's my understanding that the allele frequency data in ClinVar is not always provided. Would you suggest only showing ClinVar allele frequency data if it is available? If you think there is a possibility that this information could be useful to us then we should not exclude the ClinVar allele frequency data in our output from the NCBI API. Let's just keep it.
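As a rough illustration of picking out the NC genomic HGVS expression from a variant's HGVS list, here is a minimal sketch. The function name `pick_genomic_hgvs` is hypothetical, and the `NC_` string in the example is assembled from the accession, start position and alleles in the field list rather than copied from ClinVar, so treat it as illustrative only.

```python
from typing import Iterable, Optional


def pick_genomic_hgvs(expressions: Iterable[str]) -> Optional[str]:
    """Return the first HGVS expression on an NC_ (chromosome) accession, if any."""
    for expression in expressions:
        # Split "ACCESSION:change" at the first colon.
        accession, _, change = expression.partition(":")
        if accession.startswith("NC_") and change.startswith("g."):
            return expression
    return None


if __name__ == "__main__":
    hgvs_terms = [
        "NM_000291.3:c.1055C>A",
        "NG_008862.1:g.25824C>A",
        # Illustrative only: built from Accession, start and alleles in the list.
        "NC_000023.11:g.78124992C>A",
    ]
    print(pick_genomic_hgvs(hgvs_terms))
```

If a variant has NC expressions on more than one assembly, the caller would still need to choose the assembly that the frequency source expects; this sketch does not attempt that.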

selinad commented 8 years ago

@wrightmw - Steven was uncertain and we are forging ahead on the model. I'll try to get you a firmer answer - if it's reasonable to do later, we can wait.

kgliu0101 commented 8 years ago

Thanks a lot, both of you.

Another question: is it possible that we will have variants from other outside sources beyond ClinVar and CAR? Right now, variants can be grouped as coming from ClinVar and/or being registered in CAR (brand new).

wrightmw commented 8 years ago

@selinad Sounds like you are doing a great job of extracting data from Steven's brain.

selinad commented 8 years ago

lol. It would have been good to set some time to meet with him, but we are going through all the criteria for data modeling purposes, so getting a better idea

wrightmw commented 8 years ago

@kgliu0101 It's my understanding that if curators present the VCI with a variant that is not in ClinVar then they will have to get a canonical allele ID (CA ID) from the CAR in order to proceed. Therefore there will always be a VariationID and/or CA ID for each variant. @selinad is that right? Is registration with the CAR a requirement?
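If that understanding is confirmed, the rule could be enforced on the variant record roughly as follows. This is a sketch under that assumption only: the class name `VariantIdentifiers` and its field names are hypothetical and are not taken from the actual variant schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariantIdentifiers:
    """Identifiers for one variant; at least one of the two must be set."""

    clinvar_variation_id: Optional[str] = None  # e.g. "9945"
    car_id: Optional[str] = None  # canonical allele ID (CA ID) from the CAR

    def __post_init__(self) -> None:
        # Reject records that carry neither identifier.
        if not (self.clinvar_variation_id or self.car_id):
            raise ValueError(
                "a variant needs a ClinVar VariationID and/or a CAR CA ID"
            )


if __name__ == "__main__":
    print(VariantIdentifiers(clinvar_variation_id="9945"))
    try:
        VariantIdentifiers()
    except ValueError as error:
        print(f"rejected: {error}")
```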

kgliu0101 commented 8 years ago

@wrightmw Great to know that. I think it's time to edit the variant schema now. Ming and Jimmy will need it and test data very soon.

kgliu0101 commented 8 years ago

@wrightmw Does ticket #668 (Create ClinVar modal for the Variant Selection page) include work to edit the variant schema? If yes, I am going to assign it to myself. If not, I am going to create a new ticket.

wrightmw commented 8 years ago

@kgliu0101 Ticket #668 is just for creating the ClinVar modal for the Variant Selection page. I think work to edit the underlying variant schema should go in a separate ticket.

kgliu0101 commented 8 years ago

Thanks Matt.

wrightmw commented 8 years ago

This ticket is for integrating ClinVar data for the data portal. A separate ticket has now been created for accessing ClinVar data via an API for the variant curation interface: ticket #673

kgliu0101 commented 8 years ago

@wrightmw Thanks a lot for pointing that out.

selinad commented 8 years ago

Hi @wrightmw just looking through this and see I missed a question.

Yes, here is the order:

wrightmw commented 7 years ago

Closed. This was implemented in the first VCI release.