Closed wrightmw closed 7 years ago
Sample of ClinVar XML output (for VariationID:9945; NM_000291.3(PGK1):c.1055C>A (p.Thr352Asn)): Sample_ClinVar_XML.docx
@kgliu0101 Can you please provide a list of all the field/column headers in the ClinVar XML output?
@wrightmw, the sample is in their old version. Actually, we are retrieving ClinVar variant data from the NCBI eutils API in gene curation. You can get the XML data for variation ID 9945 at https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&rettype=variation&id=9945 Please note that this data set is variant-based (VariationReport), and its fields/columns are totally different from the RCV-based one (ClinVarSet). Please have a look at the link above and let me know which one I should list.
@kgliu0101 Can you please provide a list of both? i.e. two lists... one list of the headers/names provided for ClinVar for their data fields as supplied by the NCBI eutils API... and one list of the headers/names for the ClinVar data fields they export in their XML files.
@wrightmw, Yes I can. Where would you like me to write them, in your data integration sheet?
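For reference, the request above can be assembled programmatically. This is a minimal sketch that only rebuilds the URL quoted in the comment; the function name `clinvar_variation_url` is made up for illustration, and the `db`, `rettype` and `id` parameters are copied from that link, so they may differ if NCBI changes the service.

```python
from urllib.parse import urlencode

# Base endpoint taken from the link in the comment above.
EFETCH_URL = "https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def clinvar_variation_url(variation_id):
    """Build the efetch URL for a variant-based (VariationReport) record."""
    params = {"db": "clinvar", "rettype": "variation", "id": str(variation_id)}
    return EFETCH_URL + "?" + urlencode(params)


print(clinvar_variation_url(9945))
```

The resulting string can be passed to any HTTP client; no request is made in the sketch itself.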
Can you please put them in this ticket in github?
There are a lot of fields, and I am not sure how to format them here directly. Maybe I will write them in an Excel sheet and attach it here.
ok...thanks
For the variant-based XML: ClinVarXML_field.xlsx
Working on the other one now.
@wrightmw There are a lot of phenotype and clinical fields in the two XML files. Do you need them?
@wrightmw Have you had a chance to look at the sheet above? I'd like to know if it's what you want. Or should I point out where the items listed in your Data Integration CVI are located in the XML?
@kgliu0101 Yes thanks, I'm looking into these now. I'm comparing the different data returned by the JSON and XML for different examples. I don't need further input from you at the moment. I am currently writing the list of ClinVar fields that will be required for the VCI and once this is complete (later today) I will add it to this ticket in GitHub for you to see.
@wrightmw Thx. I will stop adding fields from the RCV-based XML. If you are looking at the JSON from NCBI eSummary for a variant, note there are no HGVS terms included. However, the XML from eutils does have them.
@kgliu0101 I went through the field differences between the XML and JSON, and I agree that largely the difference is in HGVS terms which don't seem to be available in the JSON format. However, when I look at your XML output in the file you provided above (ClinVarXML_field.xlsx) there are missing fields, e.g. I can see the ClinVar VariationID is not in your list but when I looked at the XML it was an available field (see a list of fields I retrieved from the XML format below). Why is VariationID missing from your XML file?:
XML ClinVar fields formatted as 'our name' / 'XML field name' / 'example':
XML:
ClinVar VariationID / VariationID / e.g. 55629
ClinVar Preferred Name / VariationName / e.g. NM_007294.3(BRCA1):c.5559C>A (p.Tyr1853Ter)
Clinical significance / ClinicalSignificance
Description of variant / Description / e.g. pathogenic
GeneList:
Entrez Gene ID / GeneID / e.g. 672
HGNC Approved Gene Symbol / Symbol / e.g. BRCA1
HGNC gene name / FullName / e.g. breast cancer 1
HGNC ID / HGNCID / e.g. HGNC:1100
Strand / strand / e.g. -
OMIM ID / OMIM / e.g. 113705
Allele:
ClinVar Allele ID / AlleleID / e.g. 24984
Variant Type / VariantType / e.g. single nucleotide variant
Cytogenetic Location / CytogeneticLocation / e.g. Xq21.1
Genome assembly / Assembly / e.g. GRCh38
Genomic RefSeq Accession / Accession / e.g. NC_000023.11
Genomic start position / start / e.g. 78124992
Genomic stop position / stop / e.g. 78124992
Length of variant / variantLength / e.g. 1
Reference Allele / referenceAllele / e.g. C
Alternate Allele / alternateAllele / e.g. A
Amino acid change / ProteinChange / e.g. T352N
HGVSList:
HGVS Name / Version / e.g. NG_008862.1:g.25824C>A
XRefList:
OMIM ID / XRef Type="Allelic Variant" DB="OMIM" / e.g. 311800.0004
dbSNP ID / XRef Type="rs" DB="dbSNP" / e.g. 137852530
MolecularConsequenceList:
Molecular Consequence for a specific HGVS term / HGVS and SOid and Function / e.g. HGVS="NM_000291.3:c.1055C>A" SOid="SO:0001583" Function="missense variant"
ObservationList:
RCV Title and accession / RCV Title / e.g. "NM_000291.3(PGK1):c.1055C>A (p.Thr352Asn) AND Phosphoglycerate kinase electrophoretic variant PGK II">RCV000010623
Review Status / ReviewStatus / e.g. no assertion criteria provided
Date Clinical Significance Evaluated / DateLastEvaluated / e.g. 2012-04-12
PhenotypeList:
Phenotype Name / Phenotype Name / e.g. Phosphoglycerate kinase electrophoretic variant PGK II
MedGen ID / DB=MedGen XRef ID / e.g. CN069394
Submitter / SubmitterName / e.g. OMIM
Submitter type / ReviewStatus / e.g. by single submitter, by lab, etc…
Submitter's organization ID / OrgID / e.g. 3
Date last submitted / DateLastSubmitted / e.g. 2010-12-30
SCV accession / ClinVarAccession Acc / e.g. SCV000030849
SCV version number / Version / e.g. 1
CitationList:
PubMed ID / ID Source="PubMed" / e.g. 5581984
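A few of the fields listed above could be pulled out of the XML with the standard library, roughly as sketched here. The sample fragment is hypothetical: the field names and example values come from the list, but the nesting and the choice of attributes versus child elements are assumptions and should be checked against the real efetch response.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: names and values are from the list above,
# but the element/attribute layout is assumed for illustration only.
SAMPLE_XML = """\
<VariationReport VariationID="55629"
                 VariationName="NM_007294.3(BRCA1):c.5559C&gt;A (p.Tyr1853Ter)">
  <GeneList>
    <Gene GeneID="672" Symbol="BRCA1" FullName="breast cancer 1"
          HGNCID="HGNC:1100" strand="-"/>
  </GeneList>
</VariationReport>
"""


def extract_fields(xml_text):
    """Collect the top-level variant fields and the per-gene fields."""
    root = ET.fromstring(xml_text)
    record = {
        "variation_id": root.get("VariationID"),
        "preferred_name": root.get("VariationName"),
        "genes": [],
    }
    for gene in root.findall("./GeneList/Gene"):
        record["genes"].append({
            "entrez_gene_id": gene.get("GeneID"),
            "symbol": gene.get("Symbol"),
            "hgnc_id": gene.get("HGNCID"),
            "strand": gene.get("strand"),
        })
    return record


record = extract_fields(SAMPLE_XML)
print(record["variation_id"], record["genes"][0]["symbol"])
```

`Element.get` returns `None` for an absent attribute, so a missing field such as VariationID shows up as `None` in the record rather than raising an error.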
@wrightmw Sorry, I didn't list the properties of the XML tags. I asked earlier whether I should point them out, but maybe I was not clear enough.
One q. ClinVar provides allele frequency for some variants, like https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&rettype=variation&id=42652 should we retrieve them?
@kgliu0101 I would say no to retrieving the allele frequencies from ClinVar since they do not provide them for every entry. We should retrieve the allele frequencies either directly from ESP, 1KG and ExAC or all in one go from an aggregator such as VEP.
Thanks for all the hard work going on! I'm sitting across from Steven, who wasn't certain of the answer. How hard would it be to work this in later?
@selinad I hope you are enjoying DC! With respect to the VCI we were planning on using the NC genomic HGVS expression to bring in the allele frequency data from ESP, 1KG and ExAC for all entries... it's my understanding that the allele frequency data in ClinVar is not always provided. Would you suggest only showing ClinVar allele frequency data if it is available? If you think there is a possibility that this information could be useful to us then we should not exclude the ClinVar allele frequency data in our output from the NCBI API. Let's just keep it.
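Keeping the ClinVar allele frequencies only when an entry supplies them could look like the sketch below. The `AlleleFrequencyList`/`AlleleFrequency` element names and the `Value`/`Type` attributes are assumptions made for illustration and are not confirmed against the XML; the frequency value is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs: one allele with a frequency entry, one without.
WITH_FREQ = (
    '<Allele><AlleleFrequencyList>'
    '<AlleleFrequency Value="0.001" Type="example-source"/>'
    '</AlleleFrequencyList></Allele>'
)
WITHOUT_FREQ = "<Allele/>"


def clinvar_allele_frequencies(allele_xml):
    """Return the frequency entries if present, else an empty list."""
    allele = ET.fromstring(allele_xml)
    return [dict(freq.attrib) for freq in allele.iter("AlleleFrequency")]


print(clinvar_allele_frequencies(WITH_FREQ))
print(clinvar_allele_frequencies(WITHOUT_FREQ))
```

An empty list lets the caller fall back to ESP, 1KG and ExAC without special-casing entries that lack ClinVar frequencies.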
@wrightmw - Steven was uncertain and we are forging ahead on model. I'll try to get you a more firm answer - if it's reasonable to do later, we can wait.
Thanks a lot, both of you.
Another q. Is it possible that we will have variants from other outside sources beyond ClinVar and CAR? Right now, variants can be grouped as either coming from ClinVar and/or being registered in CAR (brand new).
@selinad Sounds like you are doing a great job of extracting data from Steven's brain.
lol. It would have been good to set some time to meet with him, but we are going through all the criteria for data modeling purposes, so getting a better idea
@kgliu0101 It's my understanding that if curators present the VCI with a variant that is not in ClinVar then they will have to get a canonical allele ID (CA ID) from the CAR in order to proceed. Therefore there will always be a VariationID and/or CA ID for each variant. @selinad is that right? Is registration with the CAR a requirement?
@wrightmw Great to know that. I think it's time to edit the variant schema now. Ming and Jimmy will need it and test data very soon.
@wrightmw Does ticket #668 (Create ClinVar modal for the Variant Selection page) include work to edit the variant schema? If yes, I am going to assign it to myself. If not, I am going to create a new ticket.
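If that understanding holds, the rule "every variant has a VariationID and/or a CA ID" could be enforced on a variant record roughly as follows. This is only a sketch: the class name, the field names and the example CA ID are hypothetical and are not the actual variant schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantRef:
    """Hypothetical variant identifier pair, not the real schema."""
    clinvar_variation_id: Optional[str] = None
    car_id: Optional[str] = None

    def __post_init__(self):
        # At least one of the two identifiers must be present.
        if not (self.clinvar_variation_id or self.car_id):
            raise ValueError("variant needs a ClinVar VariationID and/or a CA ID")

    @property
    def sources(self):
        """List which source(s) this variant is known in."""
        found = []
        if self.clinvar_variation_id:
            found.append("ClinVar")
        if self.car_id:
            found.append("CAR")
        return found


print(VariantRef(clinvar_variation_id="9945").sources)
print(VariantRef(car_id="CA-EXAMPLE").sources)  # made-up CA ID
```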
@kgliu0101 Ticket #668 is just for creating the ClinVar modal for the Variant Selection page. I think work to edit the underlying variant schema should go in a separate ticket.
Thanks Matt.
This ticket is for integrating ClinVar data for the data portal. A separate ticket has now been created for accessing ClinVar data via an API for the variant curation interface: ticket #673
@wrightmw Thanks a lot for pointing that out.
Hi @wrightmw just looking through this and see I missed a question.
Yes, here is the order:
Closed. This was implemented in the first VCI release.
ClinVar - NCBI database containing the relationships between human variations and phenotypes http://www.ncbi.nlm.nih.gov/clinvar/
Download sources:
Complete public data set (XML): (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/)
Partial set of short variants (VCF): (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf/)
Complete set of summaries about variants or genes (TSV): (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/)
Complete set of disease names and gene-disease relationships (TSV): (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/)… NB: this specific set is updated daily
ClinVar data can be accessed via E-utilities (XML/JSON)
Required fields in XML files (our name / XML field name / example):
ClinVar VariationID / VariationID / e.g. 55629
ClinVar Preferred Name / VariationName / e.g. NM_007294.3(BRCA1):c.5559C>A (p.Tyr1853Ter)
Clinical significance / ClinicalSignificance
Description of variant / Description / e.g. pathogenic
GeneList:
Entrez Gene ID / GeneID / e.g. 672
HGNC Approved Gene Symbol / Symbol / e.g. BRCA1
HGNC gene name / FullName / e.g. breast cancer 1
HGNC ID / HGNCID / e.g. HGNC:1100
Strand / strand / e.g. -
OMIM ID / OMIM / e.g. 113705
Allele:
ClinVar Allele ID / AlleleID / e.g. 24984
Variant Type / VariantType / e.g. single nucleotide variant
Cytogenetic Location / CytogeneticLocation / e.g. Xq21.1
Genome assembly / Assembly / e.g. GRCh38
Genomic RefSeq Accession / Accession / e.g. NC_000023.11
Genomic start position / start / e.g. 78124992
Genomic stop position / stop / e.g. 78124992
Length of variant / variantLength / e.g. 1
Reference Allele / referenceAllele / e.g. C
Alternate Allele / alternateAllele / e.g. A
Amino acid change / ProteinChange / e.g. T352N
HGVSList:
HGVS Name / Version / e.g. NG_008862.1:g.25824C>A
XRefList:
OMIM ID / XRef Type="Allelic Variant" DB="OMIM" / e.g. 311800.0004
dbSNP ID / XRef Type="rs" DB="dbSNP" / e.g. 137852530
MolecularConsequenceList:
Molecular Consequence for a specific HGVS term / HGVS and SOid and Function / e.g. HGVS="NM_000291.3:c.1055C>A" SOid="SO:0001583" Function="missense variant"
ObservationList:
RCV Title and accession / RCV Title / e.g. "NM_000291.3(PGK1):c.1055C>A (p.Thr352Asn) AND Phosphoglycerate kinase electrophoretic variant PGK II">RCV000010623
Review Status / ReviewStatus / e.g. no assertion criteria provided
Date Clinical Significance Evaluated / DateLastEvaluated / e.g. 2012-04-12
PhenotypeList:
Phenotype Name / Phenotype Name / e.g. Phosphoglycerate kinase electrophoretic variant PGK II
MedGen ID / DB=MedGen XRef ID / e.g. CN069394
Submitter / SubmitterName / e.g. OMIM
Submitter type / ReviewStatus / e.g. by single submitter, by lab, etc…
Submitter's organization ID / OrgID / e.g. 3
Date last submitted / DateLastSubmitted / e.g. 2010-12-30
SCV accession / ClinVarAccession Acc / e.g. SCV000030849
SCV version number / Version / e.g. 1
CitationList:
PubMed ID / ID Source="PubMed" / e.g. 5581984
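One way to use the required-field list is as a lookup from XML field name to our display name, together with a check for fields that were not retrieved. The sketch below covers only a handful of the fields; the dictionary and function names are hypothetical.

```python
# Subset of the required fields: XML field name -> our name.
FIELD_MAP = {
    "VariationID": "ClinVar VariationID",
    "VariationName": "ClinVar Preferred Name",
    "GeneID": "Entrez Gene ID",
    "Symbol": "HGNC Approved Gene Symbol",
    "HGNCID": "HGNC ID",
    "VariantType": "Variant Type",
}


def to_display_names(raw):
    """Rename known XML fields to our names; unknown keys are dropped."""
    return {FIELD_MAP[key]: value for key, value in raw.items() if key in FIELD_MAP}


def missing_fields(raw):
    """Return the required XML field names that are absent from a record."""
    return sorted(name for name in FIELD_MAP if name not in raw)


raw_record = {"VariationName": "NM_007294.3(BRCA1):c.5559C>A (p.Tyr1853Ter)",
              "GeneID": "672", "Symbol": "BRCA1"}
print(to_display_names(raw_record))
print(missing_fields(raw_record))
```

A check like `missing_fields` would flag an omission such as the absent VariationID discussed in the comments.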
Usage policy: Open source and open access
Update frequency/Release cycle: Updated first Thursday of the month
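For scheduling a refresh against that release cycle, the date of the first Thursday of a given month can be computed as below. This is a small sketch; the function name is made up, and it assumes the release really lands on that day.

```python
import datetime


def first_thursday(year, month):
    """Return the date of the first Thursday of the given month."""
    first = datetime.date(year, month, 1)
    # date.weekday(): Monday is 0, so Thursday is 3.
    offset = (3 - first.weekday()) % 7
    return first + datetime.timedelta(days=offset)


print(first_thursday(2016, 1))
```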