Clinical Sequencing Nomenclature (CSN)

munzmarci commented 9 years ago

This issue has been created as a discussion forum for the proposed Clinical Sequencing Nomenclature (CSN), a standardized, versioned nomenclature for reporting clinical sequence data.

CSN would provide a standardized system in which each DNA sequence variant has a single notation, allowing integration of sequencing data from multiple sources and facilitating more accurate clinical interpretation of genomic information. The CSN of every DNA sequence variant would also be identical for pre-NGS and NGS mutation detection methods and would follow the principles of the existing HGVS nomenclature, with minor amendments to ensure compatibility and integration of historical clinical sequence data.

CSN would use a logical terminology understandable to non-experts allowing easy visual discrimination between the major classes of variant in clinical genomics.

dglazer commented 9 years ago

Is there a specific proposal you'd like the group to discuss, either for what CSN is, or for how the GA4GH should work with it? (I found some links on the web, but am not sure they're the right ones.) The goals of CSN listed here sound like truth and apple pie; I assume the details are messier.

deannachurch commented 9 years ago

I think the goals of CSN are great, but I think some of the implementation details are difficult.

Table 1. The variant classification system: This table describes a 'class' (that is, a short abbreviation to describe a variant) and the description of the class. I think this table should be rethought a bit. First, I think it would be better to use existing ontologies than create a new set of abbreviations and descriptions. If existing ontology schemes don't work, then let's work with them to fix the problem rather than develop something new. Additionally, I think this table confounds a couple of concepts:
- Where is this variant in the gene? (e.g. 5PU- any variant in 5' untranslated region)
- What does this variant do? (e.g. SG variant caused by base substitution)

I think it would be better to have these as separate concepts.

Table 2. CSN vs. current nomenclature: I love the idea of putting some standards on HGVS, but I'm concerned about concatenating the c. and p. as CSN does (c.1040A>G_p.Gln347Arg). First: '_' is the same delimiter used if there is an indel (e.g. 15:g.89149778_89149779delTTinsAGGACTTG).

Second: the unique sequence identifier the c. (or p.) is referring to needs to be referenced explicitly. At its heart, HGVS is a lossy variant representation: even given a unique sequence identifier, you may not be able to unambiguously determine the genomic position for the variant, but you certainly have a better chance!
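The delimiter concern can be illustrated with a minimal Python sketch. `split_csn` is a hypothetical helper written for this example; it is not part of CSN or of any HGVS library, and it assumes the c. and p. parts are joined by an underscore as in the example above.

```python
from typing import Optional, Tuple


def split_csn(csn: str) -> Tuple[str, Optional[str]]:
    """Split a CSN-style 'c.X_p.Y' string into its c. and p. parts.

    Hypothetical helper: it splits only where '_' is directly followed
    by 'p.', so a range underscore (followed by a digit) is left alone.
    """
    c_part, sep, p_part = csn.partition("_p.")
    if not sep:
        return csn, None  # no protein part present
    return c_part, "p." + p_part


# The concatenated form from the comment above splits cleanly.
print(split_csn("c.1040A>G_p.Gln347Arg"))

# A naive split on '_' also cuts the indel range in two, which is the
# ambiguity being described.
print("15:g.89149778_89149779delTTinsAGGACTTG".split("_"))
print(split_csn("15:g.89149778_89149779delTTinsAGGACTTG"))
```

Note that RefSeq accessions such as NM_000518.4 contain an underscore as well, so a parser has to rely on context rather than on the delimiter alone once an explicit sequence identifier is added.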

Just a couple of quick thoughts.

pgrosu commented 9 years ago

@deannachurch I am not seeing any tables that you might be referring to. Regarding CSN, that can probably just be a plugin to our system to transform and output the relevant data from our implementation as CSN, similar to something like this:

https://github.com/ensembl-variation/VEP_plugins/blob/master/CSN.pm

~p

lh3 commented 9 years ago

For reference.

pgrosu commented 9 years ago

Ah, from the paper, thanks Heng :)

pcingola commented 9 years ago

I think the main point is to use CSN only as a stricter subset of HGVS (i.e. ignore the variant "class" since much better options do exist and we already agreed to use ontologies).

As we mentioned in the call, other problems do exist: i) Using the '=' sign to denote synonymous variants conflicts with the VCF spec. ii) The transcript version is not identified in 'p.' notation.

Nevertheless, these problems seem to be easy to fix (either by amendments to CSN or by using CSN as a starting point to create a new 'HGVS subset'). These amendments would include: i) Replace '=' by something else (for instance, we could repeat the same amino acid to indicate synonymous changes, e.g. 'p.Trp34Trp'). ii) Separate 'c.' and 'p.' by a comma instead of an underscore. iii) Use the transcript version (including sub-version) in 'p.' notation.
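A rough Python sketch of what the three amendments could look like together. The function names and the exact output layout are assumptions made for illustration; neither CSN nor HGVS defines them. The check reflects the VCF rule that INFO values may not contain whitespace, semicolons, or equals signs.

```python
# Characters the VCF spec does not allow inside an INFO value.
FORBIDDEN_IN_INFO_VALUE = set("=; \t")


def protein_change(ref_aa: str, pos: int, alt_aa: str) -> str:
    """Amendment i: spell out both residues, so a synonymous change is
    written as e.g. 'p.Trp34Trp' rather than with '='."""
    return f"p.{ref_aa}{pos}{alt_aa}"


def amended_pair(transcript: str, c_change: str,
                 ref_aa: str, pos: int, alt_aa: str) -> str:
    """Amendments ii and iii: comma-separate c. and p., and prefix both
    with the versioned transcript (hypothetical layout)."""
    value = (f"{transcript}:{c_change},"
             f"{transcript}:{protein_change(ref_aa, pos, alt_aa)}")
    bad = FORBIDDEN_IN_INFO_VALUE & set(value)
    if bad:
        raise ValueError(f"not usable as a VCF INFO value: {sorted(bad)}")
    return value


print(protein_change("Trp", 34, "Trp"))
print(amended_pair("NM_000518.4", "c.20A>T", "Glu", 7, "Val"))
```

Because a comma is also the list separator inside a VCF INFO value, the pair would read as a two-element list there.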

(I'm probably forgetting to add other items in this list).

deannachurch commented 9 years ago

Having a simpler subset of HGVS is great. However, as far as the annotation API goes, I think we want the data elements to be explicit- they can be filtered and simplified for display at the client end- otherwise we run the risk of serving data without really understanding what it is. To this end, I think we should:

- c. and p. representations should be separate entities (not concatenated with any delimiter)
  - each should have an explicit sequence identifier (either RefSeq accession.version or Ensembl ID and annotation release number)

  Keeping these separate is clearer and better supports non-coding variants.
- We should explicitly store three distinct values with respect to providing information about the variant and transcript (using existing ontologies)
  - location-specific information (intronic, promoter, UTR, intergenic, etc.)
  - effect (missense, nonsense, frameshift, etc.)
  - mutational consequence (LOF, gain of function, etc.)

Keeping these separate is also clearer (these things are harder to calculate as you go down the list) and makes explicit what information you are actually conveying.

Ideally, we should also store an annotation on the transcript denoting any that are annotated as 'clinical' and the source of that annotation (HGMD, ClinVar, LRG, etc). This allows us to support current requirements (where there is not always universal consensus on what the reference transcript should be, but everyone wants a single one) and hopefully future requirements (where we drop this idea or actually decide on a consensus).
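The explicit data elements listed above could be sketched as a record like the following. This is only an illustration in Python: the class and field names are hypothetical and are not taken from the GA4GH schemas, and the ontology values are shown as plain strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HgvsExpression:
    sequence_id: str  # explicit accession.version, e.g. "NM_000518.4"
    expression: str   # e.g. "c.20A>T" or "p.Glu7Val"


@dataclass
class TranscriptAnnotation:
    # c. and p. are separate entities, never concatenated.
    coding: Optional[HgvsExpression]
    protein: Optional[HgvsExpression]
    # Three distinct values, ideally ontology terms, ordered from
    # easiest to hardest to compute.
    location: str                        # e.g. intronic, UTR, exonic
    effect: Optional[str] = None         # e.g. missense, frameshift
    consequence: Optional[str] = None    # e.g. LOF, gain of function
    # Sources that flag this transcript as 'clinical' (HGMD, ClinVar, LRG).
    clinical_sources: List[str] = field(default_factory=list)


# Example values follow ClinVar's preferred name for the HBB variant
# discussed in this thread, which reports p. alongside the transcript;
# a protein accession could be used for the p. part instead.
record = TranscriptAnnotation(
    coding=HgvsExpression("NM_000518.4", "c.20A>T"),
    protein=HgvsExpression("NM_000518.4", "p.Glu7Val"),
    location="exonic",
    effect="missense",
    clinical_sources=["ClinVar"],
)
print(record)
```

A client can still build a compact display string from such a record, while the server keeps each element explicit.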

pcingola commented 9 years ago

c. and p. representations should be separate entities (not concatenated with any delimiter)

Sorry, I was thinking about INFO fields in VCF files when I suggested the comma. Of course in our schemas they should be treated as separate entities.

awz commented 9 years ago

Happy to see c. and p. separated.

Can CSN handle very famous variants such as the causative mutation of sickle cell anemia (when homozygous) - HBB E6V?

http://ghr.nlm.nih.gov/gene/HBB

"Specifically, the amino acid glutamic acid is replaced with the amino acid valine at position 6 in beta-globin, written as Glu6Val or E6V."

versus

http://www.ncbi.nlm.nih.gov/clinvar/RCV000016573 http://www.uniprot.org/uniprot/P68871

"NM_000518.4(HBB):c.20A>T (p.Glu7Val)"

The preferred name in ClinVar. N.B. that prior to cleavage of the initiator methionine HBB E6V would actually be HBB E7V.
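The numbering difference can be shown with a small Python sketch. `legacy_to_hgvs_protein` is a hypothetical helper, and the default offset of 1 assumes that only the initiator methionine is missing from the legacy numbering; proteins with longer cleaved segments would need a different offset.

```python
import re

# Standard one-letter to three-letter amino acid codes.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def legacy_to_hgvs_protein(legacy: str, offset: int = 1) -> str:
    """Convert a legacy substitution such as 'E6V' (numbered on the
    mature protein) to an HGVS-style p. string numbered from the
    initiator methionine. Hypothetical helper for illustration."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", legacy)
    if not match:
        raise ValueError(f"not a simple substitution: {legacy!r}")
    ref, pos, alt = match.groups()
    if ref not in ONE_TO_THREE or alt not in ONE_TO_THREE:
        raise ValueError(f"unknown amino acid code in {legacy!r}")
    return f"p.{ONE_TO_THREE[ref]}{int(pos) + offset}{ONE_TO_THREE[alt]}"


print(legacy_to_hgvs_protein("E6V"))  # p.Glu7Val
```

The offset is the part that cannot be derived from the name alone, which is why such variants are hard to map between the two conventions.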

Should we collect a set of clinically important but hard to name variants?

fcunningham commented 9 years ago

@awz Great - we have a folder for hard to annotate variants - please do contribute clinically important but hard to name variants here: https://drive.google.com/folderview?id=0B6jIo0eTEQxrfmJPSF9hQ1pscWNJVUc5bldCVWpGOVd1QzJORklxOTJLVnE3d2pkWmt2N2c&usp=sharing

reece commented 9 years ago

@munzmarci: I'm with @dglazer: What is the specific question being asked here? Until we answer that, this thread is unlikely to facilitate a decision and action.

Some possible related questions (and related decisions) that might merit a thread:

What HGVS shortcomings prevent variant data sharing?
What human readable textual description should be used for sequence variants?
Should GA4GH support CSN?
Is CSN a viable replacement for HGVS?

@munzmarci: Would you please rephrase as a question or a proposal? Alternatively, if folks agree that there's no question here, my vote is to close this thread.

reece commented 9 years ago

@awz, @fcunningham: How we handle hard cases of variant naming seems like a worthwhile question itself. I just created #333 for that topic.

nazneenrahman commented 9 years ago

Dear all, Sorry for not replying before. We set up this thread at the request of the GA4 variant annotation group. But I agree with Reece that it is probably not quite the right forum at this point.

The underlying rationale for the CSN was to address the requirements of variant nomenclature in the clinical setting. The needs in this context naturally overlap with other contexts, but are not identical, or at least the prioritisation of requirements is different. In the clinical setting, first and foremost, one wants to ensure that clinical benefit is best served and clinical harm is avoided. The current systems of variant annotation are not satisfactorily meeting either of these needs, for various reasons. Therefore changes/modifications are required for genomic medicine to be successful. It has become a critical issue because the number and diversity of people interfacing with variants has vastly increased. Further details about the current problems and the rationale for the changes we proposed in CSN are provided in doi: http://dx.doi.org/10.1101/016808. I am happy to bore/worry people on the needs, problems and potential harms for many hours, as long as they bring gin!

I think the key decisions required at this point are strategic ones. i.e. a) whether GA4 agree there are specific issues with variant annotation in the clinical setting and b) if so, should GA4 get involved in trying to sort out that problem, which will take a dedicated focus and input from end-users in the clinical setting, which I believe are not well-represented currently.

I was planning to join the call on tues 16th to discuss.

pcingola commented 9 years ago

I agree, but I'd like to clarify that variant annotation and variant reporting (e.g. for clinical applications) are different stages in the analysis pipeline. In my opinion, HGVS, CSN and other representative conventions are used in the reporting stage (regardless of when we compute them). I know this might look like nitpicking, but the lack of a clear separation of roles can create confusion because the annotation and reporting goals are quite different. Just as an example: annotation would output a large VCF file, whereas reporting would output a single-page clinical report (or information for a scientific paper, depending on the application). The reason why all the previous comments in this thread apply is that we want to make sure we can track the reported variant back to the previous pipeline stages as well as compare it with other reported variants.

nazneenrahman commented 9 years ago

Hi Pablo, yes, I fully agree with both of those points.


skeenan commented 9 years ago

ga4gh / ga4gh-schemas

Clinical Sequencing Nomenclature (CSN) #312