Investigate how allele origin is currently processed by the pipeline

tskir commented 4 years ago

Reported by @AsierGonzalez via Slack on 2020-09-14

Hi Kirill, I have a couple of questions for you. The other day I was looking into some variant in ClinVar and I realised that the value of Origin was “Unknown”. I don’t remember what variant that was but I have easily found another example of a variant that lacks an origin. I checked if these variants end up in the evidence files but there are only germline and somatic variants in there and I think I have found where in your code this is controlled.

What values can Origin have in ClinVar?
How many entries are there whose origin is neither germline nor somatic?

Also, I believe that alleleOrigin only appears in the unique association fields section. Is this value really necessary to make the evidence strings unique? Regardless of the answer, we should include the field in the body of the evidence, because there should be no fields that exist only in unique_association_fields.

tskir commented 4 years ago

@AsierGonzalez

As far as I remember, that assert in our code never fired for any of the batches, meaning that, at least in the XML dump which we use, all records are either germline or somatic. (An alternative explanation is that some filtering is applied somewhere upstream of that assert.)

I agree that this discrepancy (given that you were able to find a variant which is neither) looks interesting. I will take a look into ClinVar data and plot distributions of different allele origin values.

Also, we could include the allele origin into the evidence body with no problem. In which section would you like to see it? variant2disease probably seems fitting

AsierGonzalez commented 4 years ago

I think that the two reasonable locations for the allele origin are variant and evidence.variant2disease. As always, it depends on whether this is a property of the variant itself or the variant in a specific disease. I have found an example where the same ClinVar entry has two different values for the origin depending on the submission (note that one of those is unknown), so probably putting it in evidence.variant2disease will be the best option.

tskir commented 4 years ago

Oh, wait a minute, a thought just occurred to me... We actually do store the allele origins in the evidence strings, just not under the same names as in ClinVar. In fact, the allele origin determines which evidence string type will be used:

germline → genetic_association
somatic → somatic_mutation

Given that fact, do we want to store the original alleleOrigin values again, given that they're going to be 100% redundant?

Having said that, variants with allele origin of "unknown" will still need to be investigated regardless

AsierGonzalez commented 4 years ago

If the information is mapped to the data source as you suggest, there is no need to include it in the body of the evidence, but it should be removed from the unique association fields.

AsierGonzalez commented 4 years ago

Another question about the allelic origin that needs answering is what happens when there are multiple submissions for the same ClinVar accession and each of them has a different Origin value. That is the case for the example shared above:

[Screenshot 2020-09-15 at 12 27 30]

Is there a ClinVar accession-level Origin in the XML that can be used directly?

AsierGonzalez commented 4 years ago

Interestingly, the information about this ClinVar accession (RCV000162096) in the evidence strings seems to come from the first submission in that table given that the allelic origin and clinical significance match its values only:

"type": "genetic_association",
"clinical_significance": "Likely pathogenic"

tskir commented 4 years ago

@AsierGonzalez

Current pipeline behaviour

Indeed, the record RCV000143247 from your example has an allele origin of “unknown” in the XML. You also correctly pinpointed the location in the code where the decision is made. At first I was surprised that “unknown” doesn't trip the if/else/assert block, but then I realised that the origins fed into it are being preprocessed here. As you can see, it goes like this:

If “somatic” is present in the list of allele origins, treat the record as “somatic”;
Else if at least one value from a fixed list (including “unknown” and “germline”, among others) is present, treat the record as “germline”;
In all other cases (either no allele origin at all, or a value which does not come from the predetermined lists above), skip the record entirely.
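To make the three rules above concrete, here is a minimal sketch of the decision logic. It is not the pipeline's actual code: the function name, the `GERMLINE_LIKE` set and the `EVIDENCE_TYPE` mapping are hypothetical names used for illustration, and the set only contains the two values confirmed in this thread ("germline" and "unknown"), whereas the real fixed list has more entries.

```python
from typing import Iterable, Optional

# Assumption: the real fixed list contains more values; only these two
# are confirmed in this thread.
GERMLINE_LIKE = {"germline", "unknown"}

# Mapping from the derived origin to the evidence string type,
# as described earlier in the thread.
EVIDENCE_TYPE = {
    "germline": "genetic_association",
    "somatic": "somatic_mutation",
}


def classify_allele_origins(origins: Iterable[str]) -> Optional[str]:
    """Collapse a record's allele origins to 'somatic', 'germline' or None."""
    values = {origin.strip().lower() for origin in origins}
    # Rule 1: any occurrence of "somatic" wins.
    if "somatic" in values:
        return "somatic"
    # Rule 2: otherwise, any value from the fixed list counts as germline.
    if values & GERMLINE_LIKE:
        return "germline"
    # Rule 3: no origin, or only unrecognised values: skip the record.
    return None
```

Under these assumptions, a record whose submissions report "germline" and "unknown" collapses to "germline", so the "unknown" value never reaches the if/else/assert block, which matches the behaviour described above.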

I think this explains the results you currently observe, including how multiple value combinations are treated. I will now look into which combinations are actually present in ClinVar and will post an update once this is ready.

tskir commented 4 years ago

Value distributions in ClinVar

I updated my script to plot allele origin distributions and processed the latest ClinVar XML dump (as of 2020-10-13). This is the result. Granted, the diagram is quite tall due to a large number of different values, but I didn't feel like implementing a separate rendering option for a simple investigation.

[Image: allele-origin]
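For anyone who wants to reproduce a tally like this without the plotting script, the counting step can be sketched roughly as follows. This is not the script mentioned above. The inline XML is a heavily simplified, hypothetical stand-in for the ClinVar dump: it assumes that each record is a `ReferenceClinVarAssertion` element and that origins appear as `Origin` elements nested somewhere inside it. The function name and the "(none)" label are also made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: a simplified layout for illustration only; real records
# carry many more elements and attributes.
SAMPLE_XML = """
<ReleaseSet>
  <ClinVarSet>
    <ReferenceClinVarAssertion>
      <ObservedIn><Sample><Origin>germline</Origin></Sample></ObservedIn>
      <ObservedIn><Sample><Origin>unknown</Origin></Sample></ObservedIn>
    </ReferenceClinVarAssertion>
  </ClinVarSet>
  <ClinVarSet>
    <ReferenceClinVarAssertion>
      <ObservedIn><Sample><Origin>somatic</Origin></Sample></ObservedIn>
    </ReferenceClinVarAssertion>
  </ClinVarSet>
  <ClinVarSet>
    <ReferenceClinVarAssertion />
  </ClinVarSet>
</ReleaseSet>
"""


def count_origin_combinations(xml_text: str) -> Counter:
    """Count how often each distinct combination of origins occurs per record."""
    root = ET.fromstring(xml_text.strip())
    counts = Counter()
    for record in root.iter("ReferenceClinVarAssertion"):
        # Deduplicate and sort so that value order does not create new keys.
        origins = sorted({el.text.strip() for el in record.iter("Origin") if el.text})
        counts[", ".join(origins) or "(none)"] += 1
    return counts


counts = count_origin_combinations(SAMPLE_XML)
```

For the full dump, a streaming parser such as `xml.etree.ElementTree.iterparse` would probably be needed instead of loading everything into memory with `fromstring`.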

tskir commented 4 years ago

Possible next steps

I am closing this issue, as the investigations have been completed. I have created a separate follow-up issue to implement the actual changes, which will be out of scope of pipeline v2.0 (and the corresponding release 20.11): https://github.com/EBIvariation/eva-opentargets/issues/163. Please do let me know (either in this issue or the new one) how you would like to proceed and which changes you would like to see implemented.

AsierGonzalez commented 4 years ago

This is interesting. It means that using the allele origin to classify the ClinVar entries as "genetic_association" or "somatic_mutation" (see comment above) is not accurate, and the evidence strings would be more complete if the real value was included in them. I will suggest changes in the new ticket.
