ga4gh / va-spec

An information model for representing variant annotations.

If/how to include ‘ancillary results’ and ‘quality measures’ modeling patterns in the core IM #144

Open mbrush opened 3 months ago

mbrush commented 3 months ago

If I understand correctly, the ancillary results and quality measures modeling patterns provide empty buckets in which implementations can create properties to capture project-specific types of data and quality measures in a Study Result that don't belong in the standard profile. More specifically, they are simply properties that take an object of an unspecified type with additional properties allowed, e.g. from the ga4gh standard caf profile:

[image: excerpt from the ga4gh standard caf profile]

A project-specific implementation profile/schema may define specific properties within these untyped objects to capture ancillary results or quality measures that are specific to the project. These properties are not generally useful enough, however, to be considered for inclusion in a ga4gh 'standard' profile (e.g. the caf-profile). Is this right?
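As a rough illustration of the "empty bucket" pattern described above, a schema fragment along these lines would accept any object under each property. This is a sketch only: the property names `ancillaryResults` and `qualityMeasures` and the surrounding layout are assumptions, not a copy of the published caf profile.

```yaml
# Hypothetical fragment; names and layout are assumptions for illustration.
properties:
  ancillaryResults:
    type: object                  # no fixed shape is imposed
    additionalProperties: true    # implementations may add any keys
  qualityMeasures:
    type: object
    additionalProperties: true
```

A project-specific profile could then narrow either bucket by listing its own named properties inside it.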

A few thoughts/questions:

larrybabb commented 3 months ago
  • It strikes me that these offer an alternative to using the Extension mechanism that is built into the va model. The rationale/value add here is that defining these specific properties (ancillary results and quality measures) gives a bit more semantics/guidance about what types of extended content may be collected here. But the Extension mechanism could be used by an implementation profile here to achieve essentially the same end. Is this right?

I will defer to @ahwagner on this for the final word, but I think these two specific data structures may exceed what I (we) envisioned for the extensions model. While it is true that anything can go under an extension, some items are so essential to the use cases of some implementations that the value of creating first-class attributes off of the standard profile outweighs the unnecessary complexity of adding these special uses under extensions.

I understand that this is all non-standard, but then again, the profile is meant to be based on the standard, and if implementers want to make additions in any way they see fit, then so be it. The gnomAD cohort allele frequency data needed these ancillary results for our immediate use. And, yes, we could have gone back to the standard's drawing board to attempt to model this in a more democratized way, but we just didn't have the bandwidth, time and resources to do that.

  • One question to address here is if/how/where we want to include this modeling pattern in the v1 va-spec release (i.e. at what level do we define the ancillary results and quality measures properties?).

    • Should we add these as properties of the core-im StudyResult class – so that profiles like the caf can use/extend/inherit from these core im properties?
    • Or should these stay out of the core-im, and be defined only in the caf profile and other StudyResult profiles that may come up?

Again, deferring to @ahwagner for the final word. I'm fine with moving ancillary results and quality measures to the gnomAD-specific profile of caf. If Alex disagrees, then, yes, we should add these to the standard profile (when and if we have the time to do it).

ahwagner commented 3 months ago

These properties are not generally useful enough, however, to be considered for inclusion in a ga4gh 'standard' profile (e.g. the caf-profile). Is this right?

No, this is not about importance. These are important properties. Arguably, the content of these fields is at least as important as the CAF result itself. The problem is that there is no consensus across resources on what types of quality measures or ancillary results should be used. We are starting with an open approach, and (down the road) can add in common quality measures or ancillary results as they are identified across resources.

The rationale/value add here is that defining these specific properties (ancillary results and quality measures) gives a bit more semantics/guidance about what types of extended content may be collected here. But the Extension mechanism could be used by an implementation profile here to achieve essentially the same end. Is this right?

I agree with the first half of this statement: these properties provide the semantics of quality measures and ancillary results. I disagree that this achieves essentially the same end as use of Extensions on the parent object.
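To make the contrast concrete, here is a hedged sketch of the same two values carried both ways. The `name`/`value` shape of an Extension entry and the `ancillaryResults` property name are assumptions for illustration; the values are taken from the example record later in this thread.

```yaml
# Option A: generic Extension entries on the parent object.
# The field name alone tells a consumer nothing about what kind of content this is.
extensions:
  - name: homozygotes
    value: 0
  - name: hemizygotes
    value: 0
---
# Option B: a dedicated property whose name states that these are ancillary results.
ancillaryResults:
  homozygotes: 0
  hemizygotes: 0
```

In Option B the property name itself carries the semantics, while in Option A a consumer has to inspect each extension name to learn what it represents.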

I do not think these should be moved down to the gnomAD profile; they should be useful across CAF implementations and should stay with the CAF standard profile. We may consider a new parent class that includes these, as I expect this pattern will be useful for other evidence types.

larrybabb commented 2 months ago

I left these two attributes in the standard profile for CohortAlleleFrequencyStudyResult based on the discussion above. We can modify this further as we learn more about other CAF-like sources that may disagree or want to enhance this model.

@mbrush Let me know if you think this issue is worth continuing to discuss and make changes. In the spirit of finding a good compromise so that we can move forward we may want to close this out and revisit when/if another CAF-like implementation arises.

mbrush commented 2 months ago

To be clear, and as indicated in the title and description of this issue, the question is about whether we want to add these ‘ancillary results’ and ‘quality measures’ modeling patterns to the core-im, because they may be generally useful in other types of profiles. The issue was NOT about whether they should remain in the initial version of the CAF Standard Profile. It seems from the comments above that there may have been a misunderstanding about this.

This is probably an issue of lower priority relative to others, so let's table it for now.

mbrush commented 1 week ago

Seeing that this issue is relevant to a question raised by @Mrinal-Thomas-Epic for the Connect Implementation Warrior session, I will add one more thought here.

I wanted to note that the Core-IM StudyResult.dataItem property is unique in that there is an allowance to specialize it into more than one new named property in a StudyResult profile. Conceptually, this attribute is specialized in the CAFStudyResult profile into the attributes focusAlleleCount, locusAlleleCount, and alleleFrequency. (Note that the StudyResult.dataItems property is commented out for now to avoid its unwanted inheritance in StudyResult profiles that import the core-im. We will reinstate this property once we determine how the modeling framework and tools can support formal specification of the conceptual specialization that is happening here.)
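A minimal sketch of that conceptual specialization, with the caveat that the property types shown are assumptions and that this is not the published profile:

```yaml
# Conceptual sketch only.
# Core IM: a single generic data item property (currently commented out
# in the source, as noted above):
#   StudyResult.dataItems
# CAF profile: the data items surface as three named properties instead.
CohortAlleleFrequencyStudyResult:
  type: object
  properties:
    focusAlleleCount:
      type: integer   # assumed type
    locusAlleleCount:
      type: integer   # assumed type
    alleleFrequency:
      type: number    # assumed type
```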

The ancillary results and quality measures that are captured by the attributes in question (e.g. grpMaxFAF95 and homozygotes) are simply additional 'data items' that are included in the StudyResult object, but ones that are uniquely required by a particular implementation and thus not reflected in the data item attributes in the CAF standard profile.

What this means is that implementation CAF StudyResult models like the one for gnomad gk-pilot can just go ahead and create new named properties in their implementation schema for these ancillary or quality measure data items directly, at the same level / alongside the specific data item attributes defined in the schema. These are just additional 'specializations' of the core-im 'dataItem' attribute. No need to bucket them in nested structures.

The gk pilot schema for this would simply import the standard CAFStudyResult profile, and add a few more attributes to this class. Something like:


$schema: "https://json-schema.org/draft/2020-12/schema"
$id: "https://w3id.org/ga4gh/schema/gk-pilot/1.x/gnomad/gnomad-caf-source.yaml"
title: gnomAD Cohort Allele Frequency Study Result profile
type: object

imports:
  va.caf-profile: ../profiles/caf-study-result-source.yaml

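$defs:
  # Sketch only: in a complete schema these properties would sit inside a
  # class definition that extends the imported CAF study result class.
  properties:
    grpMaxFAF95:
      # GrpMaxFAF95 is assumed to be defined elsewhere under $defs.
      $ref: "#/$defs/GrpMaxFAF95"
    homozygotes:
      type: integer
    hemizygotes:
      type: integer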

And gk pilot CAF data would look something like:


- id: gnomad4:1-10120-T-G.sas
  type: CohortAlleleFrequencyStudyResult
  label: South Asian Ancestry Group Allele Frequency for 1-10120-T-G
  sourceDataSet:
    - id: gnomad4.1.0
      type: DataSet
      label: gnomAD v4.1.0
      version: 4.1.0
  focusAllele: ga4gh:VA.XTZZHRCS2lIZuST7_LnLSHwro5uYYtVF
  focusAlleleCount: 0
  locusAlleleCount: 716
  alleleFrequency: 0
  grpMaxFAF95: {}
  homozygotes: 0
  hemizygotes: 0

Here, the implementation-specific attributes sit at the same level as the ones from the standard CAF profile, but this is allowed by the va-spec because, again, conceptually they are just additional specializations of the StudyResult.dataItems attribute.

That's it . . . just wanted to put this out there as an option, not to say it is better or worse than other approaches.