ga4gh / va-spec

An information model for representing variant annotations.

StudyResult.sourceDataSet extends InformationEntity.derivedFrom...I have a question about that. #159

Open larrybabb opened 2 weeks ago

larrybabb commented 2 weeks ago

@mbrush InformationEntity.derivedFrom is an unordered array of InformationEntity objects. However, StudyResult.sourceDataset appears to be designed to override the array nature of its parent derivedFrom property and make it a single DataSet (not an array of DataSets).
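To make the mismatch concrete, here is a minimal sketch in Python. The two dictionaries are hypothetical, heavily simplified stand-ins for the schema fragments (they are not copied from va-spec), and `shape` and `extension_keeps_shape` are illustrative helpers, not part of any tooling.

```python
# Hypothetical, simplified schema fragments; not copied from va-spec.
information_entity_derived_from = {
    "type": "array",
    "items": {"$ref": "InformationEntity"},
}

# A single-valued sourceDataset that claims to extend derivedFrom.
study_result_source_dataset = {
    "extends": "derivedFrom",
    "$ref": "DataSet",
}


def shape(prop: dict) -> str:
    """Return 'array' if the fragment is array-valued, else 'single'."""
    return "array" if prop.get("type") == "array" else "single"


def extension_keeps_shape(parent: dict, child: dict) -> bool:
    """An extending property is expected to keep its parent's cardinality."""
    return shape(parent) == shape(child)


print(extension_keeps_shape(information_entity_derived_from,
                            study_result_source_dataset))  # False
```

The check fails because the child narrows both the item type (fine) and the cardinality (the part in question).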

I have referenced this here in the CohortAlleleFrequencyStudyResult schema (which is a direct copy of the sourceDataSet from the StudyResult).

While I get the idea of using derivedFrom to represent the dataset from which the StudyResult was obtained, I think we need to weigh whether

  1. sourceDataset should be some type of RecordMetadata type
  2. InformationEntity.derivedFrom should NOT be an array but rather a single source (which in turn would have its own derivedFrom)
  3. sourceDataset should NOT be extending derivedFrom to begin with

I'm in favor of option 2.

For now, I am going to make StudyResult.sourceDataset an array of DataSet types and assume folks will only put 1 entry in the array. But this is not a reasonable final solution IMO.
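A rough sketch of that interim shape, written as Python dataclasses purely for illustration. The class and attribute names mirror the terms in this thread, but the fields, the `has_single_source` helper, and the type check are assumptions, not the actual va-spec definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InformationEntity:
    id: str
    # Unordered collection of direct sources, as described for derivedFrom.
    derived_from: List["InformationEntity"] = field(default_factory=list)


@dataclass
class DataSet(InformationEntity):
    """Simplified stand-in for the DataSet type."""


@dataclass
class StudyResult(InformationEntity):
    """sourceDataset modelled as derivedFrom narrowed to DataSet items."""

    def __post_init__(self) -> None:
        # Keep the array, but restrict what may go into it.
        for source in self.derived_from:
            if not isinstance(source, DataSet):
                raise TypeError(f"{source.id!r} is not a DataSet")

    @property
    def source_dataset(self) -> List[DataSet]:
        return list(self.derived_from)

    def has_single_source(self) -> bool:
        # The interim assumption: callers put exactly one entry here.
        return len(self.derived_from) == 1


result = StudyResult(id="caf-1", derived_from=[DataSet(id="dataset-1")])
print(result.has_single_source())  # True
```

Nothing in the type itself enforces the one-entry assumption; it only lives in the helper, which is why this reads as a stopgap.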

mbrush commented 2 weeks ago

hmm - I actually favor the solution you implemented (make StudyResult.sourceDatasets an array of DataSets . . . and if everything in the StudyResult was derived from a single DataSet, then there will be only one member of this array).

I don't think your solution 2 above is right, because the use case for allowing multiple values here isn't to track a linear trail of 1:1 derivations, as your comments imply. The idea here is that an InformationEntity can be derived from multiple direct 'source' InformationEntities. For example, a CAF StudyResult may include data about its focusAllele that was pulled from two distinct DataSets produced by a given study.
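The difference between multiple direct sources and a linear trail can be sketched with a small derivation graph. Everything below is illustrative: the identifiers are made up, and `upstream_x` is a hypothetical extra entity added only to show that a longer trail is still reachable through each source's own derivedFrom.

```python
from typing import Dict, List

# Hypothetical derivation graph: entity id -> ids of its direct sources.
derived_from: Dict[str, List[str]] = {
    "caf_study_result": ["dataset_a", "dataset_b"],  # two direct sources
    "dataset_a": ["upstream_x"],                     # hypothetical upstream entity
    "dataset_b": [],
    "upstream_x": [],
}


def direct_sources(entity_id: str) -> List[str]:
    """Sources the entity was derived from directly (one hop)."""
    return list(derived_from.get(entity_id, []))


def all_sources(entity_id: str) -> List[str]:
    """Every source reachable by following derivedFrom repeatedly."""
    seen: List[str] = []
    stack = direct_sources(entity_id)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(direct_sources(current))
    return sorted(seen)


print(direct_sources("caf_study_result"))  # ['dataset_a', 'dataset_b']
print(all_sources("caf_study_result"))     # ['dataset_a', 'dataset_b', 'upstream_x']
```

A single-valued derivedFrom could express the `dataset_a` to `upstream_x` trail, but not the fact that the StudyResult draws on both datasets at once.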

I don't think your solution 1 is right because the sourceDataSet property is about the derivation of information content found in a StudyResult, not about specific concrete serializations of the StudyResult (which is what the RecordMetadata object is for).