Unclear distinction between ExperimentalModel and ExperimentalCohort

hansenp commented 1 year ago

The documentation states that the ExperimentalModel can be used for both individuals and cohorts.

https://thejacksonlaboratory.github.io/ExperimentalModelSchema/experimental_model.html

On the next page, the ExperimentalCohort is introduced.

https://thejacksonlaboratory.github.io/ExperimentalModelSchema/experimental_cohort.html

There it says that this element is intended for the case where there is no data for individuals, but only for a cohort.

Which element/message should be used to represent cohorts, ExperimentalModel or ExperimentalCohort?

sbello commented 7 months ago

@hansenp I am unable to view any documentation using the links in this ticket. The links go to an index page, and the links on that page, at least those I tried, all give 404 errors.

hansenp commented 7 months ago

Sorry, this is because the documentation was migrated from ReadTheDocs to MkDocs after this issue was created. The relevant links are now as follows:

ExperimentalModel: https://thejacksonlaboratory.github.io/ExperimentalModelSchema/ems/experimental_model/

ExperimentalCohort: https://thejacksonlaboratory.github.io/ExperimentalModelSchema/ems/experimental_cohort/

sbello commented 7 months ago

@hansenp I agree it is unclear what is meant by experimental model vs cohort.

Is cohort meant to be the collection of all animals in an experiment? Or is this meant to be used for pooled data? I don't know that this distinction is really needed based on the description of the purpose of the model. Possibly, instead of the experimental cohort, there should be a model for 'Cohort' analogous to the 'Animal' model. The cohort would then represent a pooled set of animals and could reference the IDs of the 'Animal' records in the cohort. Or, if the source was pooled before entering the system, you might need to accommodate cohorts that have a description of the strain background, sex, and number of animals in the pool without having Animal IDs.

I also find the names for these confusing. From the experimental model document, it appears to be a collection of the data annotated to an animal or group of animals. I'm used to thinking of models in terms of the organism; I would have called this ExperimentalAnnotation.

While the example shows just phenotype annotations, the description states "all the data about an individual animal model or about cohorts that are represented as single observations". Based on the experience of developing models for data at the Alliance, I am concerned about trying to fit all data types into a single model. The broad range of metadata that can be attached to different types of data is going to make this model very complex. It may be better to break the annotation models down by data type so that the metadata for each annotation type can be limited to what is relevant.

hansenp commented 7 months ago

First of all, I would like to say that the ExperimentalModelSchema, as it currently stands on GitHub, is just a draft that resulted from meetings between Peter Robinson and the MPD people. I created issues like this one to point out specific gaps and inconsistencies so that we can discuss them and improve the model. I will address your questions below.

> Is cohort meant to be the collection of all animals in an experiment? Or is this meant to be used for pooled data? I don't know that this distinction is really needed based on the description of the purpose of the model. Possibly, instead of the experimental cohort, there should be a model for 'Cohort' analogous to the 'Animal' model. The cohort would then represent a pooled set of animals and could reference the IDs of the 'Animal' records in the cohort. Or, if the source was pooled before entering the system, you might need to accommodate cohorts that have a description of the strain background, sex, and number of animals in the pool without having Animal IDs.

If I understand correctly, you are describing two cases that a model or class (I use the term class in the following) “Cohort” should be able to represent:

Cohort-1: A cohort consists of an arbitrary set of animals (typically from an experiment), each represented by a reference ID.
Cohort-2: Data for individual animals are not available and the cohort consists only of metadata, and possibly averaged measurements, associated phenotypic information, etc.

It should always be possible to derive an instance of Cohort-2 from an instance of Cohort-1. To do this, we would have to define how we want to aggregate the attributes of individual animals: for example, by averaging individual measured values and by collecting lists of strain IDs or phenotype ontology terms. It would then be enough for Cohort-2 to have one additional field containing a list of animal reference IDs. If data for individual animals are not available, metadata and aggregated data could be written directly into the appropriate fields, and the list of animal reference IDs would remain empty. Otherwise, the aggregated data could be computed from the individual animals and written into the appropriate fields.
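To make the idea concrete, here is a minimal Python sketch of a single cohort class that covers both cases. It is only an illustration under stated assumptions, not part of the current draft schema: the class names `Animal` and `Cohort`, all field names, the example measurement `body_weight_g`, and the helper `cohort_from_animals` are hypothetical, and the aggregation rules (mean for measurements, sorted unique lists for IDs and terms) are just one possible choice.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class Animal:
    """Hypothetical record for one individual animal."""
    animal_id: str
    strain_id: str
    sex: str
    body_weight_g: float  # example measurement, chosen only for illustration
    phenotype_terms: list[str] = field(default_factory=list)


@dataclass
class Cohort:
    """Hypothetical cohort record holding metadata and aggregated data.

    animal_ids stays empty when only pooled data are available (Cohort-2);
    it is filled when the cohort was derived from individuals (Cohort-1).
    """
    cohort_id: str
    strain_ids: list[str]
    sexes: list[str]
    n_animals: int
    mean_body_weight_g: Optional[float]
    phenotype_terms: list[str]
    animal_ids: list[str] = field(default_factory=list)


def cohort_from_animals(cohort_id: str, animals: list[Animal]) -> Cohort:
    """Aggregate individual animals into a cohort (assumed aggregation rules)."""
    if not animals:
        raise ValueError("cannot derive a cohort from an empty list of animals")
    return Cohort(
        cohort_id=cohort_id,
        strain_ids=sorted({a.strain_id for a in animals}),
        sexes=sorted({a.sex for a in animals}),
        n_animals=len(animals),
        mean_body_weight_g=mean(a.body_weight_g for a in animals),
        phenotype_terms=sorted({t for a in animals for t in a.phenotype_terms}),
        animal_ids=[a.animal_id for a in animals],
    )


# Cohort derived from individuals; all IDs below are placeholders.
derived = cohort_from_animals(
    "C1",
    [
        Animal("A1", "STRAIN:1", "F", 20.0, ["TERM:1"]),
        Animal("A2", "STRAIN:1", "M", 24.0, ["TERM:1", "TERM:2"]),
    ],
)

# Cohort pooled at the source: fields are written directly, no animal IDs.
pooled = Cohort("C2", ["STRAIN:1"], ["F"], 10, None, ["TERM:1"])
```

With this layout, consumers only ever need to read one cohort type, and an empty `animal_ids` list signals that the values were supplied as pooled data rather than computed from individuals.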

> I also find the names for these confusing. From the experimental model document, it appears to be a collection of the data annotated to an animal or group of animals. I'm used to thinking of models in terms of the organism; I would have called this ExperimentalAnnotation.

I agree. My suggestion: A class Animal for individual animals and a class Cohort for groups of animals.

> While the example shows just phenotype annotations, the description states "all the data about an individual animal model or about cohorts that are represented as single observations". Based on the experience of developing models for data at the Alliance, I am concerned about trying to fit all data types into a single model. The broad range of metadata that can be attached to different types of data is going to make this model very complex. It may be better to break the annotation models down by data type so that the metadata for each annotation type can be limited to what is relevant.

If I understand correctly, you would prefer to define small, simple classes/data types first, so that more complex classes can be flexibly composed from them? I would agree with that.
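As a rough illustration of that approach, the following Python sketch defines one small class per annotation type, each carrying only the metadata relevant to it, and composes them into a larger record. Everything here is an assumption made for the example: the class names `PhenotypeAnnotation`, `MeasurementAnnotation`, and `SubjectRecord`, their fields, and the `of_type` helper do not come from the draft schema.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class PhenotypeAnnotation:
    """Hypothetical annotation type: an ontology term plus optional evidence."""
    term_id: str
    evidence: str = ""


@dataclass
class MeasurementAnnotation:
    """Hypothetical annotation type: a measured value with assay and unit."""
    assay: str
    value: float
    unit: str


# The set of allowed annotation types can grow without changing existing ones.
Annotation = Union[PhenotypeAnnotation, MeasurementAnnotation]


@dataclass
class SubjectRecord:
    """Hypothetical container for the annotations of one animal or cohort."""
    subject_id: str
    annotations: list[Annotation] = field(default_factory=list)

    def of_type(self, kind: type) -> list:
        """Return only the annotations that are instances of the given class."""
        return [a for a in self.annotations if isinstance(a, kind)]


# Placeholder IDs and values, for illustration only.
record = SubjectRecord(
    "A1",
    [
        PhenotypeAnnotation("TERM:1"),
        MeasurementAnnotation("body weight", 22.0, "g"),
    ],
)
measurements = record.of_type(MeasurementAnnotation)
```

Because each annotation type is its own small class, adding a new data type means adding one class to the union rather than widening a single model with fields that are irrelevant to most records.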

sbello commented 7 months ago

Thanks @hansenp for the responses. Cindy asked that I look at the issues and provide feedback where I could, based on experience with MGI and the Alliance. From other projects, I've developed the habit of @'ing the person I'm responding to; apologies if the responses should not be directed at you. Much like you, I just want to offer thoughts to improve the draft where I can.

Unclear distinction between ExperimentalModel and ExperimentalCohort #7