GenomicsStandardsConsortium / mixs

“Minimum Information about any (X) Sequence” (MIxS) specification
https://w3id.org/mixs
Creative Commons Zero v1.0 Universal

Registering replicate samples #26

Open ramonawalls opened 5 years ago

ramonawalls commented 5 years ago

From ENA:

At ENA we have also been receiving queries on how to register/declare replicate samples, including technical replicates: the same sample across multiple conditions, e.g. the same physical sample from the same person sequenced twice. Since these are the same physical sample, ENA has been advising submitters to create one sample and two experiments, the first experiment pointing to X in its library_name and the second experiment pointing to X_2 in its library_name. A real example can be seen here:

ramonawalls commented 5 years ago

We discussed this at a call on June 12, 2019, and propose the addition of the following two new terms for MIxS:

1: label: replicate status

type: CV

definition: Indicate if the sample is a technical or biological replicate. Technical replicates sample the same entity multiple times, e.g., aliquots from a single water or soil sample collected in the field or multiple sequencing runs on subsamples of a single liver tissue sample from one patient. Biological replicates sample different entities in the same populations, e.g., liver tumors from 5 different patients under the same set of conditions (control or treatment) or multiple soil samples collected in a single field. If not a replicate, leave blank. All technical replicates should share a single sampleID, but there will be one experiment per replicate. Each biological replicate should have a unique sampleID and the biological replicate set field must have a value.

values: technical replicate, biological replicate

2: label: biological replicate set

type: string

definition: An identifier for the set of samples that form a set of biological replicates. Must be unique within a BioProject. May be globally unique.
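The rules in the two proposed definitions can be sketched as a small check. This is only an illustration, not part of MIxS or any INSDC tooling: the field names `sample_id`, `replicate_status` and `biological_replicate_set`, and the function `validate_samples`, are hypothetical, and the sketch assumes one record per registered sample.

```python
# Hypothetical sketch of the proposed rules; none of these names come from MIxS.
ALLOWED_STATUS = {"", "technical replicate", "biological replicate"}


def validate_samples(samples):
    """Return a list of problems found in a list of sample records (dicts)."""
    problems = []
    seen_ids = set()
    for sample in samples:
        sample_id = sample["sample_id"]
        status = sample.get("replicate_status", "")  # blank means "not a replicate"
        replicate_set = sample.get("biological_replicate_set", "")

        if status not in ALLOWED_STATUS:
            problems.append(f"{sample_id}: unknown replicate status {status!r}")

        if status == "biological replicate":
            # Each biological replicate must name the set it belongs to ...
            if not replicate_set:
                problems.append(f"{sample_id}: biological replicate without a replicate set")
            # ... and must have its own sample ID.
            if sample_id in seen_ids:
                problems.append(f"{sample_id}: biological replicates need unique sample IDs")

        seen_ids.add(sample_id)
    return problems
```

Technical replicates are deliberately not checked for uniqueness here, since under the proposal they share a single sampleID and are distinguished only at the experiment level.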

only1chunts commented 5 years ago

Would a relationship qualifier be more appropriate than a term called "biological replicate set"? For example, in GigaDB we already use the term "sample relationship" with the definition "please include the relationship type and the sample name, ID or accession", e.g. "sample_relationship = isSiblingOf:X" or "sample_relationship = isDerivedFrom:Y". This can be extended to include "isBiologicalReplicateOf: sample 1", etc. For the only two examples of its use in GigaDB so far, see http://dx.doi.org/10.5524/100276 and http://dx.doi.org/10.5524/100445
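A composite value of this kind is easy to split mechanically. The following is a minimal sketch, assuming the value has the form `relationType:target`; the names `SampleRelationship` and `parse_sample_relationship` are hypothetical and not part of GigaDB or MIxS. The separator is a parameter on the assumption that other resources may use a different delimiter.

```python
from typing import NamedTuple


class SampleRelationship(NamedTuple):
    relation: str  # e.g. "isSiblingOf", "isDerivedFrom"
    target: str    # sample name, ID or accession


def parse_sample_relationship(value, sep=":"):
    """Split 'relation<sep>target' at the first separator only."""
    relation, found, target = value.partition(sep)
    relation, target = relation.strip(), target.strip()
    if not found or not relation or not target:
        raise ValueError(f"expected 'relation{sep}target', got {value!r}")
    return SampleRelationship(relation, target)
```

Splitting only at the first separator means a target that itself contains a colon (an accession or identifier, for instance) is kept intact.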

jdeck88 commented 5 years ago

I see any sample as originating from some event. Thus, we can infer the biological replicate set for a set of children by saying they originated from some common event. I think, Chris, you are referring to this in the syntax "sample_relationship = isDerivedFrom:Y". Or, I would say: sample1 event1 sample2 event2 etc...
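The inference described here can be sketched as a simple grouping step. This assumes each sample record carries the ID of the event it originated from and that replicates share that ID; the function name `replicate_sets_by_event` and the mapping shape are hypothetical, not taken from GEOME or MIxS.

```python
from collections import defaultdict


def replicate_sets_by_event(sample_to_event):
    """Group sample IDs by originating event ID.

    `sample_to_event` maps sample ID -> event ID. Only events that produced
    more than one sample are returned, since a single sample is not a set.
    """
    by_event = defaultdict(list)
    for sample_id, event_id in sample_to_event.items():
        by_event[event_id].append(sample_id)
    return {event: sorted(ids) for event, ids in by_event.items() if len(ids) > 1}
```

With this approach no explicit "replicate set" field is needed on the sample; the set is derived from the shared event identifier.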

I think we can put aside inevitable arguments about what an event means and focus on specifying an eventID (or "process which disambiguates sample from its surrounding environment identifier"?)

An example of this in GEOME is at: https://geome-db.org/record/ark:/21547/CXl2MOO_26(0.5_0)

John

only1chunts commented 5 years ago

Thanks John. When I click your example link it just says "Requested Resource is Forbidden", so I can't see that, but it seems to me that the method I described will easily handle your example, and can be used to include any other relationships between samples that people might want to add as well. Using your example, an attribute for sample 1 would be "sample_relationship=isBiologicalReplicateOf:event1". This gives the reader (both man and machine) the ability to interpret the information correctly. If I recall correctly, GSC also uses this same syntax for the attribute "host_family_relationship=" in the human microbiome packages, where the value is a composite of the relationship type with the related sample ID (although looking at the v5 spreadsheet we use ";" instead of ":" to separate, but I can't use ";" in GigaDB as it clashes with other syntax). From the original post in this thread, that same example would need to be entered as "replicate_status=biological" and "biological_replicate_set=event1". In fact, only one of those is required to tell the user the sample is a biological replicate, so if we go down that route we should remove the redundancy and just have "technical replicate set=" or "biological replicate set=" as checklist items, with no need for "replicate status=".

jdeck88 commented 5 years ago

Thanks Chris... First... I updated the example using a publicly accessible one (doh!)

Second... Yes, think we're on the same page, but wouldn't it be better to just say something like replicateParent=event1 instead of "sample_relationship=isBiologicalReplicateOf:event1" ?

only1chunts commented 5 years ago

I was just trying to enable the attribute to include all types of relationships, not just replicates. But if we want to separate out replicate relationships from time series, different molecular extracts (DNA/RNA/metabolites), siblings or any other types of sample relationship, then the variable name needs to be specific, e.g. biologicalReplicate= and technicalReplicate=

Be nice to see if anyone else has an opinion on this too?


ramonawalls commented 5 years ago

Chris and John - In fact, we had a brief discussion at this meeting about the more general need to be able to specify relationships between samples, not just whether or not they are part of a replicate. The solution proposed above was intended to meet the immediate needs of the INSDC biosample databases, with a recognition that it is not an ideal long term solution. I fully support being able to create any sort of relationship needed between/among samples and being able to link samples to the events that create them.

The reason for the seemingly redundant terms is that if the replicates are technical replicates, then ENA suggests creating only a single biosample, since there are multiple experiments (the experiments in fact representing the replicates). In that case, there is no need to create a parent event or relationship among samples, as in BioSamples DB there is only one sample. The ontologist in me is uncomfortable with this, since there are in fact multiple subsamples. Even if they don't need multiple BioSample IDs, there should be a better way to describe the replication process than just saying three experiments were done on one sample (which is ambiguous about whether or not subsampling took place).

I think we need more discussion before reaching a final decision on this. ENA needs to move forward quickly, but we should not let that push us to adopt a solution we won't be happy with in the long term.

only1chunts commented 3 years ago

@ndheilly just added a new term request #109 that is related to this.