GenomicsStandardsConsortium / mixs

“Minimum Information about any (X) Sequence” (MIxS) specification
https://w3id.org/mixs
Creative Commons Zero v1.0 Universal

add 'Sample Name' to MIxS core #78

Closed lschriml closed 3 years ago

lschriml commented 3 years ago

New term details

For us to assess a new term request we require the following details:

Term name - sample name 
Term ID - MIXS:0001107
Structured comment name - [a less than 20char no spaces version of the name]: sample_name 
Definition - [a clear and concise description of the term including ]
   INSDC definition: Sample Name is a name that you choose for the sample. It can have any format, but we suggest that you make it concise, unique and consistent within your lab, and as informative as possible. Every Sample Name from a single Submitter must be unique. 

Expected value - [e.g. text or EFO and/or OBI etc...]
Value syntax - [e.g. {float} {unit}|{termLabel} {[termID]}|{text}|{timestamp} etc...]
   {text}
Example - [provide an example value]
Preferred unit - [if appropriate]
Package(s) - [list any packages that should include the new term]
MIxS core

Additional context

Add any other context about the new term here.
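
The INSDC rule quoted in the definition (every Sample Name from a single Submitter must be unique) can be checked mechanically before submission. The following is only a minimal sketch: the record layout, the `submitter` key, and the sample names are hypothetical and are not part of MIxS or any INSDC tool.

```python
from collections import Counter


def duplicate_sample_names(records):
    """Return the (submitter, sample_name) pairs that occur more than once.

    `records` is an iterable of dicts with the hypothetical keys
    "submitter" and "sample_name".
    """
    counts = Counter((r["submitter"], r["sample_name"]) for r in records)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical records: lab_a reuses "S1", while lab_b may use "S1" freely
# because uniqueness is only required within a single submitter.
records = [
    {"submitter": "lab_a", "sample_name": "S1"},
    {"submitter": "lab_a", "sample_name": "S2"},
    {"submitter": "lab_a", "sample_name": "S1"},
    {"submitter": "lab_b", "sample_name": "S1"},
]

print(duplicate_sample_names(records))
```

Keying on the pair rather than on the name alone reflects that the rule is scoped to a submitter, not global.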

lschriml commented 3 years ago

AI: add to the core, add to each package for MIXS v6.0

dehays commented 3 years ago

Will sample_name be a mandatory or optional core term in V6?

lschriml commented 3 years ago

source material id = intended to be their globally unique identifier

sample name, yes, make it a mandatory field

msweetlove commented 3 years ago

Could you define what a "sample" means? Does this refer to 1) a physical object (i.e. a piece of tissue or a chunk of soil in a laboratory's freezer, or one that was destroyed to extract and sequence the DNA), or 2) a digital object resulting from the analysis of a physical object (i.e. the DNA sequences stored on a hard drive)?

I think it is important to distinguish between these things, as DNA can be extracted and sequenced from different subsamples of a physical sample (e.g. a nested design of 'replicates' from an environmental soil sample to look at micro-scale differences in bacterial community composition; in this case there are multiple (digital) DNA "samples" related to a single (physical) soil sample). I'm not sure the term sample_name could capture this complexity, and it may cause confusion when confronted with such cases...

Would it be a solution to do something along the lines of dwc:event/dwc:eventID (http://rs.tdwg.org/dwc/terms/Event) and dwc:parentEventID (http://rs.tdwg.org/dwc/terms/parentEventID) in DarwinCore? There, an "event" (e.g. a sampling campaign, an environmental sample, or the DNA sequences from the bacterial community in that environmental sample) can be embedded in a parent event. Each parent event can in turn be listed as an event with its own parent event. For example, the DNA sequences (event A1) can come from an environmental soil sample (S1) collected during a campaign (C1); the relation between these events would then be C1:S1:A1.
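
To make the parent-event idea concrete, here is a rough sketch of how the C1:S1:A1 relation could be derived from per-event parent links. The `parent_of` mapping and the `lineage` helper are hypothetical illustrations, not DarwinCore or MIxS tooling; the second sequencing event A2 is added only to show two digital "samples" sharing one physical sample.

```python
def lineage(event_id, parent_of):
    """Walk parent links upward and return the chain as 'root:...:event_id'."""
    chain = []
    seen = set()
    current = event_id
    while current is not None:
        if current in seen:
            # Guard against malformed data in which an event is its own ancestor.
            raise ValueError(f"cycle in parent links at {current!r}")
        seen.add(current)
        chain.append(current)
        current = parent_of.get(current)
    return ":".join(reversed(chain))


# Each event points at its parent event; the campaign C1 has no parent.
parent_of = {
    "A1": "S1",  # DNA sequences from soil sample S1
    "A2": "S1",  # a second set of sequences from the same soil sample
    "S1": "C1",  # soil sample collected during campaign C1
    "C1": None,
}

print(lineage("A1", parent_of))  # C1:S1:A1
print(lineage("A2", parent_of))  # C1:S1:A2
```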

lschriml commented 3 years ago

added to MIxS core, checklists and packages

only1chunts commented 3 years ago

The comment from @msweetlove has not been addressed yet. Much of the issue raised is actually about sample relationships, which are also discussed in #109. I suggest we move forward with this term as suggested, for v6. The sample relationship issue still needs more discussion.

ramonawalls commented 3 years ago

I agree that we need to move forward with this for MIxS6 and address relations elsewhere, however, @msweetlove raises a few important questions about what a sample is that are relevant to this term. I also suggest updating the definitions of sample name and source material id to clarify the difference between them. For sample name, I copied some text from source_mat_id to clarify that a sample name refers to a material sample.

I don't think sample name needs to be required, as it is generally a meaningless local ID. @lschriml do you have more input on why it should be required?

Proposed changes:

Term name - sample name 
Term ID - MIXS:0001107
Structured comment name - sample_name 
Definition - A local identifier or name for the material sample used for extracting nucleic acids, and subsequent sequencing. It can refer either to the original material collected or to any derived sub-samples. It can have any format, but we suggest that you make it concise, unique and consistent within your lab, and as informative as possible. INSDC requires every Sample Name from a single Submitter to be unique. Use of a globally unique identifier for the field source_mat_id is preferred over sample_name.
Expected value - string
Value syntax -    {text}
Example - RLW103
Preferred unit - 
Package(s) - core

Term name - source material identifier (NOTE: was source material identifiers, plural)
Term ID - MIXS:0000026
Structured comment name - source_mat_id
Definition - A unique identifier assigned to a material sample (as defined by http://rs.tdwg.org/dwc/terms/materialSampleID, and as opposed to a particular digital record of a material sample) used for extracting nucleic acids, and subsequent sequencing. The identifier can refer either to the original material collected or to any derived sub-samples. The INSDC qualifiers /specimen_voucher, /bio_material, or /culture_collection may or may not share the same value as the source_mat_id field. For instance, the /specimen_voucher qualifier and source_mat_id may both contain 'UAM:Herps:14', referring to both the specimen voucher and sampled tissue with the same identifier. However, the /culture_collection qualifier may refer to a value from an initial culture (e.g. ATCC:11775) while source_mat_id would refer to an identifier from some derived culture from which the nucleic acids were extracted (e.g. xatc123 or ark:/2154/R2). **No change.**
Expected value - URI form of the permanent identifier
Value syntax - {PMID}|{DOI}|{URL}
Example - https://n2t.net/ark:/21547/Cjl2RHODO_SJP_131
Preferred unit - 
Package(s) - core
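
As a rough illustration of the value syntax {PMID}|{DOI}|{URL} listed for source_mat_id, the sketch below classifies a candidate value. The regular expressions and the `classify_identifier` helper are assumptions made for this example only; MIxS does not prescribe these patterns.

```python
import re
from urllib.parse import urlparse

# Assumed, simplified patterns (not taken from the MIxS specification).
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
PMID_RE = re.compile(r"^\d{1,8}$")


def classify_identifier(value):
    """Return "URL", "DOI", "PMID", or None if the value matches none of them."""
    value = value.strip()
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "URL"
    if DOI_RE.match(value):
        return "DOI"
    if PMID_RE.match(value):
        return "PMID"
    return None


# The example value from the proposal is in URL form.
print(classify_identifier("https://n2t.net/ark:/21547/Cjl2RHODO_SJP_131"))
# A purely local identifier matches none of the three assumed forms.
print(classify_identifier("xatc123"))
```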

lschriml commented 3 years ago

Hello Ramona, Sample name is a required field for INSDC. We discussed it on one of the CIG calls and agreed to add it.

Cheers, Lynn

On Thu, May 20, 2021 at 12:35 PM Ramona Walls @.***> wrote:

-- Lynn M. Schriml, Ph.D. Associate Professor

Institute for Genome Sciences University of Maryland School of Medicine Department of Epidemiology and Public Health 670 W. Baltimore St., HSFIII, Room 3061 Baltimore, MD 21201 P: 410-706-6776 | F: 410-706-6756 @.***

ramonawalls commented 3 years ago

Thanks @lschriml ! Makes perfect sense.

In that case, I will update my suggestion for sample name slightly to below:

Term name - sample name 
Term ID - MIXS:0001107
Structured comment name - sample_name 
Definition - A local identifier or name for the material sample used for extracting nucleic acids, and subsequent sequencing. It can refer either to the original material collected or to any derived sub-samples. It can have any format, but we suggest that you make it concise, unique and consistent within your lab, and as informative as possible. INSDC requires every sample name from a single Submitter to be unique. Use of a globally unique identifier for the field source_mat_id is recommended in addition to sample_name.
Expected value - string
Value syntax -    {text}
Example - RLW103
Preferred unit - 
Package(s) - core
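
For anyone who wants to handle the term programmatically, the fields above map naturally onto a small record. This is only a sketch: the `TermDefinition` class and its `problems` check are hypothetical, and the check encodes just the template's rule that the structured comment name be under 20 characters with no spaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermDefinition:
    """Hypothetical container for one term request from this thread."""
    name: str
    term_id: str
    structured_comment_name: str
    value_syntax: str
    example: str
    required: bool

    def problems(self):
        """Return a list of rule violations for the structured comment name."""
        issues = []
        scn = self.structured_comment_name
        if len(scn) >= 20:
            issues.append("structured comment name must be under 20 characters")
        if any(ch.isspace() for ch in scn):
            issues.append("structured comment name must not contain spaces")
        return issues


# Values copied from the proposal; required=True follows the decision
# earlier in the thread to make the field mandatory.
sample_name = TermDefinition(
    name="sample name",
    term_id="MIXS:0001107",
    structured_comment_name="sample_name",
    value_syntax="{text}",
    example="RLW103",
    required=True,
)

print(sample_name.problems())  # []
```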

only1chunts commented 3 years ago

@ramonawalls, I have updated the definition in all tabs of the spreadsheet as per your suggestion above, using the "find and replace all" option in Google Sheets; it reported 26 replacements in total.