GenomicsStandardsConsortium / mixs

“Minimum Information about any (X) Sequence” (MIxS) specification
https://w3id.org/mixs
Creative Commons Zero v1.0 Universal

What should be Core terms? #77

Closed only1chunts closed 2 years ago

only1chunts commented 4 years ago

Following on from @lschriml comments in ticket #71

Let's define together what we want in the core, as the concept of core has advanced over time with the inclusion of additional checklists. I would like to be able to distinguish the mandatory core (the set of terms used to assign the GSC keyword in INSDC) from the core terms more generally (terms that are used across all checklists and environmental packages). We should also consider how the core is put together: a core for MIxS as a whole (all checklists) and a core for specific packages.

Perhaps have MIxS Core - mandatory & non-mandatory - to be used across all checklists and packages then MIxS checklists - MIGS, MIMS, MIMARKS, MISAG, MIMAG, MIUViG as a new tab for the release

We could consider a tab for each checklist, as they are presented in the INSDC.

Would the "Core" terms then be a much reduced set compared to the current version, i.e. just the 10 terms that are mandatory?

submitted_to_insdc
investigation_type
project_name
lat_lon
geo_loc_name
collection_date
env_broad_scale
env_local_scale
env_medium
seq_meth

With 7 more that are "Conditional mandatory" for all checklists:

env_package
source_mat_id
nucl_acid_ext
nucl_acid_amp
adapters
url
sop

Currently, the MIxS checklists are presented as a single table of all checklists, with a column for each giving the designation of each term as: mandatory (M), conditional mandatory (C), optional (X), environment-dependent (E) or not applicable (-). If a term is "E", surely it should be moved to the relevant environmental package(s)? There are 3 terms that have "E" for all checklists:

depth
alt
elev
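As a rough sketch of how that single table could be queried, the snippet below models the five designations and finds the terms that carry the same designation in every checklist column. The `TABLE` rows are a small hand-picked sample based on the lists in this comment, not the full spreadsheet, and the six column names are the checklist groups named in this thread (an assumption; the real table may have more columns). `uniform_row` and `terms_uniformly` are hypothetical helper names.

```python
from enum import Enum


class Requirement(Enum):
    """The five designations used in the checklist table."""
    MANDATORY = "M"
    CONDITIONAL_MANDATORY = "C"
    OPTIONAL = "X"
    ENVIRONMENT_DEPENDENT = "E"
    NOT_APPLICABLE = "-"


# Assumption: one column per checklist group named in this thread.
CHECKLISTS = ["MIGS", "MIMS", "MIMARKS", "MISAG", "MIMAG", "MIUViG"]


def uniform_row(code):
    """Build a table row where every checklist has the same designation."""
    return {name: Requirement(code) for name in CHECKLISTS}


# Sample rows only; the designations follow the lists in this comment.
TABLE = {
    "lat_lon": uniform_row("M"),
    "env_package": uniform_row("C"),
    "depth": uniform_row("E"),
    "alt": uniform_row("E"),
    "elev": uniform_row("E"),
}


def terms_uniformly(table, requirement):
    """Return terms whose designation equals `requirement` in every column."""
    return [
        term
        for term, row in table.items()
        if all(value is requirement for value in row.values())
    ]


# Candidates for moving out of the checklists and into packages.
env_only = terms_uniformly(TABLE, Requirement.ENVIRONMENT_DEPENDENT)
print(env_only)
```

A query like this would flag any term that is "E" everywhere as a candidate for living only in the environmental packages.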

To simplify the checklists tab in the spreadsheet it may be appropriate to split the individual checklists into their own tabs, so that each checklist group (MIGS, MIMS, MIMARKS, MISAG, MIMAG, MIUViG) has its own set of Core-recommended terms, and the environmental packages are valid for use with ANY of the checklists. This should make it more obvious that any MIxS-compliant sequence record should make use of: mandatory core terms + 1 set of checklist core terms + 1 set of package terms (or a user-defined set of terms)
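To make the composition rule concrete, here is a minimal sketch. The mandatory core list is the 10 terms given above; the checklist and package term names are hypothetical placeholders (not real MIxS terms), and `compose_terms` is an illustrative helper, not part of any MIxS tooling.

```python
# The 10 mandatory core terms listed earlier in this comment.
MANDATORY_CORE = [
    "submitted_to_insdc", "investigation_type", "project_name", "lat_lon",
    "geo_loc_name", "collection_date", "env_broad_scale", "env_local_scale",
    "env_medium", "seq_meth",
]

# Hypothetical placeholder terms, for illustration only.
CHECKLIST_TERMS = {"MIMARKS": ["checklist_term_a", "checklist_term_b"]}
PACKAGE_TERMS = {"water": ["package_term_a", "package_term_b"]}


def compose_terms(checklist, package=None, user_terms=None):
    """Mandatory core + one checklist + either one package or user terms."""
    if (package is None) == (user_terms is None):
        raise ValueError("give exactly one of package or user_terms")
    if package is not None:
        extra = PACKAGE_TERMS[package]
    else:
        extra = list(user_terms)
    combined = MANDATORY_CORE + CHECKLIST_TERMS[checklist] + extra
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(combined))


terms = compose_terms("MIMARKS", package="water")
print(len(terms), terms[-4:])
```

Passing `user_terms=[...]` instead of `package=...` covers the user-defined case.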

e.g. a 16S rRNA study from Water would have:

only1chunts commented 3 years ago

This needs more discussion but is perhaps more of a question of presentation rather than functionality so I suggest this is not a requirement to resolve before v6 release.

Essentially "Core" is currently a collection of all terms that are NOT environmental in nature, i.e. they don't fit in "environmental packages".

At present my personal preference would be to stop using the phrase "core" and instead present a complete list of terms by "section". Currently, the terms use the sections: Investigation, Environment, Nucleic acid sequence source, sequencing, and Sampling environment. I cannot find a definition of these sections yet and they do not cover every term, so I suggest expanding the use of the sections to include:

Investigation - terms that link the sequence to larger projects and repositories, such as the accession number(s) in INSDC archives (e.g. investigation type, project name, and experimental factor)
administrative - terms relevant only to the use of MIxS packages and checklists (e.g. submitted to insdc, and environmental package)
sampling environment - metadata terms about the environment in which the sequenced sample was collected/obtained
specific sample metadata - non-environmental metadata specific to the actual specimen sampled and sequenced
generic sample metadata - non-environmental metadata that is generic but relevant to the specimen sampled and sequenced; this may include facts about the species in general that have not been specifically proven for the specimen sampled
methodological - terms describing the methods employed for either sample collection or processing
analysis - terms describing the subsequent findings and analysis of the sequenced specimen

This will enable users to find terms more easily when designing their own checklists from our collection of terms.

For presentation, when we have the RDF version it will be reasonably easy to generate an individual view for every checklist/package combination (i.e. 11 checklists x 16 env_packages = 176 specific checklists)
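Just to illustrate the scale of that, a quick sketch of enumerating every checklist/package combination. The names are numbered placeholders (assumptions), since the full lists of 11 checklists and 16 packages are not spelled out in this thread.

```python
from itertools import product

# Placeholder names only: 11 checklists and 16 environmental packages.
checklists = [f"checklist_{i:02d}" for i in range(1, 12)]
packages = [f"env_package_{i:02d}" for i in range(1, 17)]

# One specific view per (checklist, package) pair.
views = [f"{c}+{p}" for c, p in product(checklists, packages)]
print(len(views))
```

Each entry in `views` would correspond to one generated checklist that could then be reviewed individually.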

As an organization, the GSC can then ratify each of those specific checklists (and remove/adjust any that are inappropriate?!)

We can provide the terms list split by sections and allow users to generate their own checklists for specific projects from the terms. We (GSC-CIG) can then discuss and ratify each new suggested checklist at regular intervals, and add to the GSC list if approved.

ramonawalls commented 2 years ago

Notes from Jan. 24 meeting:

Core was intended to inform checklists (for different types of genomes). They are environment neutral.

The short list of required terms is required for all checklists.

Packages are environment specific. What is required for each environment varies.

We have a lot of new checklists and packages, and the border between them has blurred.

We should be able to use LinkML to organize terms according to section, checklist, and package.

We need to clarify usage of all of this in the paper to make it easier to understand.

Clarify what is important for usage versus development of new packages or checklists.

BioSample XML has a field for package. We need to look at how packages and checklists are represented in INSDC.

NCBI:

ENA:

ramonawalls commented 2 years ago

Plan for v.7: Clarify the sections, checklists, and packages in the linkml file so that INSDC-specific views can be generated as well as different packages and checklists more easily and consistently.

ramonawalls commented 2 years ago

At the CIG meeting on 8/23, we decided to move to editing MIxS with Schema Sheets, which provides a spreadsheet environment for editing LinkML schemas. As part of this change, we will no longer have separate sheets for core and packages, but put all terms (i.e. linkml slots) in a single sheet.

Although there is still some distinction between non-environmental and environmental terms, as @only1chunts mentioned above, that line has blurred. Nonetheless, there is still an important distinction between checklists (for different sequence types) and packages (for different sample types or sampling environments). We will use LinkML to specify which terms belong in which packages and checklists.
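As a rough illustration of the single-sheet idea (plain Python, not the LinkML or Schema Sheets API), each term can carry its own section, checklist, and package membership, and any view can then be filtered from the one list. The `Slot` class, the `view` and `by_section` helpers, and the assignment of `depth` to a `water` package are all assumptions made for this sketch; the section names come from the proposal earlier in this thread.

```python
from dataclasses import dataclass

ALL_CHECKLISTS = frozenset(
    {"MIGS", "MIMS", "MIMARKS", "MISAG", "MIMAG", "MIUViG"}
)


@dataclass(frozen=True)
class Slot:
    """One term (row in the single sheet) with its memberships."""
    name: str
    section: str
    checklists: frozenset
    # Empty set means the term is not tied to any particular package.
    packages: frozenset = frozenset()


SLOTS = [
    Slot("investigation_type", "investigation", ALL_CHECKLISTS),
    Slot("project_name", "investigation", ALL_CHECKLISTS),
    Slot("submitted_to_insdc", "administrative", ALL_CHECKLISTS),
    # Assumption for illustration: depth is tied to a "water" package.
    Slot("depth", "sampling environment", ALL_CHECKLISTS,
         frozenset({"water"})),
]


def view(slots, checklist, package):
    """Names of the terms that apply to one checklist/package pair."""
    return [
        s.name
        for s in slots
        if checklist in s.checklists
        and (not s.packages or package in s.packages)
    ]


def by_section(slots):
    """Group term names by section, keeping sheet order."""
    grouped = {}
    for s in slots:
        grouped.setdefault(s.section, []).append(s.name)
    return grouped


print(view(SLOTS, "MIMS", "water"))
print(by_section(SLOTS))
```

The same list then serves both the per-section browsing and the per-checklist/package views without keeping separate sheets.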