GenomicsStandardsConsortium / mixs

“Minimum Information about any (X) Sequence” (MIxS) specification
https://w3id.org/mixs
Creative Commons Zero v1.0 Universal

provide per-package slims of ENVO for env triads #118

Open cmungall opened 3 years ago

cmungall commented 3 years ago

This is something we are doing in NMDC that we want to push up to the standards level

Submitters often find it difficult to select the correct ENVO terms. This is compounded by the lack of suitable ontology browsing tools and by the prevalence of spreadsheet-based data submission over dedicated tools with the intelligent, context-aware support for term selection that we see in other areas of biocuration. It is also made more difficult by ENVO's move away from a system whereby each term came from one of three hierarchies. Things are more open-ended now, which leads to more submitter/annotator confusion. This is evident from the extremely poor quality of ENVO annotations in INSDC.

As a partial solution, we should have recommended slims for each package/field combination. Submitters/annotators can still select terms outside these slims, but the slims would serve as the starting point. Even if submitters restrict themselves to the recommended terms, I hypothesize the gain in accuracy would vastly outweigh the loss in precision.

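I suggest a 3-column format:

package | field (env_X) | valid ENVO term

An entry in this table means that the ENVO term is valid for the package/field combination.

We could also have a fourth column:

package | field (env_X) | valid ENVO term | ENVO local name

if we want to rename some of the more abstract ENVO labels in a local context.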
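
A minimal sketch of how such a table could be consumed, assuming it is stored as tab-separated text. The column names (`package`, `field`, `envo_term`, `local_name`), the function names, and the two rows are illustrative assumptions, not a vetted slim; the ENVO IDs should be checked against ENVO before any real use.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only (assumed, not a vetted slim). Columns follow the
# proposed format: package, field, ENVO term, optional local name.
SLIM_TSV = """package\tfield\tenvo_term\tlocal_name
soil\tenv_medium\tENVO:00001998\tsoil
soil\tenv_broad_scale\tENVO:00000446\tterrestrial biome
"""


def load_slims(text):
    """Return {(package, field): {envo_id: local_name}} from TSV text."""
    slims = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        key = (row["package"], row["field"])
        slims[key][row["envo_term"]] = row["local_name"]
    return dict(slims)


def is_recommended(slims, package, field, envo_id):
    """True if the term is in the slim for this package/field combination."""
    return envo_id in slims.get((package, field), {})


def dropdown_choices(slims, package, field):
    """Labels suitable for a spreadsheet dropdown, sorted by ENVO ID."""
    terms = slims.get((package, field), {})
    return [f"{name} [{envo_id}]" for envo_id, name in sorted(terms.items())]


slims = load_slims(SLIM_TSV)
print(is_recommended(slims, "soil", "env_medium", "ENVO:00001998"))
print(dropdown_choices(slims, "soil", "env_broad_scale"))
```

Because terms outside the slim remain allowed under the proposal, a `False` from `is_recommended` would be a prompt for review rather than a validation error.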

(this format also cleanly maps to the LinkML YAML format, which is how I envision us maintaining this moving forward)
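
One way the mapping to LinkML might look, sketched in Python: each package/field combination becomes an enum whose `permissible_values` carry the ENVO ID in `meaning`. The enum naming scheme and the single example term are assumptions for illustration, and the structure is printed as JSON here only to avoid a YAML dependency.

```python
import json


def to_linkml_enum(package, field, terms):
    """Build a LinkML-style enum definition from (envo_id, label) pairs."""
    enum_name = f"{package}_{field}_enum"  # hypothetical naming scheme
    return {
        enum_name: {
            "permissible_values": {
                label: {"meaning": envo_id} for envo_id, label in terms
            }
        }
    }


# Illustrative term only; not a vetted slim.
enum = to_linkml_enum("soil", "env_medium", [("ENVO:00001998", "soil")])
print(json.dumps({"enums": enum}, indent=2))
```

The label used as the permissible value could be the ENVO local name from the 4-column variant, with `meaning` keeping the link back to the ENVO term.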

This can also be easily implemented via dropdowns in spreadsheets

We in NMDC can get us started with a selection for soil

Note that as tooling becomes more sophisticated, we can have less primitive ways of guiding users to the right terms, but we have to start with something that works within the current tooling ecosystem.

lschriml commented 3 years ago

Can this be ready to implement in MIxS 6 by May?

(In reply to Chris Mungall's comment of Mar 10, 2021.)

only1chunts commented 3 years ago

I love the idea and agree it would probably make a huge difference to the ease of use. However, I think it's a massive undertaking to generate all the required slims and have them vetted by the relevant user groups for every environmental package before May. We can make a start as soon as anyone has the bandwidth, but I think it's unrealistic to have it ready for public consumption by May. We should schedule its release for MIxS v7 instead. Can the suggested slims be treated in the same way as our controlled vocabulary fields? Or, to put it another way, can our other controlled vocabulary fields use the same technology as these slims? After all, a CV is just a slim of the English language!

pbuttigieg commented 3 years ago

This makes sense and connects to some of our subsets in ENVO.

I agree with @only1chunts that this is more likely to be a MIxS 7 target. However, I think we should release a general suggestion for further revision, rather than wait for full consensus.

lschriml commented 3 years ago

Sounds good.

(In reply to Pier Luigi Buttigieg's comment of Mar 16, 2021.)