GenomicsStandardsConsortium / mixs

"Minimum Information about any (X) Sequence" (MIxS) specification
https://w3id.org/mixs
Creative Commons Zero v1.0 Universal

Update definitions for size_frac_low and size_frac_up? #566

Closed: isanti closed this issue 3 weeks ago

isanti commented 1 year ago

**Current term details**
Please supply the current details of the term that you would like to update:

Term name: "size_frac_low" AND "size_frac_up"
Term ID: "MIXS:0000735" AND "MIXS:0000736"
Structured comment name: "size_frac_low" AND "size_frac_up" 
Definition: "Refers to the mesh/pore size used to pre-filter/pre-sort the sample. Materials larger than the size threshold are excluded from the sample" AND "Refers to the mesh/pore size used to retain the sample. Materials smaller than the size threshold are excluded from the sample"
Expected value: "value" AND "value"
Value syntax: "{float} {unit}" AND "{float} {unit}"
Example: "0.2 micrometer" AND "20 micrometer"
Preferred unit: "micrometer" AND "micrometer"
Package(s): "water" AND "water"

**Suggested update(s)**
Please supply the new suggestions for any of the details listed below (only insert text for those details that should be updated):

Term name: "size_frac_low"
Definition: "Refers to the mesh/pore size used to retain the sample. Materials smaller than the size threshold are excluded from the sample" 
AND
Term name: "size_frac_up"
Definition: "Refers to the mesh/pore size used to pre-filter/pre-sort the sample. Materials larger than the size threshold are excluded from the sample"

**Additional context**
I believe the definitions for the terms "size_frac_low" and "size_frac_up" are reversed.
"size_frac_low" should be: "Refers to the mesh/pore size used to retain the sample. Materials smaller than the size threshold are excluded from the sample"
"size_frac_up" should be: "Refers to the mesh/pore size used to pre-filter/pre-sort the sample. Materials larger than the size threshold are excluded from the sample"
For water filtration, we usually refer to the upper threshold as the pre-filtration threshold, i.e., anything larger than the upper threshold is excluded from the sample and anything smaller than the upper threshold passes through.
And we usually refer to the lower threshold as the pore size used to retain the sample, i.e., anything smaller than the lower threshold is excluded from the sample and anything larger than the lower threshold remains on the filter.
Everything other than the definitions looks fine; even the examples would be correct if the definitions were swapped.
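
To make the proposed reading concrete, here is a minimal Python sketch of the filtration logic described above. The function name `in_size_fraction` and the inclusive boundaries are assumptions made for illustration only; they are not part of the MIxS specification.

```python
def in_size_fraction(particle_um: float,
                     size_frac_low_um: float,
                     size_frac_up_um: float) -> bool:
    """Return True if a particle of the given size ends up in the sample.

    Hypothetical helper: thresholds are treated as inclusive, which is an
    assumption, not something the term definitions state.
    """
    if size_frac_low_um > size_frac_up_um:
        raise ValueError("lower threshold must not exceed upper threshold")
    # Upper threshold (pre-filter): material larger than it is excluded.
    passes_prefilter = particle_um <= size_frac_up_um
    # Lower threshold (retaining filter): material smaller than it is excluded.
    retained_on_filter = particle_um >= size_frac_low_um
    return passes_prefilter and retained_on_filter


# With the example values from the term details (0.2 and 20 micrometer):
print(in_size_fraction(5.0, 0.2, 20.0))   # inside the fraction
print(in_size_fraction(0.1, 0.2, 20.0))   # passes through the retaining filter
print(in_size_fraction(50.0, 0.2, 20.0))  # removed by the pre-filter
```

Under this reading, the example values only make sense if 0.2 micrometer is the retaining pore size and 20 micrometer is the pre-filter mesh size.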

mslarae13 commented 6 months ago

see https://github.com/GenomicsStandardsConsortium/mixs/issues/566 https://github.com/GenomicsStandardsConsortium/mixs/issues/699

cpavloud commented 3 months ago

Adding a comment, because I believe this issue really needs to be resolved.

The GSC definitions on the MIxS website for size_frac_low and size_frac_up are reversed and therefore not logical, as @isanti also mentioned when opening the issue.

However, the definitions in the ENA checklists are correct:

size-fraction lower threshold: Refers to the mesh/pore size used to retain the sample. Materials smaller than the size threshold are excluded from the sample

size-fraction upper threshold: Refers to the mesh/pore size used to pre-filter/pre-sort the sample. Materials larger than the size threshold are excluded from the sample.

And this causes a discrepancy.

lschriml commented 3 months ago

Before these changes are made, let's contact the original group that worked on this. I would suggest Frank Oliver Glockner, Renzo Kottmann, and Linda Amaral Zettler.

We have to be cautious about editing these fields. We can misconstrue what was meant when the fields were created.

lschriml commented 3 months ago

We should also look into how the field has been populated.

turbomam commented 3 months ago

As of February, there were 41,147 Biosamples (out of 37,572,120) where either size_frac, size_frac_low or size_frac_up was not null (from an SQL perspective). I asked the Gemini LLM to summarize it for me, but that errored out. I can share the full file or I can work on a summarization over the next day or two.
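
For reference, the figures above mean that roughly 0.11% of Biosamples (41,147 of 37,572,120) populate any of these fields. The kind of "not null" count described can be sketched as follows. This is only an illustration: the table name `biosample`, its column layout, and the four toy rows are invented for the example and do not reflect the actual database or its contents.

```python
import sqlite3

# In-memory toy database; table name and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE biosample ("
    "id INTEGER PRIMARY KEY, size_frac TEXT, "
    "size_frac_low TEXT, size_frac_up TEXT)"
)
toy_rows = [  # invented illustrative rows, not real Biosample records
    (None, None, None),
    ("0.22 micrometer", None, None),
    (None, "0.2 micrometer", "20 micrometer"),
    (None, None, None),
]
conn.executemany(
    "INSERT INTO biosample (size_frac, size_frac_low, size_frac_up) "
    "VALUES (?, ?, ?)",
    toy_rows,
)

# Count records where at least one of the three fields is populated.
populated = conn.execute(
    "SELECT COUNT(*) FROM biosample "
    "WHERE size_frac IS NOT NULL "
    "OR size_frac_low IS NOT NULL "
    "OR size_frac_up IS NOT NULL"
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM biosample").fetchone()[0]

print(f"{populated} of {total} toy records populate a size-fraction field")
```

A follow-up query grouping by the distinct values of each column would be one way to start the summarization without relying on an LLM.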

mslarae13 commented 3 months ago

@lschriml are you concerned that there's just a misunderstanding of how to use the slots, and not a mistake in the descriptions? I did recently submit a PR to make the change. Maybe we should discuss at CIG

lschriml commented 3 months ago

I want to check that we document why things are being changed, so that we can trace it back if asked, and to determine when/how it got changed before.

lschriml commented 3 months ago

It may also be useful to be clear about the vocabulary used for LinkML and the equivalent terms used in MIxS, so that it is transparent to the world what we are discussing, and to avoid any confusion.

cpavloud commented 3 months ago

In case it is helpful, I noticed some things:

-- In the original MIMARKS publication, in the Supplementary Results 2, there are no terms for size_frac_low and size_frac_up in the water checklist
-- If I am not mistaken, the terms first appeared in mixs5
-- size_frac_low and size_frac_up are also included in the FoodFarmEnvironment and in the Agriculture checklists
-- NCBI has kept the original MIxS definitions (which creates even further discrepancies since ENA has reversed the definitions)

Woolly-at-EBI commented 3 months ago

Brilliant, thanks for moving this on Montana.

Good, yes worth discussing at the CIG today. Good point Lynn, yes the original group may have had a different train of thought in their heads. Has anyone asked them? Shall I? (I don't know them though.) Agree, such changes need to be tracked. At ENA we now have a change log, but it does not always capture the why.

Our (ENA's) marine data expert (Stephane Pesant) was the one who originally flagged the size fraction discrepancy. It is so valuable having people in the office who are actively preparing and submitting samples via different checklists, like Stephane, and also others who are on the frontline dealing with helpdesk queries. This came up when we were consolidating older terms in the "ENA Tara Oceans" checklist with conceptually similar terms in MIxS 6.2. (You are right, Christina, those size fraction terms were in MIxS 5.0.) We still have much technical metadata debt in ENA, on top of that in GSC; addressing it is collectively important to increase FAIRness.

Christina, you now have me thinking further about what to do with similar cases in the future, where a definition change is major, such as this case about the mesh/pore size. As you indicate, it then makes the definition inconsistent across INSDC and others using the GSC MIxS, until the others have made the change too. I am going to raise this as a minor discussion point at our weekly internal ENA content meeting later this morning. My natural tendency, and our ENA group's, is to fix what are considered errors and move on, but yes, we may have misunderstood what the original authors meant.

cpavloud commented 3 months ago

@Woolly-at-EBI I think Stephane was absolutely right to flag the discrepancy, and I think ENA was also absolutely right to switch the definitions, because this is what makes sense (from the scientist's perspective).

If you check the examples provided for these terms, you will see that for size_frac_low, the example value is 0.2 micrometer and for size_frac_up it is 20 micrometer. Which is perfectly fine, as it should be, and makes me think that something like a wrong copy-pasting was the reason why the definitions are reversed.

cpavloud commented 3 months ago

Just to comment (again) that NMDC also uses the (wrong) definitions for size_frac_low and size_frac_up.

lschriml commented 3 months ago

The water package was one of the first ones we created. I would suggest checking with Pelin Yilmaz, as she led these efforts. @pyilmaz (pyilmaz.mgx@gmail.com) It looks like whatever we had for water was then copied to the other packages.

I will also forward this to our GSC board members.

For documentation, let's see if we can find these definitions in publications, marine sites.

[MIxSv6_release.xlsx](https://github.com/user-attachments/files/16379132/MIxSv6_release.xlsx)

I was curious how these terms were listed in MIxS 6 release (attached):

-- the terms are in 3 packages (food-farm environment, water, agriculture):

size_frac_low (size-fraction lower threshold)
definition: Refers to the mesh/pore size used to pre-filter/pre-sort the sample. Materials larger than the size threshold are excluded from the sample
Example: 0.2 micrometer

size_frac_up (size-fraction upper threshold)
definition: Refers to the mesh/pore size used to retain the sample. Materials smaller than the size threshold are excluded from the sample
Example: 20 micrometer

Cheers, Lynn

cpavloud commented 3 months ago

Who should contact Pelin Yilmaz?

lschriml commented 3 months ago

I have emailed Pelin and the board.


-- Lynn M. Schriml, Ph.D. Associate Professor

Institute for Genome Sciences University of Maryland School of Medicine Department of Epidemiology and Public Health 670 W. Baltimore St., HSFIII, Room 3061 Baltimore, MD 21201 P: 410-706-6776 | F: 410-706-6756 @.***

only1chunts commented 3 months ago

As a current board member, my vote goes with the need for a correction to be made in the GSC definitions of those two terms. I think the clincher is the fact that the example values included show the intent. If you use those examples of size_frac_low=0.2 and size_frac_up=20, then it is obvious you want to exclude particles outside that range. But I also think the names and definitions could be made clearer somehow.
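
The point about the example values can be checked mechanically. Below is a minimal Python sketch, assuming the "{float} {unit}" value syntax quoted in the term details; the regular expression and the function names `parse_value` and `thresholds_consistent` are hypothetical helpers written for this illustration, not part of any MIxS tooling.

```python
import re

# Loose reading of the "{float} {unit}" value syntax (an assumption):
# a decimal number, whitespace, then a free-text unit.
VALUE_RE = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s+(\S.*?)\s*$")


def parse_value(text: str) -> tuple[float, str]:
    """Split a '{float} {unit}' string into (number, unit)."""
    match = VALUE_RE.match(text)
    if match is None:
        raise ValueError(f"not in '{{float}} {{unit}}' form: {text!r}")
    return float(match.group(1)), match.group(2)


def thresholds_consistent(size_frac_low: str, size_frac_up: str) -> bool:
    """True if the lower threshold does not exceed the upper threshold."""
    low_value, low_unit = parse_value(size_frac_low)
    up_value, up_unit = parse_value(size_frac_up)
    if low_unit != up_unit:
        # No unit conversion is attempted in this sketch.
        raise ValueError("units differ; convert before comparing")
    return low_value <= up_value


# The published example values are consistent with the term names:
print(thresholds_consistent("0.2 micrometer", "20 micrometer"))
```

A check like this could also be run over existing submissions to see how the fields have been populated in practice, which would show whether submitters followed the term names or the written definitions.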

mslarae13 commented 3 months ago

@cpavloud I tried to tag you in a comment in my PR, but your username wasn't showing up. I'll try again. But once we hear back from Pelin & get a board approval, we'll be able to merge in this PR and it'll be part of the next release.

mslarae13 commented 2 months ago

@only1chunts @lschriml Any word from Pelin or the board?

lschriml commented 2 months ago

Not yet.

mslarae13 commented 2 months ago

OK, we're approaching a month since we started this discussion. At what point do we treat the new community feedback, that these term descriptions are backwards and confusing, as the current priority and make the change?

Considering the "living standard" development of MIxS, it doesn't seem invalid to take new feedback and make improvements even if they do not match the original understanding. We do need to make sure that there's VERY good documentation and clarification on the change. As for the INSDC implementation, considering @Woolly-at-EBI was one of the people who submitted the issues about it being backwards, I'd expect we can trust the individual implementations of GSC to manage the update as well.

mslarae13 commented 2 months ago

For what it's worth, I just looked at v5 and v4. v4 doesn't have these terms & v5 has them incorrect as well.

lschriml commented 2 months ago

Let's wrap this up at the next CIG.


mslarae13 commented 1 month ago

Move forward with the change. Capture notes and comments about why and provide VERY clear and well described notes.