GenomicsStandardsConsortium / mixs

“Minimum Information about any (X) Sequence” (MIxS) specification
https://w3id.org/mixs
Creative Commons Zero v1.0 Universal

request for guidelines on reporting mean and standard deviation #95

Open msweetlove opened 3 years ago

msweetlove commented 3 years ago

A request to specify guidelines (not necessarily new terms required)

MIxS lacks clear guidelines on how to report values as a mean +- standard deviation when the expected input for a term is a single numeric value (e.g. for conduc, temp, pH, ...). I see some users use the "±" sign, but this Unicode character causes issues in software that does not support it (like R, where it is not rendered correctly). Could some guidelines be drafted and added to the MIxS document/website on how to deal with this? One possible syntax is {float};SD{float} (example: 1.665;SD0.004), where ";" separates the mean from the rest and "SD" indicates standard deviation. I prefer ";" to "|" because "|" has another meaning in regular expressions.
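To make the proposed syntax concrete, here is a minimal Python sketch of a parser for {float};SD{float}. It is not part of MIxS: the names `MeanSD` and `parse_mean_sd` are hypothetical, and the accepted float format (optional sign, optional exponent, optional surrounding whitespace) is an assumption, since the proposal does not define it.

```python
import re
from typing import NamedTuple


class MeanSD(NamedTuple):
    """Hypothetical container for a parsed '{float};SD{float}' value."""
    mean: float
    sd: float


# Assumed float grammar: optional sign, digits with optional fraction,
# optional exponent. The proposal itself does not specify this.
_FLOAT = r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?"
_PATTERN = re.compile(rf"^\s*({_FLOAT})\s*;\s*SD\s*({_FLOAT})\s*$")


def parse_mean_sd(value: str) -> MeanSD:
    """Parse a string such as '1.665;SD0.004' into (mean, sd)."""
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"not in '{{float}};SD{{float}}' form: {value!r}")
    mean, sd = float(match.group(1)), float(match.group(2))
    if sd < 0:
        raise ValueError(f"standard deviation must be non-negative: {value!r}")
    return MeanSD(mean, sd)


if __name__ == "__main__":
    print(parse_mean_sd("1.665;SD0.004"))
```

Even this small case needs a dedicated regular expression and error handling, which is the custom-parser cost that any compound syntax brings with it.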

pbuttigieg commented 3 years ago

I vote against complicating the syntax any more.

MIxS in general has defined its syntax in a non-standard way that requires custom parsers, which is bad practice.

There should be dedicated fields for, e.g., error and units (if not restricted), so that there's a single value (numeric, boolean, categorical, URI, etc.) in each field.

Further, if we ask for mean/sd, we must include guidance that notes that the (near) normality of the signal distribution has been confirmed. It's misleading and utter nonsense to include these values if not.

We should consider replacing all of these with generic fields for location and spread, with an additional field for the submitter to indicate what those were (means, median, mode, range, sd, etc).

If that's too burdensome, then we should come up with generic guidance on what kind of estimate is valid in these fields, but it has to be far more robust than an arithmetic mean (median and range is better).
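As an illustration of the dedicated-fields approach suggested above, here is a minimal Python sketch in which every field holds a single value. Nothing here is an existing MIxS term: the class names, the enum members, and the `_location_kind`, `_spread`, `_spread_kind` and `_unit` column suffixes are all hypothetical, and spread is assumed to be one non-negative number (for a range, its width).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Union


class LocationKind(Enum):
    """Hypothetical vocabulary for the reported location estimate."""
    MEAN = "mean"
    MEDIAN = "median"
    MODE = "mode"


class SpreadKind(Enum):
    """Hypothetical vocabulary for the reported spread estimate."""
    SD = "sd"
    RANGE = "range"  # assumed here to be a single width, not a min/max pair


@dataclass(frozen=True)
class Measurement:
    location: float
    location_kind: LocationKind
    spread: Optional[float] = None
    spread_kind: Optional[SpreadKind] = None
    unit: Optional[str] = None

    def __post_init__(self) -> None:
        # A spread value is only meaningful if its kind is stated, and vice versa.
        if (self.spread is None) != (self.spread_kind is None):
            raise ValueError("spread and spread_kind must be given together")
        if self.spread is not None and self.spread < 0:
            raise ValueError("spread must be non-negative")

    def to_columns(self, term: str) -> Dict[str, Union[float, str, None]]:
        """Flatten into one single-valued column per field (suffixes are hypothetical)."""
        return {
            term: self.location,
            f"{term}_location_kind": self.location_kind.value,
            f"{term}_spread": self.spread,
            f"{term}_spread_kind": self.spread_kind.value if self.spread_kind else None,
            f"{term}_unit": self.unit,
        }


if __name__ == "__main__":
    m = Measurement(1.665, LocationKind.MEAN, 0.004, SpreadKind.SD)
    print(m.to_columns("conduc"))
```

With this layout a consumer reads each column with a standard numeric or categorical parser, and the submitter states explicitly which estimate was used instead of encoding it in the value string.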