NCEAS / eml

Ecological Metadata Language (EML)
https://eml.ecoinformatics.org/
GNU General Public License v2.0

Review units list for duplicates, errors #378

Open amoeba opened 2 years ago

amoeba commented 2 years ago

@earnaud over on https://github.com/ropensci/EML/issues/330 reported seeing a unit with the id molePerKilogram twice in the units list but with slightly different attributes. You can see it's defined twice:

https://github.com/NCEAS/eml/blob/fe77f8f9a34b08bc181857d1ad1240bcd99bead4/eml-unitDictionary.xml#L1623-L1627 https://github.com/NCEAS/eml/blob/fe77f8f9a34b08bc181857d1ad1240bcd99bead4/eml-unitDictionary.xml#L2017-L2020

The first time, it's grouped with <!--amountOfSubstanceWeight--> and the second time it's grouped with <!--amountPerMass-->.

It's also listed twice in the eml-unitTypeDefinitions.xsd file:

https://github.com/NCEAS/eml/blob/fe77f8f9a34b08bc181857d1ad1240bcd99bead4/xsd/eml-unitTypeDefinitions.xsd#L378-L379

@mbjones, @mobb: Does this seem like a mistake to you too?

@earnaud also indicates there were other issues but I haven't figured out what those are just yet.

Once fixed, we need to issue a re-release of EML and emld. emld is where the schema files are shipped.
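To find any other duplicates in the same pass, a small script could scan the dictionary for repeated ids. The following is a minimal sketch, not part of the EML tooling: the inline sample is a simplified, hypothetical stand-in for eml-unitDictionary.xml (the real file is namespaced and carries more attributes), and the function name is made up for illustration. Matching on the local tag name keeps the check independent of the namespace the real file uses.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified stand-in for eml-unitDictionary.xml.
SAMPLE = """<unitList>
  <unit id="molePerKilogram" name="molePerKilogram"
        unitType="amountOfSubstanceWeight" multiplierToSI="1"/>
  <unit id="molePerGram" name="molePerGram"
        unitType="amountOfSubstanceWeight" multiplierToSI="1000"/>
  <unit id="molePerKilogram" name="molePerKilogram" multiplierToSI="1"/>
</unitList>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def find_duplicate_unit_ids(xml_text):
    """Return {id: [attribute dicts]} for every unit id defined more than once."""
    root = ET.fromstring(xml_text)
    seen = defaultdict(list)
    for elem in root.iter():
        if local_name(elem.tag) == "unit" and "id" in elem.attrib:
            seen[elem.attrib["id"]].append(dict(elem.attrib))
    return {uid: defs for uid, defs in seen.items() if len(defs) > 1}


duplicates = find_duplicate_unit_ids(SAMPLE)
for uid, defs in duplicates.items():
    print(uid, "is defined", len(defs), "times")
    for attrs in defs:
        print("   ", attrs)
```

Printing the attribute sets side by side makes it easy to see where two definitions of the same id disagree, as they do for molePerKilogram.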

earnaud commented 2 years ago

Hi, I opened a full issue on https://github.com/ropensci/EML/issues/343 and provided a unit file I worked on to give EML more units. I think I could serialize this table into XML format if required.

mbjones commented 2 years ago

@amoeba Yes, I think the duplication is an issue, and bummer that we missed it. It appears to me there is little functional effect because they are defined the same way (with the exception of the use of the word "micromoles" in the second description). So, I think we should eliminate one of the two, which could go out in a patch release of the spec (e.g., 2.2.1).

@earnaud In terms of completeness, we have never thought the EML spec could define all units needed by researchers, but were striving to provide shared names for the most commonly used units, and a spec that allows additional units to be added as needed. Your ticket on how to get the R EML and emld packages to fold in udunits automatically I think is best handled there, rather than as part of the spec. If there are specific units that you think should be added to the spec in a new release because they are super common, then I think proposing those as feature requests here in the spec repo makes sense. But maybe getting "everything covered by udunits" should be more of a tooling issue rather than a spec issue. What do others think?

mbjones commented 2 years ago

@amoeba I also see that the unitType is blank on a bunch of those "amountPerMass" fields, which is wrong -- it should be set to unitType="amountOfSubstanceWeight". That unitType is poorly named, and I think it would be better named as amountPerMass, but it wasn't, so I'm not sure changing it makes sense now. @mobb, do you have any input on this situation and how we should move forward?
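A check for blank unitType values could be sketched along these lines. This is an illustration under assumptions, not existing tooling: the sample XML is a simplified, hypothetical excerpt (the unit ids in it are placeholders, not a claim about which entries in the real dictionary are affected), and the function name is invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified excerpt; the real dictionary is namespaced.
SAMPLE = """<unitList>
  <unit id="molePerKilogram" unitType="amountOfSubstanceWeight" multiplierToSI="1"/>
  <unit id="exampleUnitA" unitType="" multiplierToSI="0.001"/>
  <unit id="exampleUnitB" multiplierToSI="0.000001"/>
</unitList>"""


def units_missing_unit_type(xml_text):
    """Return ids of unit elements whose unitType is absent or blank."""
    root = ET.fromstring(xml_text)
    missing = []
    for elem in root.iter():
        # Compare on the local name so a namespace prefix does not matter.
        if elem.tag.rsplit("}", 1)[-1] != "unit":
            continue
        if not elem.attrib.get("unitType", "").strip():
            missing.append(elem.attrib.get("id"))
    return missing


missing = units_missing_unit_type(SAMPLE)
print(missing)
```

Treating an absent attribute and an empty string the same way is a deliberate choice here, since either leaves the unit without a link to a dimensional formula.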

mobb commented 2 years ago

I confess I did not spend much time on the unitType field. I found it to be somewhat overloaded, combining features of quantity and dimensionality. Further, unitType did not seem to be widely used.

BTW, EDI has recently spun up a Units Working Group, to address the future for all the content of the (now retired) LTER Unit Registry. In particular, we would like to partner with a larger org dealing with units, and come up with a way to suggest new additions to that system, and to export from it in ways that are compatible with EML. We have just begun examining a group of systems (udunits among them) for certain features. Our WG does not have a web-presence yet - contact me if you're interested in joining this effort.

mbjones commented 2 years ago

Thanks, @mobb. I'm interested, or maybe someone from our group at NCEAS might be.

Regarding unitType, it is a critical field that links the unit to a dimensional formula. For example, for amountOfSubstanceWeight:

https://github.com/NCEAS/eml/blob/fe77f8f9a34b08bc181857d1ad1240bcd99bead4/eml-unitDictionary.xml#L205-L209

Any two units that share the same unitType, or have unitTypes with identical dimensionality, are in fact the same kind of measured quantity, so values can be converted losslessly between them using the multiplierToSI factor. While we generally only annotate with unit values, it is the unitType linkage that allows us to semantically group units and determine if they are from the same dimensional family. So it is used more behind the scenes in inferences about units and driving unit conversions. This is also what would allow us to automate the linkage to other unit vocabularies built from the NIST fundamental dimensions.
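As a rough illustration of that conversion logic, here is a minimal sketch. It assumes a hand-written table of units (the `UNITS` dictionary and the `convert` function are hypothetical, not part of EML or emld), compares only the unitType name rather than the underlying dimensions, and ignores any additive offset that some units, such as temperature scales, would need.

```python
# Illustrative table: unit id -> (unitType, multiplierToSI).
# Hand-written for this sketch, not read from the unit dictionary.
UNITS = {
    "kilogram": ("mass", 1.0),
    "gram": ("mass", 0.001),
    "meter": ("length", 1.0),
    "kilometer": ("length", 1000.0),
}


def convert(value, from_unit, to_unit, units=UNITS):
    """Convert value between two units that share a unitType.

    The value is scaled to the SI unit with the source multiplier,
    then divided by the target multiplier.
    """
    from_type, from_mult = units[from_unit]
    to_type, to_mult = units[to_unit]
    if from_type != to_type:
        raise ValueError(
            f"cannot convert {from_unit} ({from_type}) to {to_unit} ({to_type})"
        )
    return value * from_mult / to_mult


print(convert(2500, "gram", "kilogram"))

try:
    convert(1, "gram", "meter")
except ValueError as err:
    print("rejected:", err)
```

A unit with a blank unitType cannot take part in this kind of check at all, which is why the missing values matter even if the field is rarely used in annotations.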

earnaud commented 2 years ago

Hi @mbjones ,

Indeed, I naively worked on the tables returned by EML::get_unitList() and didn't think to look at how the function actually worked. Therefore, I shall turn my table into XML and review the units list with my user communities to assess which units will be the most useful.