INCATools / ontology-access-kit

Ontology Access Kit: A python library and command line application for working with ontologies
https://incatools.github.io/ontology-access-kit/
Apache License 2.0

KGCL yaml statistics output should be deterministic #529

Open matentzn opened 1 year ago

matentzn commented 1 year ago

Running something like

runoak --stacktrace -i simpleobo:ontologies/maxo_2022-06-24.obo diff -X simpleobo:ontologies/maxo_2023-03-09.obo --statistics -o stats/maxo_diff.txt.yaml

results in non-deterministic serialisation of stats/maxo_diff.txt.yaml.


cmungall commented 1 year ago

We should never call a bare yaml.dump(obj); always specify sort_keys=False.
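For reference, a minimal sketch of what the sort_keys argument changes, assuming PyYAML 5.1 or later (where sort_keys is available on yaml.dump). The stats dictionary is a made-up stand-in, not real runoak output.

```python
import yaml

# Hypothetical statistics mapping, inserted in a non-alphabetical order.
stats = {"b": 1, "a": 2}

# By default PyYAML sorts mapping keys alphabetically.
sorted_out = yaml.dump(stats)

# With sort_keys=False the dict's insertion order is preserved instead.
insertion_out = yaml.dump(stats, sort_keys=False)

print(sorted_out)     # keys appear as a, b
print(insertion_out)  # keys appear as b, a
```

Note that sort_keys only affects mapping keys; it does not reorder the items of a list.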

cmungall commented 1 year ago

It turns out that sort_keys=False doesn't fix this. On reflection that is not unexpected: the output is a key-value list whose ordering is not defined by the main schema.

If we want a canonical ordering, we have to define it. We could explicitly add ranks to the schema, but that makes the schema hard to evolve (e.g. when inserting a new type). I suggest something simple, like a depth-first pre-order traversal of the KGCL is-a hierarchy, with alphabetical sorting of siblings.
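A rough sketch of that suggestion, under stated assumptions: the IS_A table below is a toy child-to-parent map with illustrative type names, not read from the actual KGCL schema, and preorder_ranks and canonical_order are hypothetical helpers, not part of OAK.

```python
from typing import Dict, List, Optional

# Toy is-a hierarchy (child -> parent, None for a root). Illustrative only;
# a real implementation would derive this from the KGCL schema.
IS_A: Dict[str, Optional[str]] = {
    "Change": None,
    "NodeChange": "Change",
    "EdgeChange": "Change",
    "NodeRename": "NodeChange",
    "NodeCreation": "NodeChange",
    "EdgeDeletion": "EdgeChange",
}


def preorder_ranks(is_a: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Rank each type by depth-first pre-order, siblings sorted alphabetically."""
    children: Dict[str, List[str]] = {}
    roots: List[str] = []
    for child, parent in is_a.items():
        if parent is None:
            roots.append(child)
        else:
            children.setdefault(parent, []).append(child)

    order: List[str] = []

    def visit(node: str) -> None:
        order.append(node)  # pre-order: parent before its children
        for sub in sorted(children.get(node, [])):
            visit(sub)

    for root in sorted(roots):
        visit(root)
    return {name: rank for rank, name in enumerate(order)}


def canonical_order(stats: Dict[str, int],
                    is_a: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Return stats re-keyed in canonical order; unknown types go last, by name."""
    ranks = preorder_ranks(is_a)
    unknown = len(ranks)
    return dict(sorted(stats.items(),
                       key=lambda kv: (ranks.get(kv[0], unknown), kv[0])))


ordered = canonical_order(
    {"NodeRename": 3, "EdgeDeletion": 1, "NodeCreation": 2}, IS_A
)
print(list(ordered))
```

Because the ranks come from the hierarchy rather than from hand-assigned numbers, inserting a new type only requires declaring its parent; the result can then be written with sort_keys=False so the serializer keeps this order.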