flotang-gtt / ThermML


Requirements for identifier names #2

Open flotang-gtt opened 1 month ago

flotang-gtt commented 1 month ago

Some of the database files have different requirements for identifiers such as phase names, e.g. limits on string length or allowed/forbidden characters.

While generally I am against most of these restrictions and think that implementing software should maintain these requirements internally, some requirements may be reasonable to incorporate into the schema.

A potential restriction that I think could be valuable at the schema level is the uniqueness of identifiers. For example, in ChemSage.dat, phase constituents (= endmembers) should not have identical names. This may be left to the database developers, but in my opinion, requiring it at the schema level has benefits.

johanzietsman-em commented 1 month ago

In the schema content that I added so far, I make provision for

  1. an XML identifier (id) that is forced to be unique by the schema; and
  2. a name/symbol for display purposes and other uses, like for CALPHAD software.

The latter currently has no restrictions, which means we can run into an issue of duplicated names. I suspect that we cannot achieve unique names in an XML file with the xs:ID datatype, because it is most likely too restrictive in the character set that it allows.

I would recommend that we exclude some characters from valid phase/constituent/system component names. Two examples are double quotes (") and single quotes ('). These can create problems when doing string processing in some languages, and when exporting to formats such as JSON and YAML.
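
A minimal sketch of what such a check could look like on the software side, assuming that only the two quote characters are excluded. The names `FORBIDDEN_CHARS` and `find_forbidden_chars` are hypothetical and not part of any existing ThermML code:

```python
# Assumption: only the two quote characters named above are forbidden.
FORBIDDEN_CHARS = {'"', "'"}


def find_forbidden_chars(name):
    """Return the set of forbidden characters that occur in `name`.

    An empty set means the name passes this particular check.
    """
    return FORBIDDEN_CHARS.intersection(name)
```

For example, `find_forbidden_chars('FCC_A1')` returns an empty set, while a name containing a double quote returns a set holding that character. A schema-level rule would presumably express the same restriction as a pattern on the name type instead.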

richardotis commented 2 weeks ago

I'll refer to Johan's identifier types mentioned above as "Type I" and "Type II" identifiers. To start, I would propose that a Type II phase identifier should follow the same rules as Python 2 names, particularly that they follow this regex: ^[^\d\W]\w*\Z

Longer term I would like to see support for Python 3 names, i.e., full Unicode, but I anticipate several implementation challenges with this, and it's easier to start with something more restrictive and relax it later than the opposite.
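
As a rough illustration of the regex above, here is a small Python 3 sketch. The function name `is_valid_type_ii_name` is hypothetical. The `re.ASCII` flag is an assumption added here to approximate the Python 2 behaviour: without it, `\w` and `\d` in Python 3 also match non-ASCII letters and digits.

```python
import re

# The pattern quoted above. re.ASCII (an assumption, not part of the proposal)
# restricts \w and \d to ASCII so that the check mimics Python 2 names.
TYPE_II_NAME = re.compile(r"^[^\d\W]\w*\Z", re.ASCII)


def is_valid_type_ii_name(name):
    """Return True if `name` is a letter or underscore followed only by
    letters, digits, or underscores."""
    return TYPE_II_NAME.match(name) is not None
```

Under this sketch, names like `FCC_A1` or `_liquid` are accepted, while `2PHASE` (leading digit), `BCC-B2` (hyphen), and the empty string are rejected. Dropping `re.ASCII` would be one way to relax the rule towards Unicode names later.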

For forbidden names, I would agree with Florian that there should be logical restrictions on Type II identifiers too, such as the names being unique within a database, and not matching an element, species, or other constituent.
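
The logical restrictions could be checked along these lines. This is only a sketch: `check_phase_names` and its arguments are hypothetical, and it assumes a case-sensitive comparison, which the discussion has not settled.

```python
from collections import Counter


def check_phase_names(phase_names, reserved_names):
    """Return (duplicates, clashes) for a collection of phase names.

    phase_names    -- Type II phase identifiers found in one database
    reserved_names -- names of elements, species, and other constituents
                      (hypothetical input; comparison is case-sensitive)
    """
    counts = Counter(phase_names)
    # Names that appear more than once within the database.
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    # Phase names that coincide with an element/species/constituent name.
    clashes = sorted(set(phase_names) & set(reserved_names))
    return duplicates, clashes
```

A database would pass these checks when both returned lists are empty.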

For Type I identifiers, this is starting to get close to the idea of a centralized table of phase prototypes. I'm not sure if that would be in scope for this, but I'll say I'm at least notionally open to the idea. You would need extremely strong versioning for such a table, and an easy way to map the evolution of an identifier over time, so that you could correlate a structure known to be "Kind A" in Year N, but then later argued to be "Kind B" in Year N+1. You can see how this would very quickly become a ton of work (Bengt Hallstedt's binary collection contains such a phase table, and it has over 4,000 entries), but in principle it's possible. The benefits of such a scheme would be substantial for the use case of combining literature assessments.