pure phases (and functions, by proxy)

flotang-gtt commented 1 week ago

I've been finally able to write up some suggestions. Feel free to challenge my decisions!

My suggestion encompasses:

'vendor' specific functions should be as easily as possible copied from existing databases. That means that we get TDD and ChemSage.dat specific function nodes. I would suggest these as transitive elements, to ease migration. My idea would be that the format should be able to ingress this type of function nodes, but at least in e.g. my implementation, I would only support the h-s-cp format for output. I think this decision will be fruitful both in situations for manual conversion/manipulation of databases, as well as testing and general acceptance (e.g. if one can demonstratively "show" that the function blocks are convertable in that regard, this is great for integrity.
I removed the 'cmpXX' id type for system components (and will like to follow up on this with some of the other IDs, depending on your feedback). I think the validation of these IDs can be realized identically, given with the changes that I implemented. I simply don't see the advantage of separating the name and ID of a system component, as I can't see a scenario where this would be warranted.
Had a look at how pycalphad employs the ascii<->AST conversion of TDBs, and especially for output. As long as there is no general support of e.g. SymEngine's style (and I also do not know if this allows lossless roundtrips?), I am hard pressed how to find a good-enough and fairly-straight conversion of the polynoms. MathML is basically unreadable, even though it is fully able to express the types of functions I encountered. XML-internal validation is likely out of the window anyways. I settled on the h-s-cp version, which goes somewhere in between everything - but it allows much less nesting or more complex expressions (e.g. A*B*T). Similar to how the Fact universe works, it also only allows functions to be referenced as linear factors in any term, not e.g. as exponent or argument to symbolic expressions. Very open to suggestions!
Does anyone see an issue with using <energy> for the expression? It feels slightly off to me. Let me know if you have better ideas.

Known issues (not mainline to this PR, feel free to add issues or spin off branches to discuss)

String validations for names etc are still a bit all over the place.
Function ID needs to be properly set to IDREF type for validation.
Punting the whitespace formatting and more detailed documentation for now. Will try to fix this before merging.
The template hierarchy and especially the places where 'type' definitions are kept will need to be overhauled.
It may be needed to revisit the "pure phase" denomination, given the fuzzy edges of some definitions.

richardotis commented 1 week ago

(I edited your post to number your points to make it easier to reply, hope that's alright.)

I started reading the code before the post, so I was ready to push back heavily on this, but as an optional tool for migration, I think I'm overall neutral on the idea. In principle PyCalphad or another tool will be able to perform any format conversions that would be needed without direct embedding in the XML, but I do understand that a degree of copy-and-paste compatibility would improve the experience for database developers. I can also see the value for rapidly constructing heterogeneous databases incorporating information from both ChemSage and TDB formats. Also, implementations of this could have the capability to export/write-out all data in hscp format.
Agreed
I think this is ok. Math strings are nice to write but you can end up with unbounded complexity. For hscp you might consider enumerating in the schema all of the allowed coefficient types (every power of T allowed, natural log, etc). This would provide some assurance to implementations that they haven't missed anything, and also provides a common touch point for proposals to add new functional forms in the future.
In principle I think this format should extend to more properties than just the energy. I might suggest something like <property name='energy'></property>

I think <range> should have an attribute that says it's a range over temperature. You could make it the only allowed attribute in the schema, but it would give a path to extending it in the future.

flotang-gtt commented 1 week ago

Thanks for the feedback.

Yes, I was thinking about that. I'll be doing that, after all, the terms to enumerate aren't that many, and explicitly listing them is going to help with validation.
I like your suggestion. I definitely wanted to include the already implemented potentials, simply forgot about it. I'll provide some update in that direction.

johanzietsman-em commented 5 days ago

Agreed
I added the "cmp" prefix to, for example, distinguish system component Fe (cmpFe) from substance Fe (sbsFe) of phase constituent Fe (pcnFe). If we want to use XML Schema's validation strengths, we can use ID and IDREF types like this. It does come at the expense of using unique ids, though. When referning with IDREF "cmpFe" rather than "Fe", it will be clear that the reference refers to a system component. (We have been using sc as our in-house shorthand for system component, which is perhaps slightly more compact. I decided to use a 3-letter prefix allow more flexibility in developing the schema.) The principle behind this may be "explicit is better than implicit". Also, Thermo-Calc tends to use all caps for element symbols. Separating the ID from the symbol allows keeping IDs unchanged but changing names according to one's preferences for display purposes. The principle here is perhaps "separation of concerns". "id" is used for identification, while "symbol" is used for display purposes.
No comment
No comment

flotang-gtt commented 4 days ago

OK, I have some contributions.

I'll leave this as is for now. Thanks for your comments.
Johan, I like your arguments. However, wouldn't the addition of an e.g. 'alias' attribute (which I like a lot) help? That would manage the TDB "Exemption" of allcaps just fine, I think. I am not completely sure I can follow your example of validation. Maybe it's a thing of our individual toolchains, but at least using the XML Schema extension of VSCode, the 'Fe' would be validated against the existing names=".." tags of SystemComponents. Yes, that leaves a bit of potential misinterpretation open with respect to the 'reader' assuming whether it's a system component or .... and here I can't see what it would be confused with. It is validated schematically against that list. Am I missing something?
The more I think about the use, the more i am convinced that this is adequate. It is much simpler if e.g. you want to have a function that is composed of other functions that we should/could switch to another function type. Some simple types like this suggestion may be a good start, and we can expand later on.
The disadvantage to the extensible <property type="XY" >...* structure is that there's no easy way to seperate the potential child elements, meaning that e.g. all different property type elements are orchestrated in the same place. I am not sure how much of an issue that is, as it strongly hinges on the number of property elements that are going to be added. Maybe having less layers of indirection is even helpful.

johanzietsman-em commented 4 days ago

Let's say you add a system component with symbol "Fe" and symbol is of type "xs:ID". Next, let's assume that we create a liquid alloy phase, which has a phase constituent with formula "Fe" and that formula is of type "xs:ID". The XSD validation system will complain that you have items with duplicate identifiers. This is why I would recommend using unique identifiers. In this case we can use "scFe" for the system component, and "pcFe" for the phase constituent. (The phase constituents could suffer another uniqueness problem, since there may be other phases with Fe as constituent. I will not go into that further here.) I hope this makes it clear. I do believe the price of using explicit identifiers will be much less than the benefit of the full validation power of XSD.

It is easy to argue against following a strong set of principles initially in the development process. The value only becomes clear at scale. It is also very difficult, however, to start implementing a strong set of principles a long way down the road, since the retrofitting cost could be very high.

flotang-gtt commented 4 days ago

As I understood it, you are rightfully pointing out that, xs:id enforces global uniqueness. This is, as your example clearly shows, not desired. But if we were limiting it to the SystemComponentIDType type, the validation against that list is "unique", as long as we (which is not done, yet) assume uniqueness of SystemComponent elements - which I think can be done using the xs:unique element.

Enforcing 'type safety' that no user confuses a phase constituent and a system component should therefore be covered, as I don't think any value would have to allow both a SystemComponentIDType and a PhaseConstituentNameType. For each of these lists of "unique" (as far as that type encompasses) elements, no ambiguity exists.

Since the xs:IDREF type can only refer to xs:id, this necessitates to use xs:keyref - at least from my reading on this, there does not seem to be any downside? Am I overlooking something?

johanzietsman-em commented 3 days ago

Using xs:unique and xs:keyref, or perhaps rather xs:key and xs:keyref sounds like a nice, elegant solution. I have never used this before, and wondered why.

I had a look at the XML Schema specifications, and it appears that the above-mentioned data types were added in version 1.1 of the specification. I have not used that version before, since several of the mainstream XML libraries did not implement version 1.1 when I last had a look. lxml is one example.

I am keen to support the recommended xs:key+xs:keyref solution, if we are confident that the tooling for using it is sufficiently available. I have not done a recent survey of this. A quick search about XML Schema 1.1 in vscode yielded this result, which is a bit concerning.

richardotis commented 3 days ago

For Python at least, a supported solution for XML Schema 1.1 appears to be https://github.com/sissaschool/xmlschema

bocklund commented 3 days ago

Sorry I'm a little late to getting feedback in.

I agree with Richard's sentiment that the software tools should be the medium to convert from legacy database formats to the XML format. I'll push a little harder against having direct support for legacy formats in the XML databases for two main reasons. First, it forces each software tool that implements the XML to add support for the various flavors of legacy formats in the new format - this is what we're trying to get away from and would limit the adoption and compatibility, and continue to breed some of the issues in the legacy formats. Second, my concern is that having legacy support in the new format means that the legacy formats will never be obsolete because users will just keep copy and pasting the legacy features around in XML files.
I agree with the discussion to remove any disambiguating prefixes. In addition to the points from @flotang-gtt, I also think that encoding the type of a thing (a system component or a phase name) in the name of that thing should be discouraged as a guiding principle.

In my opinion, one of the mistakes of the legacy formats is to enforce correctness though parsing and validating the values of strings. For example determining how many sublattices a PHASE entry has in the TDB has by parsing the number, and then parsing the sublattice multiplicities themselves, rather than something declarative like just giving a list of sublattice multiplicities (where the number of sublattices is implied from the length of the list). Similar issue in the ChemSage DATs where you write explicitly the number of Gibbs energy equations that follow (and many other examples).
Unfortunately, I don't think there's a good enough cross-language support for anything much more structured and lightweight than expression strings. I lean more toward being permissive here, mostly out of necessity for the different capabilities of the various implementations regarding substitutions of other expressions, more complex expressions A*B*T, or use of different, or custom state variables.

Some specific feedback:
- I am a little hesitant to have function_of in the expressions, at least in a way that implies that each expression can only be a function of one thing (I think we talked about R*T*LN(P) before)
- I think it's okay, and I would encourage there to be structured data entry types like h-s-cp alongside integrated polynomial expressions. However, while I understand why something like <term coeff="80.0119918" fun="T^0" /> makes sense for particular implementations, I don't think I would recommend going this way for the user-facing schema. To me, this looks like the result of an intermediate representation that specific implementations would make on a user-facing expression 80.0119918*T^0 + -3546683.99888*T^(-2) + -240.275998928*T^(-0.5) + 491568369.44*T^(-3).
If I'm understanding correctly, I think we should prefer <property type="G"> to <property type="free-energy">. Specifically, I think we should keep the types as close to how they are expressed in papers and equations as possible (some chicken-and-egg effect here maybe). It should be extensible to new types, as there are many in the existing TDB ecosystem G, BMAG, TC, NT, GEIN, V0, VA, MQ, THETA, etc.. This is also how we chose to handle the different MQMQA parameters, as MQMG, MQMX, MQMZ and we called excess parameters for QKTO models QKT (I'm not saying we necessarily need to go with these particular names, I just mean to illustrate that I think it's sensible to give things names).

My sense is that the wider audience of Calphad software consumers and stakeholders are becoming increasingly interested in the availability of property models, so I think having this hook for extensibility going forward would be a good choice.

flotang-gtt / ThermML

pure phases (and functions, by proxy) #7