flotang-gtt opened 1 week ago
(I edited your post to number your points to make it easier to reply, hope that's alright.)
> `<property name='energy'></property>`

I think `<range>` should have an attribute that says it is a range over temperature. You could make it the only allowed attribute in the schema for now, but it would give a path to extending it in the future.
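A minimal sketch of what that could look like (the `over`, `low`, and `high` attribute names here are just assumptions for illustration, not part of the schema):

```xml
<!-- Hypothetical sketch: "over", "low", "high" are illustrative attribute names -->
<property name="energy">
  <range over="temperature" low="298.15" high="6000">
    <!-- expression valid on this temperature interval -->
  </range>
</property>
```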
Thanks for the feedback.
OK, I have some contributions.
One downside of the `<property type="XY">…` structure is that there is no easy way to separate the potential child elements, meaning that e.g. all different property-type elements are handled in the same place. I am not sure how much of an issue that is, as it strongly hinges on the number of property elements that are going to be added. Maybe having fewer layers of indirection is even helpful.

It is easy to argue against following a strong set of principles early in the development process; the value only becomes clear at scale. It is also very difficult, however, to start implementing a strong set of principles a long way down the road, since the retrofitting cost could be very high.
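To spell out the trade-off, here are both shapes side by side (all element names below are hypothetical, just to illustrate the structural difference):

```xml
<!-- Generic: all property types share one element and one content model -->
<property type="heat-capacity">...</property>
<property type="molar-volume">...</property>

<!-- Per-type: each property gets its own element, so each child content
     model can be constrained separately in the schema -->
<heat-capacity>...</heat-capacity>
<molar-volume>...</molar-volume>
```

The generic form keeps the schema small; the per-type form lets validation differ per property, at the cost of a schema change for every new property type.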
As I understood it, you are rightfully pointing out that `xs:ID` enforces global uniqueness. This is, as your example clearly shows, not desired. But if we limited it to the `SystemComponentIDType` type, validation against that list would be "unique", as long as we assume uniqueness of `SystemComponent` elements (which is not enforced yet), which I think can be done using the `xs:unique` element.
Enforcing "type safety", i.e. that no user confuses a phase constituent with a system component, should therefore be covered, as I don't think any value would have to allow both a `SystemComponentIDType` and a `PhaseConstituentNameType`. Within each of these lists of "unique" (as far as that type encompasses) elements, no ambiguity exists.
Since the `xs:IDREF` type can only refer to `xs:ID`, this necessitates using `xs:keyref`. At least from my reading on this, there does not seem to be any downside. Am I overlooking something?
Using `xs:unique` and `xs:keyref`, or perhaps rather `xs:key` and `xs:keyref`, sounds like a nice, elegant solution. I had never used these before and wondered why.
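For concreteness, a sketch of what the `xs:key` + `xs:keyref` pairing could look like (the element and attribute names here are assumptions, not the actual schema):

```xml
<!-- Hypothetical sketch: identity constraints declared on the root element -->
<xs:element name="Database">
  <xs:complexType>
    <!-- ... SystemComponent and Phase content model ... -->
  </xs:complexType>
  <!-- Each SystemComponent name must be unique within this document -->
  <xs:key name="SystemComponentKey">
    <xs:selector xpath=".//SystemComponent"/>
    <xs:field xpath="@name"/>
  </xs:key>
  <!-- References must resolve to a declared SystemComponent, and only to those -->
  <xs:keyref name="SystemComponentRef" refer="SystemComponentKey">
    <xs:selector xpath=".//Constituent"/>
    <xs:field xpath="@component"/>
  </xs:keyref>
</xs:element>
```

Unlike `xs:ID`/`xs:IDREF`, the key is scoped to the element it is declared on, and separate key/keyref pairs would keep system components and phase constituents in distinct reference spaces.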
I had a look at the XML Schema specifications, and it appears that the above-mentioned constructs were added in version 1.1 of the specification. I have not used that version before, since several of the mainstream XML libraries did not implement version 1.1 when I last checked; lxml is one example.
I am keen to support the recommended `xs:key` + `xs:keyref` solution, if we are confident that the tooling for it is sufficiently available. I have not done a recent survey of this. A quick search for XML Schema 1.1 support in VS Code yielded this result, which is a bit concerning.
For Python at least, a supported solution for XML Schema 1.1 appears to be https://github.com/sissaschool/xmlschema
Sorry I'm a little late to getting feedback in.
I agree with Richard's sentiment that the software tools should be the medium for converting from legacy database formats to the XML format. I'll push a little harder against having direct support for legacy formats in the XML databases, for two main reasons. First, it forces each software tool that implements the XML format to add support for the various flavors of legacy formats inside the new format; this is exactly what we're trying to get away from, it would limit adoption and compatibility, and it would continue to breed some of the issues in the legacy formats. Second, my concern is that having legacy support in the new format means the legacy formats will never become obsolete, because users will just keep copying and pasting the legacy features around in XML files.
I agree with the discussion about removing disambiguating prefixes. In addition to the points from @flotang-gtt, I also think that encoding the type of a thing (a system component or a phase name) in the name of that thing should be discouraged as a guiding principle.
In my opinion, one of the mistakes of the legacy formats is enforcing correctness through parsing and validating the values of strings. For example, determining how many sublattices a `PHASE` entry has in a TDB requires parsing the declared count and then parsing the sublattice multiplicities themselves, rather than something declarative like simply giving a list of sublattice multiplicities (where the number of sublattices is implied by the length of the list). There is a similar issue in the ChemSage DATs, where you explicitly write the number of Gibbs energy equations that follow (and many other examples).
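To illustrate the declarative alternative: in a TDB line such as `PHASE FCC_A1 % 2 1 3 !`, the leading `2` must be parsed and checked against the two multiplicities that follow it. In XML the count could simply be implied (element names below are hypothetical):

```xml
<!-- Hypothetical sketch: the number of sublattices is the length of the list,
     so there is no redundant count to parse and validate -->
<phase name="FCC_A1">
  <sublattice multiplicity="1"/>
  <sublattice multiplicity="3"/>
</phase>
```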
Unfortunately, I don't think there is good enough cross-language support for anything much more structured and lightweight than expression strings. I lean toward being permissive here, mostly out of necessity given the different capabilities of the various implementations regarding substitution of other expressions, more complex expressions like `A*B*T`, or the use of different or custom state variables.
Some specific feedback:
- `function_of` in the expressions, at least in a way that implies that each expression can only be a function of one thing (I think we talked about `R*T*LN(P)` before)
- `h-s-cp` alongside integrated polynomial expressions. However, while I understand why something like `<term coeff="80.0119918" fun="T^0" />` makes sense for particular implementations, I don't think I would recommend going this way for the user-facing schema. To me, this looks like the result of an intermediate representation that specific implementations would build from a user-facing expression like `80.0119918*T^0 + -3546683.99888*T^(-2) + -240.275998928*T^(-0.5) + 491568369.44*T^(-3)`.
- If I'm understanding correctly, I think we should prefer `<property type="G">` to `<property type="free-energy">`. Specifically, I think we should keep the types as close to how they are expressed in papers and equations as possible (some chicken-and-egg effect here, maybe). It should be extensible to new types, as there are many in the existing TDB ecosystem: `G`, `BMAG`, `TC`, `NT`, `GEIN`, `V0`, `VA`, `MQ`, `THETA`, etc. This is also how we chose to handle the different MQMQA parameters, as `MQMG`, `MQMX`, `MQMZ`, and we called excess parameters for QKTO models `QKT` (I'm not saying we necessarily need to go with these particular names, I just mean to illustrate that I think it's sensible to give things names).
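To make the term-based versus expression-string contrast above concrete, here is the same content in both forms (element names are illustrative, and the coefficients are the ones quoted above):

```xml
<!-- Term-based, intermediate-representation style -->
<expression>
  <term coeff="80.0119918" fun="T^0"/>
  <term coeff="-3546683.99888" fun="T^(-2)"/>
  <term coeff="-240.275998928" fun="T^(-0.5)"/>
  <term coeff="491568369.44" fun="T^(-3)"/>
</expression>

<!-- User-facing, expression-string style -->
<expression>80.0119918*T^0 + -3546683.99888*T^(-2) + -240.275998928*T^(-0.5) + 491568369.44*T^(-3)</expression>
```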
My sense is that the wider audience of Calphad software consumers and stakeholders is becoming increasingly interested in the availability of property models, so I think having this hook for extensibility going forward would be a good choice.
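One way such an extensibility hook could be expressed in the schema (a sketch only; whether to enumerate anything at all is an open design choice) is an open union: known type names are listed for documentation and tooling, while any other token remains valid:

```xml
<!-- Hypothetical sketch: "type" accepts the enumerated names or any other token -->
<xs:attribute name="type">
  <xs:simpleType>
    <xs:union>
      <xs:simpleType>
        <xs:restriction base="xs:token">
          <xs:enumeration value="G"/>
          <xs:enumeration value="TC"/>
          <xs:enumeration value="BMAG"/>
          <!-- ... further known property types ... -->
        </xs:restriction>
      </xs:simpleType>
      <xs:simpleType>
        <xs:restriction base="xs:token"/>
      </xs:simpleType>
    </xs:union>
  </xs:simpleType>
</xs:attribute>
```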
I've finally been able to write up some suggestions. Feel free to challenge my decisions!
My suggestion encompasses:
'Vendor'-specific functions should be as easy as possible to copy from existing databases. That means we get TDB- and ChemSage.dat-specific function nodes. I would suggest these as transitional elements, to ease migration. My idea is that the format should be able to ingest this type of function node, but at least in e.g. my implementation, I would only support the h-s-cp format for output. I think this decision will be fruitful both for manual conversion/manipulation of databases and for testing and general acceptance (e.g. if one can demonstrably "show" that the function blocks are convertible in that regard, this is great for integrity).
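A sketch of how such a transitional vendor node might look, carrying the legacy payload verbatim (the element names and the truncated TDB expression are illustrative assumptions, not a proposed final design):

```xml
<!-- Hypothetical sketch: legacy TDB payload kept verbatim for migration;
     an implementation would ingest this but emit only the h-s-cp form -->
<function name="GHSERAL">
  <tdb-function>298.15 -7976.15+137.093038*T-24.3671976*T*LN(T); 700 Y ... !</tdb-function>
</function>
```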
I removed the 'cmpXX' id type for system components (and would like to follow up on this for some of the other IDs, depending on your feedback). I think the validation of these IDs can be realized identically with the changes I implemented. I simply don't see the advantage of separating the name and ID of a system component, as I can't think of a scenario where that would be warranted.
I had a look at how pycalphad handles the ASCII<->AST conversion of TDBs, especially for output. As long as there is no general support for e.g. SymEngine's style (and I also do not know whether that allows lossless round trips?), I am hard pressed to find a good-enough and fairly straightforward representation of the polynomials. MathML is basically unreadable, even though it is fully able to express the types of functions I encountered; XML-internal validation is likely off the table anyway. I settled on the h-s-cp version, which sits somewhere in between, but it allows much less nesting and no more complex expressions (e.g. `A*B*T`). Similar to how the Fact universe works, it also only allows functions to be referenced as linear factors in a term, not e.g. as an exponent or as an argument to symbolic expressions. Very open to suggestions!

Does anyone see an issue with using `<energy>` for the expression? It feels slightly off to me. Let me know if you have better ideas.

Known issues (not mainline to this PR, feel free to add issues or spin off branches to discuss)