TUCAN-nest / TUCAN

A molecular identifier and descriptor for all domains of chemistry.
https://tucan-nest.github.io
GNU General Public License v3.0
22 stars 5 forks source link

Parser for TUCAN strings using a formal grammar #73

Closed flange-ipb closed 1 year ago

flange-ipb commented 1 year ago

Implement a parser for TUCAN strings which generates the corresponding NetworkX Graph. The parser should be generated from a grammar file using ANTLR4.

API changes:

Dependency changes

TUCAN format: See discussion below.

Tasks:

rapodaca commented 1 year ago

May I suggest writing a formal grammar as part of this issue or as a prelude, and to use a grammar language that is widely-understood such as W3C EBNF? This will aid those who might want to implement a parser in a different language and/or with different tooling. It also will make the grammar accessible to a wider audience for review and comment.

For details and ideas on how to implement, see this article. That link also contains information relevant to the sum formula construct.

schatzsc commented 1 year ago

That's definitely a good idea. The format you use in the "Grammar" section of the linked article - is that Rust-specific or a rather generalized variant of EBNF?

schatzsc commented 1 year ago

BTW, in addition to the node-associated property name/value pairs, we will also need fragment and possibly full graph-based attribute assignment. In particular the fragment (= connected group of node/atoms) property assignment will be required for handling charges.

rapodaca commented 1 year ago

It's W3C EBNF, which is technology-agnostic. The best-known language to use it in its specification is XML (thus the name).

There are a lot of parser generator packages, all of which use custom grammar languages. A grammar is the price of admission to using any of them, so creating a grammar introduces no new work.

But the format matters, especially in the long term. If the grammar is LL(1) (Balsa and InChI are, for example) then EBNF is a solid approach to producing a technology-agnostic specification.

However, certain parser generators don't necessarily read W3C EBNF directly. The one I included in the InChI Grammar package does, but there are other.

Now a possibly unpopular opinion - parser generators are overrated for LL(1) languages and will eventually become technical debt. Writing a recursive descent parser offers much more debuggability and maintainability, and far greater freedom than having to deal with auto-generated code.

Edit: the one exception a parser generated during language evolution. The InChI Grammar project uses a parser generator as part of a test suite to prevent regressions from being introduced as new rules are added. The intent is identical to unit tests.

I outline the technique here and refine it here. There's a lot of abstract theory behind this stuff, but most of it can be disregarded when building an LL(1) parser. That's what I try to do in those articles.

schatzsc commented 1 year ago

Turns out the "BNF playground" is really helpful for quick testing:

https://bnfplayground.pauliankline.com

but other than that, will need quite a while to digest this.

rapodaca commented 1 year ago

Yes, it's quite a fire hydrant of terms and concepts. BNF playground unfortunately doesn't accept EBNF. But EBNF Test does. Unfortunately, the latter in my hands chokes on certain grammars/inputs. But it's a great tool for figuring out EBNF. I've used it a lot.

schatzsc commented 1 year ago

I think I understand most of the syntax now but what's the purpose of these "control characters"(?!?!?):

"." ";"

as in

formula ::= "/" elemental ( "." elemental )*

rapodaca commented 1 year ago

Those are terminals, aka character literal.

A formula is separated from what preceded it by a forward slash. InChI allows "dot disconnection" of formula, which is the purpose of the ".". In later layers, disconnected entities are separated by a semicolon.

Some examples appear in strings.txt.

schatzsc commented 1 year ago

Ah, forgot about that since in TUCAN sum formula, this is not allowed - for example nickel(II) chloride with six "water of crystallization" which one would naively write as NiCl2 . 6 H2O although it really is [Ni(H2O)6]Cl2 would always be written as H12Cl2NiO6, as for example also ChemCalc does: https://www.chemcalc.org/?ionizations=&mf=NiCl2%28H2O%296

schatzsc commented 1 year ago

I think I got it working pretty well now - with one exception: How do I disallow an empty formula? Because that's something that must be avoided, that there is a "bond block" without a sum formula. Problem is the a? always allows also nothing to be present ...

String ::= version "/" formula version ::= "TUCAN=10" formula ::= c? h? other_elements other_elements ::= ar? as? at? he? o? zn? zr? c ::= "C" count? h ::= "H" count? he ::= "He" count? ar ::= "Ar" count? as ::= "As" count? at ::= "At" count? o ::= "O" count? zn ::= "Zn" count? zr ::= "Zr" count? count ::= nonzero digit* digit ::= "0" | nonzero nonzero ::= "1" | twoplus twoplus ::= "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

schatzsc commented 1 year ago

Related problem: if "molecule" is only monoatomic, there should be no "bond block" and for bond block definition ("tuple list") the definition itself is also relatively easy with bonds ::= "/" bond ( bond )* bond ::= "(" index "-" index ")" index ::= nonzero digit* but the two indices of a bond should not be the same and the index number should not be larger than the total number of atoms in the molecule, since in the canonical TUCAN we want continuous labelling of the nodes.

Or is this kind of check not part of formal grammer anymore, like in natural language where you can also construct formally correct sentences which are nevertheless nonsense, like the "invisible pink unicorn"?

schatzsc commented 1 year ago

Furthermore, I still have problems to define that a definition is not allowed to appear twice, for example

atoms ::= atom ( atom )* atom ::= "(" index ":" property ( "," property )* ")" property ::= a_name "=" a_value a_name ::= "MASS" | "RAD" a_value ::= nonzero digit*

allows MASS=value and RAD=value to appear multiple times in the "atom block"

rapodaca commented 1 year ago

How do I disallow an empty formula?

Here's one way. I'm not saying it's a good way, but it's a way.

formula       ::= carbon_then | hydrogen_then
                  /* etc. */
carbon_then   ::= c etc?
hydrogen_then ::= h etc?

Basically, you define the starting element for every possible formula, which is just north of 100. There are already that many element definitions, so it just doubles the amount of busy-ness :).

It's times like these that it helps to ask "at what cost"? Requiring non-empty formula is possible, but at what cost?

The cost here is a lot of complexity. Actually, the Hill formula/sum formula is a tightly-wound bundle of complexity if you think about it. The problem is all of those elements. So expressing formulas precisely is non-trivial to begin with. Disallowing empty formulas raises the bar still higher.

Or is this kind of check not part of formal grammer anymore, like in natural language where you can also construct formally correct sentences which are nevertheless nonsense, like the "invisible pink unicorn"?

I'd lump it into "semantics." Grammar/syntax defines the universe of possible strings. Semantics define what those strings mean. Some syntactically valid strings will have no semantic meaning. This is Ok and unavoidable.

Furthermore, I still have problems to define that a definition is not allowed to appear twice ...

Does the syntax require a particular ordering of MASS and RAD? If so, then that can be done the same way as formula. For example, Balsa's grammar requires properties to be defined in a specific order within bracket atom blocks. If any order is acceptable, I don't see ATM how to disallow duplicates.

Anyway, it's discussions like this that show how an early focus on tool-independent formal grammar can narrow down problem areas and expose bigger questions.

schatzsc commented 1 year ago

Well, one could require node attributes such as MASS and RAD to be given in a particular order (like by the alphbet) but there is no strict requirement to do so.

Based on the argument of complexity, I'd possibly also go that way to keep the formal grammar relatively simple and leave the "chemical interpretation" for later processing.

For example, once a TUCAN string with correct grammar is passed on by the parser, one could easily check for the empty sum formula, tuples in which both index numbers are identical or outside range (larger than number of atoms in the sum formula), fragment number higher than total fragments present, and MASS or RAD out of range to throw an error.

About MASS and RAD appearing multiple times, I would be tempted to either just take the first or final value or throw an error message.

Still need to think about the "graph" and "user/custom" blocks, then will post a "grammar file" here

flange-ipb commented 1 year ago

@schatzsc I remember we already discussed this as limitation of a parser: Should (and can?) a parser enforce the canonicalization rules?

My argument was that a parser for TUCAN is supposed to construct a valid graph object and this can also be achieved with a non-canonical TUCAN string. Examples:

(*) I absolutely agree that the set of possible node attributes (MASS and RAD at the moment) and their value types should be fixed because this layer is part of the identifier. This was incorrect in my initial proposal for this issue.

Multiplicity:

About MASS and RAD appearing multiple times, I would be tempted to either just take the first or final value or throw an error message.

This is also interesting in the other layers:

schatzsc commented 1 year ago

I think the formula layer should strictly adhere to the Hill system, which also seems to be easy to check with the formal grammar suggested by Richard, as it is really about counting the instances of the different elements

In the tuple layer, duplicates indeed do not have an impact, since NetworkX G.add_edge(e1, e2) will be the same as G.add_edge(e2, e1) and multiple identical calls to add_edge() will not change the graph. Also, as you already pointed out, the order of additions does not matter.

The node properties/attributes do not need to be sorted in any way in the non-canonical TUCAN string but we need a rule if an attribute is assigned multiple times differently - invalid and giving an error, first appearance to be considered as valid or last appearance valid?

flange-ipb commented 1 year ago

The node properties/attributes do not need to be sorted in any way in the non-canonical TUCAN string but we need a rule if an attribute is assigned multiple times differently - invalid and giving an error, first appearance to be considered as valid or last appearance valid?

We should keep that in mind until we have a reasonable parser implementation and then decide. From my experimentation with yacc I think all three options should be possible.

flange-ipb commented 1 year ago

I'd like to add a few examples to the limitations of a parser discussion. I started playing with another famous lexer/parser generator called ANTLR (which I may consider instead of PLY or lex/yacc). They also have a lot of nice grammar examples.

OpenSMILES grammar: This is more or less exactly the grammar formulated in section 2 of the OpenSMILES specification. Now, the specification names a few invalid examples, for instance "C-1CCCCC=1" for cyclohexene (section 3.4). The parser does not complain about it:

~/antlr/smiles$ echo "C-1CCCCC=1" | grun smiles smiles -tree
(smiles (chain (branched_atom (atom (aliphatic_organic C)) (ringbond (bond -) 1)) (branched_atom (atom (aliphatic_organic C))) (branched_atom (atom (aliphatic_organic C))) (branched_atom (atom (aliphatic_organic C))) (branched_atom (atom (aliphatic_organic C))) (branched_atom (atom (aliphatic_organic C)) (ringbond (bond =) 1))) (terminator \n) <EOF>)

~/antlr/smiles$ echo "something invalid" | grun smiles smiles -tree
line 1:2 mismatched input 'm' expecting {<EOF>, '\r', '\n', ' '}
(smiles (chain (branched_atom (atom (aromatic_organic s))) (branched_atom (atom (aromatic_organic o)))) m e t h i n g   i n v a l i d \n)

C grammar:

~/antlr/C$ echo "int pi = 3.14;" | grun C declaration -tree
(declaration (declarationSpecifiers (declarationSpecifier (typeSpecifier int))) (initDeclaratorList (initDeclarator (declarator (directDeclarator pi)) = (initializer (assignmentExpression (conditionalExpression (logicalOrExpression (logicalAndExpression (inclusiveOrExpression (exclusiveOrExpression (andExpression (equalityExpression (relationalExpression (shiftExpression (additiveExpression (multiplicativeExpression (castExpression (unaryExpression (postfixExpression (primaryExpression 3.14))))))))))))))))))) ;)

The parser sees no problems here, but a C compiler will tell us that we cannot assign a floating point number to an integer variable. Thus, after the syntactic analysis (parsing) the compiler performs a semantic analysis, which identifies these problems.

Conclusion: Not all rules of a "language" can and should be enforced by the grammar. Semantic analysis is required.

Examples of what could be part of a semantic analysis for TUCAN:

flange-ipb commented 1 year ago

Let me proceed with a grammar proposal for the Hill system formula. This is based on @rapodaca's approach. I used only a reduced set of element symbols, which later will be extended all 118 elements.

Notes:

W3C-EBNF:

formula ::= with_carbon | without_carbon
with_carbon ::= c h? b? n? o? p? s?
without_carbon ::= b? h? n? o? p? s?
h ::= "H" count?
b ::= "B" count?
c ::= "C" count?
n ::= "N" count?
o ::= "O" count?
p ::= "P" count?
s ::= "S" count?
count ::= TWO_TO_NINE | GREATER_THAN_NINE
TWO_TO_NINE ::= [2-9]
GREATER_THAN_NINE ::= [1-9] [0-9]+

BNF-Playground format:

<formula> ::= <with_carbon> | <without_carbon>
<with_carbon> ::= <c> <h>? <b>? <n>? <o>? <p>? <s>?
<without_carbon> ::= <b>? <h>? <n>? <o>? <p>? <s>?
<h> ::= "H" <count>?
<b> ::= "B" <count>?
<c> ::= "C" <count>?
<n> ::= "N" <count>?
<o> ::= "O" <count>?
<p> ::= "P" <count>?
<s> ::= "S" <count>?
<count> ::= <TWO_TO_NINE> | <GREATER_THAN_NINE>
<TWO_TO_NINE> ::= [2-9]
<GREATER_THAN_NINE> ::= [1-9] [0-9]+

ANTLR4:

grammar HillFormula;
// EOF will be moved to the end of the tucan rule in the TUCAN final grammar ...
formula : (with_carbon | without_carbon) EOF;
with_carbon : c h? b? n? o? p? s? ;
without_carbon : b? h? n? o? p? s? ;
h : 'H' count? ;
b : 'B' count? ;
c : 'C' count? ;
n : 'N' count? ;
o : 'O' count? ;
p : 'P' count? ;
s : 'S' count? ;
count : TWO_TO_NINE | GREATER_THAN_NINE ;
TWO_TO_NINE : [2-9] ;
GREATER_THAN_NINE : [1-9] [0-9]+ ;

With a few assumptions on the generated parse tree I already managed to extract the pairs of element symbol and count. Thus, an atom list can be constructed.

flange-ipb commented 1 year ago

Another defect would be to have a tuple with the same node index, for example (1-1). NetworkX doesn't offer any graph classes without self-loops and I guess they will become problematic in the canonicalization algorithm.

@JanCBrammer @schatzsc What should the behaviour be?

schatzsc commented 1 year ago

Let me proceed with a grammar proposal for the Hill system formula.

formula ::= with_carbon | without_carbon with_carbon ::= c h? b? n? o? p? s? without_carbon ::= b? h? n? o? p? s?

There is some discrepancy in the treatment of the hydrogens - the Hill system in the strict sense requires that "the number of carbon atoms in a molecule is indicated first, the number of hydrogen atoms next, and then the number of all other chemical elements subsequently, in alphabetical order of the chemical symbols. When the formula contains no carbon, all the elements, including hydrogen, are listed alphabetically."

https://en.wikipedia.org/wiki/Chemical_formula

However, there are also instances where hydrogen is always listed first ...

schatzsc commented 1 year ago

Another defect would be to have a tuple with the same node index, for example (1-1). NetworkX doesn't offer any graph classes without self-loops and I guess they will become problematic in the canonicalization algorithm.

What should the behaviour be?

* silently ignore

* raise a parser exception

* introduce an optional strict mode (I don't like that)

Such a "self-reference" is not allowed and should raise an exception, although I think the canonicalization algorithm will not be affected, since it only operates on the node properties, thus a "self-loop" will only be seen as an additional "bond". But it is still better not to allow this at all, since it is physically meaningless

flange-ipb commented 1 year ago

There is some discrepancy in the treatment of the hydrogens - the Hill system in the strict sense requires that "the number of carbon atoms in a molecule is indicated first, the number of hydrogen atoms next, and then the number of all other chemical elements subsequently, in alphabetical order of the chemical symbols. When the formula contains no carbon, all the elements, including hydrogen, are listed alphabetically."

https://en.wikipedia.org/wiki/Chemical_formula

However, there are also instances where hydrogen is always listed first ...

Can you please check that in my draft PR? There are already tests for the grammar, i.e. testing what can be parsed or not. I'd be happy to have more examples there. The ANTLR grammar file with all elements can be found here, but it's not complete yet.

schatzsc commented 1 year ago

Somehow it just shows me that the test has passed but I cannot easily see input and results?!?

flange-ipb commented 1 year ago

You could add something wrong to the parameter list of test_can_parse_sum_formula and test_cannot_parse_sum_formula and watch them failing. For instance

@pytest.mark.parametrize(
    "sum_formula",
    [
        "CHCl3",
        "ClH",
        "H2",
        "C123456789Zr987654321",
        "HCl",
    ],
)
def test_can_parse_sum_formula(sum_formula):
    _parse_sum_formula(sum_formula)

will fail with

...
>       raise ParseCancellationException(error_str)
E       antlr4.error.Errors.ParseCancellationException: line 1:1 extraneous input 'Cl' expecting <EOF>
E       HCl
E        ^^
schatzsc commented 1 year ago

I'm still not sure about the order of H in case of no carbon atoms. It seems to be a bit weird although formally correct that ClH is correct but HCl not, which is due to the alphabetical order, as HNO3 would be more natural. Interestingly, sulfuric acid would also be according to Hill HO4S.

Another important test would be element names which start with the same letter but then are one- vs. two-letter:

"A list of formulae in Hill system order is arranged alphabetically, as above, with single-letter elements coming before two-letter symbols when the symbols begin with the same letter (so "B" comes before "Be", which comes before "Br")"

https://en.wikipedia.org/wiki/Chemical_formula#Hill_system

flange-ipb commented 1 year ago

I'm still not sure about the order of H in case of no carbon atoms. It seems to be a bit weird although formally correct that ClH is correct but HCl not, which is due to the alphabetical order, as HNO3 would be more natural. Interestingly, sulfuric acid would also be according to Hill HO4S.

I'm just sticking to these rules. :wink:

How could this "human readable" formula be expressed? It's not "take C first, then H, then all other elements ordered by their atomic number".

Another important test would be element names which start with the same letter but then are one- vs. two-letter:

"A list of formulae in Hill system order is arranged alphabetically, as above, with single-letter elements coming before two-letter symbols when the symbols begin with the same letter (so "B" comes before "Be", which comes before "Br")"

This is exactly what test_element_order_is_correct() is doing. I could add another test that shuffles the list a bit and assert that the parsing fails.

flange-ipb commented 1 year ago

How could this "human readable" formula be expressed?

German Wikipedia explains it: sorting by electronegativity.

flange-ipb commented 1 year ago

Running into troubles in the roundtrip tests (molfile -> graph -> TUCAN (using serialize_molecule(canonicalize_molecule(m))) -> graph (parser invocation)) using the existing set of test molfiles. At the moment charges are serialized as node properties, for instance the TUCAN for the TNT molfile is

C6H3N3O6/(1-4)(2-5)(3-6)(4-7)(4-8)(5-7)(5-9)(6-8)(6-9)(7-10)(8-11)(9-12)(10-13)(10-14)(11-15)(11-16)(12-17)(12-18)/(10:chg=1)(11:chg=1)(12:chg=1)(14:chg=-1)(16:chg=-1)(18:chg=-1)

I'm not tempted to add 'chg' to the node_property_key rule and node_property_value would then also need negative values. The quick fix would be to remove chg from the list of serializable node properties in serialization.py. @schatzsc mentioned to introduce changes on fragments in the future, so those wouldn't serialize in the node properties anyways.

schatzsc commented 1 year ago

We should definitely not go down this "sorting by electronegativity" road. Let's stay with the Hill system and decide whether to always put the hydrogen before all other non-C/non-H elements or sort it by alphabet when no carbon is present.

schatzsc commented 1 year ago

In contrast to mass and rad which are strictly local properties and thus always associated with a particular node (and need to stick to it), chg is a problem since charges can be delocalized. The current idea is to take the local charges (chg in the molfile) and sum them up for the complete connected molecular fragments, then store them in a separate "fragment properties block", as down in https://github.com/schatzsc/TUCAN_playground in def charge_count(): However, the information on which individual charges were associated with which nodes/atom is lost in that process and cannot be recovered. Therefore, a test that checks that molfile -> graph -> TUCAN string -> graph -> molfile gives the same result is not possible for molecules that require the chg attribute. However, the sum formula and tuple (bond) blocks should still be the same. Therefore, I suggest to drop the chg attribute (and maybe also all the others) and just check the sum formula and topology does not change in that "round trip"

JanCBrammer commented 1 year ago

decide whether to always put the hydrogen before all other non-C/non-H elements or sort it by alphabet when no carbon is present.

What's the status on this decision?

flange-ipb commented 1 year ago

What's the status on this decision?

Keep the Hill formula rules. This is consistent with the existing implementation in serialize_molecule.

schatzsc commented 1 year ago

As with many things in cheminformatics and beyond, use is somewhat inconsistent and you can have long discussion what is more "natural" but yes, let's stay with published Hill rules.

rapodaca commented 1 year ago

To be clear, the path forward would be to encode the formula layer using Hill order. Carbon-free formulas list symbols alphabetically and carbon-containing formulas start with C, then H, then list the rest alphabetically. The only time H appears first is when (1) there is no C; and (2) there are no lexicographically smaller elements.

This complexity would be handled at the level of syntax (rather than semantics).

Yes so far?

I ask because this was a stumbling point for me in learning TUCAN. I thought that elements were mapped to indexes in Hill order, when they were actually mapped in atomic number order.

To actually perform the mapping an implementation needs to read the Hill formula, sort by ascending atomic number, then map. I suspect the burning question on the mind of anyone working with this format would be: why not just list the element symbols by ascending atomic number?

If changing the formula ordering were on the table, a case could be made to use atomic number order for that reason alone, not to mention syntactic simplification. It may look "unnatural" but would align with how TUCAN is set up now.

schatzsc commented 1 year ago

Thank you very much for your input - always highly appreciated.

Increase of node index with atomic number is a key element of TUCAN and helps for example with analyzing both metal coordination environments, as these tend to have the highest index (with the exception of Li and Be) and H always the lowest.

Is also used in some experimental features not in the official repository so far such as tautomer detection and implicit to explicit H pre-processor check, where CSD structures are taken, "pruned" of the lowest index non-bridging Hs, then Hs re-added by the pre-processor and Tucan strings before/after compared, if pre-processor works both have to be identical, which is acutally the case with approx. 80% of the non-polymeric, non-disordered organic subset (about 425.000 out of a total of 1.2 mio. structures) - and the 20% which are not correctly handled yet are mainly boron clusters and other main group element compounds with unusual "valences" (= coordination numbers) where algorithmic addition becomes difficult.

However, the actual order of elements in the sum formal does not matter as long as you can easily separate the individual elements and their stoichiometric coefficients. Fragment from my "TUCAN playground":

def graph_from_tucan_string(): input_blocks = input_string.split("/") sum_formula_string = input_blocks[0] element_count = {k: int(v) if v else 1 for k,v in re.findall(r"([A-Z][a-z]?)(\d+)?", sum_formula_string)} n = 1 for e in ELEMENT_PROPS: if(e in element_count): for i in range(0,element_count[e]): graph.add_node(n) n += 1

So what is done is kind of a "reverse lookup" where the ELEMENT_PROPS dictionary is scanned and then for each element checked if it was present in the sum formula and then a number of nodes added equivalent to the stoichiometric coefficient stored in element_count

Since the order of the addition is governed by the order of elements in ELEMENT_PROPS, which by itself is ordered by increasing atomic number https://github.com/TUCAN-nest/TUCAN/blob/main/tucan/element_properties.py, it does not matter how they are ordered in the element_count, since the loop just "pulls out" whatever element is up for addition now.

Therefore, although ClH and H2O4S look strange to me, I guess the simplest thing is to just stick to the Hill order, but also more for consistency. Not sure what happens with element_count construction if there are multiple instances of an element in the sum formula such like HOS4H but that should not be allowed anyway.

JanCBrammer commented 1 year ago

why not just list the element symbols by ascending atomic number?

If changing the formula ordering were on the table, a case could be made to use atomic number order for that reason alone, not to mention syntactic simplification. It may look "unnatural" but would align with how TUCAN is set up now.

+1 for the proposed simplification.

schatzsc commented 1 year ago

I'd like to strongly suggest to stick with the system that has been in place for 120 years now and which is the Hill order. It's what most chemists are used to, it is not that terribly difficult, and internally, it does not matter at all due to the "reverse look-up" via the ELEMENT_PROPS dictionary as outlined above.