Materials-Consortia / OPTIMADE

Specification of a common REST API for access to materials databases

Creative Commons Attribution 4.0 International

Add SMILES property #368

Open JPBergsma opened 3 years ago

JPBergsma commented 3 years ago

Do we want to allow the use of SMILES strings in the field chemical_formula_descriptive? The SMILES notation for molecular formulas uses '#' and '$' to indicate triple and quadruple bonds, the characters '/' and '\' to indicate whether bonds are in the cis or trans orientation, and '@' and '@@' to differentiate enantiomers. Finally, ring numbers with more than one digit have to be preceded by a '%' sign.
It therefore seems reasonable to me to add these to the allowed characters for the chemical_formula_descriptive field.

Or do you think we should add a separate SMILES field instead?

merkys commented 3 years ago

> Or do you think we should add a separate SMILES field instead?

I would suggest so. chemical_formula_descriptive has its own purpose and semantics, and they should not change.

rartino commented 3 years ago

@JPBergsma the topic of SMILES has come up a few times, and a standardization of SMILES use in OPTIMADE would likely be very useful. If you are familiar with SMILES usage, could you perhaps describe a few "search scenarios" for SMILES data? E.g., what would you be searching for? How do you envision such a search could be expressed, etc.?

JPBergsma commented 3 years ago

Sorry, I did not read the specification for chemical_formula_descriptive well enough the first time and I overlooked that it is already defined by the IUPAC's Nomenclature. I, therefore, had already closed the issue but unfortunately, I did not have sufficient privileges to remove it.

It would indeed be better to add a separate field for the SMILES string, although we could also think about other ways to add topological information, as SMILES strings cannot be compared directly.

rartino commented 3 years ago

(I took the liberty of editing your issue title to match - feel free to adjust it)

JPBergsma commented 3 years ago

First of all, defining the topology of a molecule allows you to distinguish between molecules with the same elemental composition but a different structure. Perhaps the current IUPAC definition is also able to do so, but via the link in optimade.rst https://www.qmul.ac.uk/sbcs/iupac/bibliog/blue.html I only found information about how to name chemical compounds and not how to write the structural formula. (IUPAC did define the InChI format which does contain the molecular structure, but that is different from the example fields in OPTIMADE.)

Ideally, having the structural data of a molecule would also allow you to find molecules with a mostly similar structure but some small differences. For example, a structure where a hydrogen atom has been replaced by a methyl group or a bromine atom has been replaced by a chlorine atom. While this would be quite useful, it may be difficult to implement such a search.

I am not sure whether SMILES is the best option for this. It has the advantage that the strings are relatively human-readable, but multiple SMILES strings can encode the same molecule. So you either have to convert the strings to structures before you know whether they are identical, or you have to agree on which algorithm to use to generate SMILES strings.

There are other ways to store the structure of a molecule, like InChI, and another option would be to use a connectivity matrix.

JPBergsma commented 2 years ago

During OMDI I talked with someone from the Ocelot database. Their database has crystal structures of organic molecules. They use SMILES strings to search for and select structures, as one structure can have many names and a simple structural formula is not descriptive enough. So I think there would definitely be a use for a SMILES field within OPTIMADE. In the original SMILES notation, multiple strings can encode the same molecule. Therefore, they first convert the string to a structure and then convert it back to a SMILES string with a known algorithm, so that the SMILES strings are guaranteed to be the same. They also match chemical groups: for example, when I searched for benzene, they also returned molecules containing a benzene ring. They have a git repository, so perhaps we could reuse some of their code to implement this in the OPTIMADE Python tools.

merkys commented 2 years ago

I support standardizing a separate property for SMILES. However, there are some issues related both to its definition and usability.

1. There is a bunch of competing SMILES specifications. I like the OpenSMILES as it is quite well-defined, albeit somewhat limited and unmaintained. Competing specifications mean that different software suites usually support one or another specification, but usually without clearly stating which one.
2. The same molecule can yield different SMILES. Canonicalization algorithms exist, but again there are many, without a prevalent one.
3. SMILES matching is not string matching. While identical SMILES almost always mean identical molecules, this is pretty much the only comparison one can do with plain strings. There are tools like Mychem which implement substructure search using SMILES strings in MySQL, but the general SMILES comparison usually boils down to subgraph isomorphism. Fingerprinting techniques are a viable alternative.
4. SMILES are directed mostly at organics. Therefore, compounds beyond organics are not trivial to represent, resulting in the need for additional conventions on representing them. We have contributed to an article about that, Quirós et al. 2018.

InChI is an alternative representation, however, it does not solve the matching problem. Moreover, it has licensing issues impeding its convenient usage.

rartino commented 2 years ago

There is also the question of how we handle this type of extension into string-like complex properties in the OPTIMADE filter language (and otherwise in our type system). A while back I wrote up my thoughts on this here: https://github.com/Materials-Consortia/OPTIMADE/issues/157#issuecomment-554686285

But, in short, we probably need to have some way to tell a normal string and a smiles string apart since they will have different comparison semantics.

JPBergsma commented 2 years ago

@Merkys

> There is a bunch of competing SMILES specifications. I like the OpenSMILES as it is quite well-defined, albeit somewhat limited and unmaintained. Competing specifications mean that different software suites usually support one or another specification, but usually without clearly stating which one.

1. The OpenSMILES standard is definitely an option. It seems practically the same as the SMILES definition on the Daylight website, so if necessary we could switch. Ideally, we would also use the SMARTS extension, which is specifically focused on querying structures, although it is not included in the OpenSMILES standard.

> The same molecule can yield different SMILES. Canonicalization algorithms exist, but again there are many, without a prevalent one.

2. Either the server would have to canonicalize the input from the client, or we would have to agree on a canonicalization algorithm that should be used by all clients and servers. At the moment I prefer canonicalization by the server, as this does not put canonicalization requirements on the client, and the server would need to do some processing anyway to handle queries using SMARTS. Internally, the server may also store structure information in a different format than SMILES, so it would need to do a conversion anyway. Another question would be whether we want to canonicalize the output.

> SMILES matching is not string matching. While identical SMILES almost always mean identical molecules, this is pretty much the only comparison one can do with plain strings. There are tools like Mychem which implement substructure search using SMILES strings in MySQL, but the general SMILES comparison usually boils down to subgraph isomorphism. Fingerprinting techniques are a viable alternative.

3. I think it will indeed be necessary to generate a molecular graph, although a preselection could be made using fingerprinting, for example by looking at the atom composition of the searched fragment or by comparing which common structural elements are present. This way, the full structures would only need to be compared for a relatively small number of structures.

> SMILES are directed mostly at organics. Therefore, compounds beyond organics are not trivial to represent, resulting in the need for additional conventions on representing them. We have contributed to an article about that, Quirós et al. 2018.

4. At first I was thinking about limiting the requirement for SMILES structures to organic compounds, but after reading your article, we could perhaps expand the SMILES definition to a broader range of compounds. In that case, we should formalize the method further than is currently described in the article. There may still be some arbitrariness in describing the atomistic structures, though, as some arbitrary cut-off point has to be chosen for defining a bond.

> InChI is an alternative representation, however, it does not solve the matching problem. Moreover, it has licensing issues impeding its convenient usage.

5. It seems that the discussion about the InChI licensing issue you refer to is still ongoing, so perhaps it will be resolved. I do not think using InChI for our database would go against the intention of the InChI Trust.

Standard InChI has the limitation that tautomers have the same InChI code. In a laboratory setting, it is usually not possible to separate the tautomers so this would not be a problem. But in computational chemistry, the timescales are usually so short that no conversion takes place. There is an extension for this so I think we should implement it if we would want to use InChI. That way each InChI should belong to exactly one structure. Personally, I find InChI less intuitive and human-readable than SMILES, so simply typing in an InChI code would be more difficult than with SMILES.

A final option would be to use a molecular graph for searching.

@rartino

Unless we decide on a canonicalization algorithm, the SMILES field should indeed not have the string type as a direct comparison of uncanonicalized SMILES strings is not possible.

merkys commented 2 years ago

(For brevity, I am not citing and explicitly responding to those of @JPBergsma's sentences with which I completely agree.)

> Ideally, we would also use the SMARTS extension, which is specifically focused on querying structures, although it is not included in the OpenSmiles standard.

This can already be implemented by using the custom extension endpoint mechanism.

> 2 Either the server would have to canonicalize the input from the client or we would have to agree on a canonicalization algorithm that should be used by all clients and servers. At the moment I prefer canonicalization by the server as this does not put canonicalization requirements on the client

Yes, this makes sense.

> and the server would need to do some processing anyway to handle queries using SMARTS.

Not necessarily. The server, for example, may just pass user input to Open Babel which either reconstructs molecular graphs or does fingerprint matching.

> Another question would be whether we want to canonicalize the output.

Preferably yes.

> 4 At first I was thinking about limiting the requirement for SMILES structures to organic compounds, but after reading your article we could perhaps expand The SMILES definition to a broader range of compounds. In that case, we should formalize the method further than is currently described in the article. There may still be some arbitrariness with describing the atomistic structures though, as some arbitrary cut-off point has to be chosen for defining a bond.

This would be nice, but again, all providers should use conventions as similar as possible.

> 5 It seems that the discussion about the InChI licensing issue, you refer to, is still ongoing so perhaps it will be resolved. I do not think using InChI for our database would go against the intention of the InChI Trust.

Strictly speaking, this is true only if providers manage to use the InChI library without modifying its code.

JPBergsma commented 2 years ago

> > Ideally, we would also use the SMARTS extension, which is specifically focused on querying structures, although it is not included in the OpenSmiles standard.
>
> This can already be implemented by using custom extension endpoint mechanism.

I am not sure what you mean by the custom extension endpoint mechanism. There is a custom extension endpoint in the OPTIMADE standard, but I do not see why that would be relevant here. You would want to use SMARTS/SMILES to find particular structures. Creating a separate endpoint to do this seems cumbersome. You would also want to standardize the way this works across multiple databases, which would be difficult if each database created its own custom endpoint.

> > the server would need to do some processing anyway to handle queries using SMARTS.
>
> Not necessarily. The server, for example, may just pass user input to Open Babel which either reconstructs molecular graphs or does fingerprint matching.

In that case, the SMILES string would still be processed on the server (as in the physical computer that deals with the request).

merkys commented 2 years ago

> > > Ideally, we would also use the SMARTS extension, which is specifically focused on querying structures, although it is not included in the OpenSmiles standard.
> >
> > This can already be implemented by using custom extension endpoint mechanism.
>
> I am not sure what you mean with custom extension endpoint mechanism. There is a custom extension endpoint in the Optimade standard, but I do not see why that would be relevant here. You would want to use the SMARTS/SMILES to find particular structures. Creating a separate endpoint to do this seems cumbersome. You would also want to standardize the way this works across multiple databases. Which would be difficult if each database would create a custom endpoint.

Sorry, I misparsed the term "extension".

I believe the SMARTS were originally described by Daylight. I am not sure about the state of other parallel SMARTS specifications, though.

> > > the server would need to do some processing anyway to handle queries using SMARTS.
> >
> > Not necessarily. The server, for example, may just pass user input to Open Babel which either reconstructs molecular graphs or does fingerprint matching.
>
> In that case the SMILES string would still be processed on the server(as in the physical computer that deals with the request.)

Yes, that is true.

merkys commented 2 years ago

Looking back at my discussion checklist, I think we at least agree on using OpenSMILES. However, other issues still need more discussion. My suggestions to speed up the introduction of SMILES property would be the following:

1. Server-provided SMILES need not be canonical. Since there are many canonicalization methods and we probably cannot select one from them all, servers should just provide any SMILES representation of a structure. Then it is up to the client to canonicalize them or not.
2. Comparisons of SMILES with other SMILES or strings, as well as querying, must not be supported. We may introduce this support later.

This would make the SMILES property a descriptive one. Thus, the client will be able to retrieve SMILES values alongside other structural data, but would not be able to query on them.

For dealing with inorganics I could propose adhering to Quirós et al. 2018 (disclaimer: I am one of the authors), but this would not be convenient for providers using their own conventions, or producing SMILES by Open Babel or some other software.

JPBergsma commented 2 years ago

I agree on point 1, that databases are allowed to use their own canonicalization method.

Part of the reason to implement this though is to make it easier to search for organic molecules, as these can have the same chemical formula. For that to work, it should be possible to search for SMILES strings. This should not be that difficult to implement. The database provider can turn the SMILES string of the query into a structure and turn it back into a smiles string with the canonicalization method of choice. The generated SMILES string can then be used for a simple string comparison with the SMILES fields in the database. Searching for fragments can still be added later on if necessary.

Quirós et al. 2018 could indeed be useful for describing metal complexes and the like, insofar as they are not covered by the OpenSMILES standard.

rartino commented 2 years ago

Aren't we landing in that we should just standardize a SMILES field to be a normal OPTIMADE String which is specified to contain an OpenSMILES representation of the implementer's choice (much like chemical_formula_descriptive, which had similar normalization issues with competing standards), and then put the requirement on MUST or SHOULD level that all partial string matching filter operators are supported?

(I realize it was said above that it cannot be a String because uncanonicalized SMILES "cannot be compared", but, the same issue technically holds for chemical_formula_descriptive and we were ok with that...)

> The database provider can turn the SMILES string of the query into a structure and turn it back into a smiles string with the canonicalization method of choice.

I'm not sure why you think such conversions would be needed (?), but if so, then this query support can only be on MAY level since it goes far beyond what can be handled efficiently by a typical query layer.

merkys commented 2 years ago

I agree that we can define SMILES as a regular OPTIMADE String with all string handling operations. Thus for the time being "O" != "[OH2]" is true as these strings are not equal, despite molecules with SMILES of O and [OH2] being actually the same.

So it seems we have consensus on most of the SMILES-related issues. ~~Let us prepare a PR then?~~ I have opened #392 from the consensus (IMO) we achieved here.

JPBergsma commented 2 years ago

If we define the SMILES field as a normal OPTIMADE string we should define the canonicalization method that should be used with OPTIMADE. Otherwise, it does not make sense to put the requirement on MUST or SHOULD level for the (partial) string matching filter operators, as one molecule can have multiple different SMILES strings.

One of the main reasons to implement the SMILES notation is to enable searching on molecular structures. Without this, sharing data on structures composed of individual molecules would be inefficient. More structures would need to be returned than needed, since you can only select on the chemical formula and many molecules can have the same chemical formula. I can imagine that for people who want to set up a database with molecular structures, not being able to search for molecules could be a reason to not use OPTIMADE.

> I'm not sure why you mean such conversions would be needed (?), but if so, then this query support can only be on MAY level since it goes far beyond what can be handled efficiently by a typical query layer.

The conversion would be needed if we do not agree on a canonicalization method. If you start generating the SMILES string from different atoms within a molecule, you would get a valid SMILES string for each starting atom, but they would all be different. Because of this, you can not do a simple string comparison to see if two SMILES strings refer to the same molecule. So you would first need to generate the structure from the SMILES string and then turn it back into a SMILES string with the same method that has been used to generate the SMILES strings in the database.

There are already Python packages that can convert SMILES strings into structures and back. RDKit can do this, and it also guarantees that the created SMILES string is canonicalized, i.e. you will always get the same string regardless of which SMILES string you originally used.

A simple way to make your structures with SMILES strings searchable is to convert your SMILES strings into structures and then back into SMILES strings with RDKit. This way, you can be sure all strings have the same canonicalization method. If you do the same for any SMILES string that is entered as a search term, it is guaranteed that two structures are the same if the SMILES strings match and different if they do not match. This means a simple string comparison, which most database backends should be able to do quickly, is sufficient to find identical molecules.
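The round trip described above can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_canonicalize` is a hypothetical stand-in that is valid solely for unbranched chains of one-letter atoms, and the entry IDs and record layout are invented for the example. A real provider would plug in a cheminformatics toolkit instead (with RDKit, for instance, something like `Chem.MolToSmiles(Chem.MolFromSmiles(s))`).

```python
from typing import Callable, Dict, List

Canonicalizer = Callable[[str], str]


def toy_canonicalize(smiles: str) -> str:
    """Hypothetical stand-in for a real canonicalizer.

    Only handles unbranched chains of one-letter atoms (e.g. "CCO"), where the
    only ambiguity is which end of the chain the string starts from.
    """
    if not (smiles.isalpha() and smiles.isupper()):
        raise ValueError(f"toy canonicalizer cannot handle {smiles!r}")
    # Pick one of the two possible writing directions deterministically.
    return min(smiles, smiles[::-1])


def build_index(records: Dict[str, str],
                canonicalize: Canonicalizer) -> Dict[str, List[str]]:
    """Map each canonical SMILES string to the IDs of the entries that have it."""
    index: Dict[str, List[str]] = {}
    for entry_id, smiles in records.items():
        index.setdefault(canonicalize(smiles), []).append(entry_id)
    return index


def find_exact(query: str, index: Dict[str, List[str]],
               canonicalize: Canonicalizer) -> List[str]:
    """Canonicalize the query the same way, then do a plain string lookup."""
    return index.get(canonicalize(query), [])


# Ethanol written from either end, plus ethylamine (made-up entry IDs).
records = {"entry-1": "CCO", "entry-2": "OCC", "entry-3": "CCN"}
index = build_index(records, toy_canonicalize)
print(find_exact("OCC", index, toy_canonicalize))  # ['entry-1', 'entry-2']
```

The point of the sketch is that the canonicalization method stays internal to the provider: the client may send any valid SMILES string, and the stored values only need to be consistent with each other.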

One issue that we have not yet discussed is how we are going to handle structures with multiple molecules. Within a normal SMILES string, these molecules are separated by ".". This would, however, require partial string matching to find the molecules. I suspect that this is relatively inefficient for databases, so I think it would be better to implement this as a list.

merkys commented 2 years ago

I agree that to implement reliable querying of exact structures we have to define a canonicalization method. This will most likely boil down to choosing a common software package to produce canonical SMILES for OPTIMADE output, be it RDKit, Open Babel or something else. In addition, if we want to support inorganics, all providers will have to select a common set of rules to describe them.

As for reliable partial molecular matching, IMO we will never get by with simple substring matching. Imagine, for example, patterns to match rings.

Here I would like to draw attention to the distinction between database querying and screening. The first one expects the database to perform entry selection, whereas the second one downloads the whole database and performs entry selection locally. I do not believe it is feasible to push all the providers to implement exact querying mechanisms. Thus IMO it is better to provide descriptive data in some common format and let the users perform the screening. With OPTIMADE provisions to include only specific fields in the response, downloads should not be too large.

Thus I would very much like to avoid forcing all the providers to use the same canonicalization method. I am afraid that instead of being a useful descriptive property, SMILES would be supported by only a few providers.

merkys commented 2 years ago

> One issue that we have not yet discussed is how we are going to handle structures with multiple molecules. Within a normal Smiles string these molecules are separated by ".", This would however require partial string matching to find the molecules. I suspect that this is relatively inefficient for databases, so I think it would be better to implement this as a list.

Right. I would prefer sticking to a string, not a list, because of how I imagine the SMILES property being used (screening instead of querying). In addition to that, the only list member comparison operator for strings is equality (i.e., smiles HAS "O" would match water molecules). Others (CONTAINS, STARTS WITH, ENDS WITH) are not supported even on the grammar level.
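The trade-off between a single dot-separated string and a list can be illustrated with plain Python. This is a rough sketch, not a proposal: `split_components` and `has_component` are hypothetical helpers, and the naive split assumes that every "." separates independent molecules, which does not hold for SMILES that use ring-closure digits across a dot (e.g. "C1.C1").

```python
from typing import List


def split_components(smiles: str) -> List[str]:
    """Split a dot-separated SMILES string into its parts (naive, see caveat)."""
    return smiles.split(".")


def has_component(smiles: str, component: str) -> bool:
    """Emulate list-style equality matching, similar to `smiles HAS "O"`."""
    return component in split_components(smiles)


ethanol_with_water = "CCO.O"
pure_ethanol = "CCO"

# Substring matching cannot tell a water molecule from the oxygen in ethanol.
print("O" in ethanol_with_water, "O" in pure_ethanol)  # True True

# Component-wise equality only matches a whole molecule written as "O".
print(has_component(ethanol_with_water, "O"),
      has_component(pure_ethanol, "O"))  # True False
```

Even the list-style match still depends on how the provider wrote each component: an entry that stores water as "[OH2]" would not be found by a query for "O" unless both sides are canonicalized the same way.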

JPBergsma commented 2 years ago

> As for reliable partial molecular matching, IMO we will never get around with simple substring matching. Imagine for example patterns to match rings.

Indeed, matching substructures is much more complicated and beyond the scope of PR #392.

> Here I would like to draw attention to the distinction between database querying and screening. The first one expects the database to perform entry selection, whereas the second one downloads whole database and performs entry selection locally. I do not believe it is feasible to push all the providers to implement exact querying mechanisms. Thus IMO it is better to provide descriptive data in some common format and let the users perform the screening. With OPTIMADE provisions to include only specific fields in the response, downloads should not be too large.

Screening would be less efficient for both the client and the server. The database would have to send the SMILES strings of many structures to the client (although some preselection can be made based on the elements in the SMILES string/molecule). Then the client would have to convert all these SMILES strings to structures so that they can be compared with the molecular structure that the client is searching for. Once the SMILES strings that encode the desired molecule have been found, the client would have to send a query to the database for the records with these SMILES strings. The database would then have to loop over all SMILES values to check which contain these SMILES strings before returning the desired structures. This takes much more computing time than the method I suggested. I am therefore convinced that we should not force databases to use the screening method you described.

> Right. I would prefer sticking to string, not list because of how I imagine SMILES property to be used (screening instead of querying). In addition to that, the only list member comparison operator for string is equality (i.e., smiles HAS "O", would match water molecules). Others (CONTAINS, STARTS WITH, ENDS WITH) are not supported even on grammar level.

There are not many useful substring queries you can do on SMILES strings. You could check whether triple and quadruple bonds or charges are present, but that's about it. So we would not lose that much by converting the field to a list. And it would of course also be possible to expand the queryability of strings in a list, although that's best left for a different PR.

merkys commented 2 years ago

> Screening would be less efficient for both the client and the server: The database would have to send the SMILES strings of many structures to the client. (based on the elements in the SMILES string/molecule, some preselection can be made) Then the client would have to convert all these SMILES strings to structures so that they can be compared with the molecular structure that the client is searching. Once the SMILES strings have been found that encode for the desired molecule, The client would have to send a query to the database for the records with these SMILES strings. And the database would, then, have to loop over all SMILES values to check which contain these SMILES strings, before returning the desired structures. This takes much more computing time than the method I suggested. I am therefore convinced that we should not force databases to use the screening method you described.

In my understanding screening is simpler. A generic screening workflow:

1. Client retrieves all information required for screening;
2. Client performs screening locally to find entry IDs of interest;
3. Using entry IDs client retrieves full entry records from the database.
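The three steps above could look roughly like this on the client side. It is a sketch under assumptions: the property name `smiles`, the base URL, the entry IDs and the toy canonicalizer (valid only for unbranched chains of one-letter atoms) are all invented for illustration, and no network request is actually made. Only the `response_fields` query parameter and the `id` filter syntax are taken from OPTIMADE itself.

```python
from typing import Callable, Dict, List
from urllib.parse import urlencode


def screening_url(base_url: str) -> str:
    """Step 1: request only the fields needed for screening."""
    # "smiles" is a hypothetical property name used for illustration.
    query = urlencode({"response_fields": "id,smiles"}, safe=",")
    return f"{base_url}/structures?{query}"


def screen_locally(entries: Dict[str, str], wanted: str,
                   canonicalize: Callable[[str], str]) -> List[str]:
    """Step 2: compare canonicalized SMILES locally and keep matching IDs."""
    target = canonicalize(wanted)
    return [entry_id for entry_id, smiles in entries.items()
            if canonicalize(smiles) == target]


def retrieval_filter(entry_ids: List[str]) -> str:
    """Step 3: build an OPTIMADE filter selecting the entries by ID."""
    return " OR ".join(f'id="{entry_id}"' for entry_id in entry_ids)


def toy_canonicalize(smiles: str) -> str:
    """Toy stand-in for a toolkit call; unbranched one-letter-atom chains only."""
    return min(smiles, smiles[::-1])


# Pretend this mapping of id -> SMILES came back from the step 1 request.
downloaded = {"entry-1": "CCO", "entry-2": "OCC", "entry-3": "CCN"}
ids = screen_locally(downloaded, "OCC", toy_canonicalize)

print(screening_url("https://example.org/optimade/v1"))
print(retrieval_filter(ids))  # id="entry-1" OR id="entry-2"
```

In this workflow the provider never has to interpret SMILES values; all structure comparison happens with whatever toolkit the client chooses.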

Thus for SMILES there is no need to query the database on SMILES values, ever. As for converting SMILES to structures locally, to perform the screening locally a client most likely will use RDKit or Open Babel or any other cheminformatics toolbox.

I agree that this is more computing time than just storing canonicalized SMILES in provider databases. However, all providers have to agree on the same canonicalization method and this has to be enforced on MUST level (otherwise it cannot be trusted). And I do not believe this is feasible.

As I have written before, there are many SMILES canonicalization methods. However, they are rarely well-defined. I am not in favor of writing "SMILES canonicalization MUST be done by RDKit" in the specification, because we will have to put down the specific RDKit version (other versions may change the canonicalization), even specific versions of its dependencies if we are interested in providing really reliable service. And this, I believe, opens yet another can of worms. So unless we find a well-defined SMILES canonicalization method supported by more than one (ideally, >2) cheminformatics toolboxes, I do not think we can enforce it.

> There are not many useful substring queries you can do on SMILES strings. You could check whether triple and quadruple bonds or charges are present, but that's about it. So we would not lose that much by converting the field to a list. And it would off course also be possible to expand the queryability of strings in a list, although that's best left for a different PR.

I agree that substring comparisons are not very useful indeed. I have opened issue #393 to discuss the expansion of the queryability of strings in a list. But I would stick to single-string SMILES representation due to its simplicity unless we mandate strict canonicalization.

It would be great to hear the opinions of other developers interested in this property.

rartino commented 2 years ago

It indeed seems a problem for the canonicalization approach, if there isn't any good standard to use for canonicalization.

But, I also find it quite abstract what kind of "high level searches" we are talking about here that are connected to the SMILES field specifically, as opposed to our other structural fields.

@JPBergsma could you try to come up with a few examples of "dream" searches that you envision possible if one does the on-the-fly conversion from SMILES to structure that you propose? Feel free to just improvise a filter syntax.

merkys commented 2 years ago

> But, I also find it quite abstract what kind of "high level searches" we are talking about here that are connected to the SMILES field specifically, as opposed to our other structural fields.

I understand that currently we are mostly talking about identical match operation. If the canonicalization becomes a MUST, then this reduces to simple string comparison (= and != operators).

Early in the discussion fuzzy matching was discussed. There is SMARTS query language which could be employed to search for substructures, for example:

- smiles CONTAINS SMARTS "c1ccccc1" could be used to find structures having benzene rings (CONTAINS SMARTS is a "dream operator" here)
- smiles SMARTS "c1ccccc1" could be used to find structures that are exactly benzene rings (SMARTS being a "dream operator"). Not sure whether the SMARTS language has provisions for exact match, though.

merkys commented 2 years ago

> There is a bunch of competing SMILES specifications. I like the OpenSMILES as it is quite well-defined, albeit somewhat limited and unmaintained. Competing specifications mean that different software suites usually support one or another specification, but usually without clearly stating which one.

There is an interesting new development called Dialect. It is an attempt to fix and extend the commonly used SMILES standard. I am not suggesting switching to it right away, as it is in its early stages of development; I am just linking it for reference.

JPBergsma commented 2 years ago

> In my understanding screening is simpler. A generic screening workflow:
>
> 1. Client retrieves all information required for screening;
> 2. Client performs screening locally to find entry IDs of interest;
> 3. Using entry IDs client retrieves full entry records from the database.

That is also possible. It does, however, require sending more information (the IDs) to the client. If the SMILES field is indexed, it would not take extra time to use the SMILES strings instead.

> I agree that this is more computing time than just storing canonicalized SMILES in provider databases. However, all providers have to agree on the same canonicalization method and this has to be enforced on MUST level (otherwise it cannot be trusted). And I do not believe this is feasible.

If we treat the SMILES string as a plain string, we would indeed need to agree upon a canonicalization method. This would be the most efficient. We could, however, let the database convert the SMILES string that is entered in the search to a SMILES string in the canonicalization format of the database. That way, the database provider would only have to compare strings, and there does not need to be agreement on the canonicalization method.

I quickly looked, but I could not find which exact method RDKit uses. It is a shame that there is no well-adopted canonicalization method, even though a canonicalization method was already defined with the original SMILES standard.

Dialect is a nice initiative, but I am a bit worried that we would get just another standard that is not widely adopted. There is, for example, already SYBYL, which is another SMILES-derived format to specify chemical structures. It does not, however, define a canonicalization method, so it would not solve our problem.

@rartino As Merkys mentioned, you mostly want to find structures with specific molecules; some libraries like RDKit will also generate tautomers for a structure. Finding molecules with a certain substructure would be great, but I do not think this can be done efficiently with the backends that are currently used (SQL, MongoDB, Elasticsearch).

merkys commented 2 years ago

In my understanding screening is simpler. A generic screening workflow:

Client retrieves all information required for screening; Client performs screening locally to find entry IDs of interest; Using entry IDs client retrieves full entry records from the database.

That is also possible. It does, however, require sending more information (the IDs) to the client. If the SMILES field is indexed, it would not take extra time to use the SMILES strings instead.

I do not see a problem in retrieving IDs. Most of the time clients will want IDs and versions/modification timestamps for provenance anyway.
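The three-step screening workflow quoted above can be sketched as follows. This is only an illustration under stated assumptions: the in-memory `DATABASE`, the helper names and the field names are hypothetical, and the plain substring test stands in for whatever real screening the client would run locally.

```python
from typing import Callable, Dict, List

# Toy in-memory "database"; entries and field names are illustrative only.
DATABASE: Dict[str, dict] = {
    "1": {"id": "1", "smiles": "c1ccccc1", "nelements": 2},
    "2": {"id": "2", "smiles": "CCO", "nelements": 3},
    "3": {"id": "3", "smiles": "Oc1ccccc1", "nelements": 3},
}


def fetch_fields(fields: List[str]) -> List[dict]:
    """Step 1: retrieve only the fields needed for screening."""
    return [{name: entry[name] for name in fields} for entry in DATABASE.values()]


def screen(records: List[dict], predicate: Callable[[dict], bool]) -> List[str]:
    """Step 2: screen locally and keep the IDs of interest."""
    return [record["id"] for record in records if predicate(record)]


def fetch_entries(ids: List[str]) -> List[dict]:
    """Step 3: retrieve the full records for the selected IDs."""
    return [DATABASE[entry_id] for entry_id in ids]


records = fetch_fields(["id", "smiles"])
# Stand-in predicate: a substring test, not real substructure matching.
hits = screen(records, lambda record: "c1ccccc1" in record["smiles"])
full_entries = fetch_entries(hits)
```

The IDs collected in step 2 are the only extra data the client has to hold on to, and they are also what a client would keep for provenance.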

I agree that this is more computing time than just storing canonicalized SMILES in provider databases. However, all providers have to agree on the same canonicalization method and this has to be enforced on MUST level (otherwise it cannot be trusted). And I do not believe this is feasible.

If we treat the SMILES string as a plain string, we would indeed need to agree upon a canonicalization method. This would be the most efficient. We could however let the database convert the SMILES string, that is entered in the search, to a SMILES string in the canonicalization format of the database. That way, the database provider would only have to compare strings, and there does not need to be agreement on the canonicalization method.

Right, but canonicalization methods will affect matching. A trivial example is aromatized vs. kekulized aromatic rings. If the provider does not canonicalize these, then kekulized input will only match kekulized molecules in the database.
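A minimal sketch of the query-side canonicalization idea and of the kekulized vs. aromatized pitfall. The lookup table below is a hypothetical stand-in for a provider's canonicalizer (a real provider would call its own cheminformatics toolkit); the function names and the stored entries are assumptions for illustration.

```python
from typing import Callable, Dict, List

# Stand-in for a provider's canonicalizer. This table is purely illustrative
# and is NOT how any real toolkit canonicalizes SMILES.
_TOY_CANONICAL_FORMS: Dict[str, str] = {
    "C1=CC=CC=C1": "c1ccccc1",  # kekulized benzene
    "c1ccccc1": "c1ccccc1",     # aromatized benzene
    "OCC": "CCO",               # ethanol written from the oxygen end
    "CCO": "CCO",
}


def toy_canonicalize(smiles: str) -> str:
    """Map a SMILES string to the provider's canonical spelling (toy version)."""
    return _TOY_CANONICAL_FORMS.get(smiles, smiles)


def find_by_smiles(
    query: str,
    stored: Dict[str, str],
    canonicalize: Callable[[str], str] = toy_canonicalize,
) -> List[str]:
    """Return IDs whose stored canonical SMILES equals the canonicalized query."""
    canonical_query = canonicalize(query)
    return sorted(
        entry_id for entry_id, smiles in stored.items() if smiles == canonical_query
    )


# The provider stores SMILES already in its own canonical form.
stored = {"entry-1": "c1ccccc1", "entry-2": "CCO"}

with_canonicalization = find_by_smiles("C1=CC=CC=C1", stored)
without_canonicalization = find_by_smiles(
    "C1=CC=CC=C1", stored, canonicalize=lambda smiles: smiles
)
```

With the query canonicalized, the kekulized input finds the aromatized entry; with the identity function in its place, the same query finds nothing, which is the mismatch described above.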

I quickly looked, but I could not find which exact method RDKit uses. It is a shame that there is no well-adopted canonicalization method, even though a canonicalization method was already defined with the original SMILES standard.

AFAIR, this method has many deficiencies. Maybe this is the reason it has not been adopted widely.

Dialect is a nice initiative, but I am a bit worried that we would get just another standard that is not widely adopted. There is, for example, already SYBYL, which is another SMILES-derived format to specify chemical structures. It does not, however, define a canonicalization method, so it would not solve our problem.

Sure, but I like the idea. Dialect does not seem to aim to reinvent a SMILES-like notation or extend it, but clarify the obscure parts which are often interpreted differently.

merkys commented 2 years ago

Introduced _cod_smiles in the COD OPTIMADE implementation. It is a plain string, just as suggested in #392. String-based queries are not implemented yet, though.

rartino commented 2 years ago

@JPBergsma

We could however let the database convert the SMILES string, that is entered in the search, to a SMILES string in the canonicalization format of the database. That way, the database provider would only have to compare strings, and there does not need to be agreement on the canonicalization method.

If we can do canonicalization, this is indeed the design to go for (to enable this kind of cheap on-the-fly translation + optimized backend query has been a guiding principle for other fields, which is the reason we do not enforce support for partial string matching on such fields...)

However, in absence of a good explicitly formulated canonicalization, I have trouble seeing a solution beyond the chemical_formula_descriptive approach where each database does what makes the most sense to them.

Nevertheless, if the dream is to query on substructures, maybe this can be done in another way than as a quasi-string-operation on a single SMILES field? Could we have something like an optional SMILES_substructures which is a list of all identifiable substructures? It could then be queried like: SMILES_substructures HAS "c1ccccc1".

(Since the implementation knows the specific (quasi-)canonicalization used by the backend, it may be able to translate this query to a partial string matching on the backend SMILES field.)

merkys commented 2 years ago

@rartino

Nevertheless, if the dream is to query on substructures, maybe this can be done in another way than as a quasi-string-operation on a single SMILES field? Could we have something like an optional SMILES_substructures which is a list of all identifiable substructures? It could then be queried like: SMILES_substructures HAS "c1ccccc1".

The number of all possible substructures times all possible representations is just too large for anything but the most trivial molecules. Narrowing this set down to an arbitrary subset increases the risk of false-negatives, and this is something I very much would like to avoid.

rartino commented 2 years ago

@merkys

The number of all possible substructures times all possible representations is just too large for anything but the most trivial molecules.

Indeed. My intent was not for the list to contain "all possible representations" but rather that we could find some standardization for substructures. I suppose you could argue that if there is no canonicalized form for the full SMILES, then there also is none for substructures. Nevertheless, maybe one could refer to some standard list/database of substructures and say something along the lines of ~ "substructures SHOULD only be listed if they are present in list X, and, if present, MUST use the precise SMILES in that list"?

On the other hand - I suppose we could make the same kind of canonicalization for the full SMILES formula? Not standardizing the full formula, but say that all substructures present in a list must be on the form in the list?

Narrowing this set down to an arbitrary subset increases the risk of false-negatives, and this is something I very much would like to avoid.

Well, given that the detection of substructures is subject to a possibly imperfect detection algorithm with a certain level of subjectivity in what is regarded as a bond, etc., I don't think it is technically possible to eliminate false-negatives. (But perhaps you mean that if I have identified substructure Y, then there should be no false-negative if you are also looking for that substructure.)

merkys commented 2 years ago

@rartino

Indeed. My intent was not for the list to contain "all possible representations" but rather that we could find some standardization for substructures. I suppose you could argue that if there is no canonicalized form for the full SMILES, then there also is none for substructures. Nevertheless, maybe one could refer to some standard list/database of substructures and say something along the lines of ~ "substructures SHOULD only be listed if they are present in list X, and, if present, MUST use the precise SMILES in that list"?

This seems doable. In particular, there are readily used Open Babel fingerprints which are binary strings where every bit tells existence of a certain substructure (i.e., 1 - has benzene ring, 0 - does not have benzene ring). Open Babel-defined substructure set is greatly refined towards organics, but so is SMILES, I guess.
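A minimal sketch of the bit-per-substructure idea, assuming a tiny hypothetical key set. The names and bit positions in `SUBSTRUCTURE_BITS` are made up for illustration and do not reflect the layout of any Open Babel fingerprint.

```python
from typing import Dict, Iterable

# Hypothetical substructure keys; the bit positions are arbitrary.
SUBSTRUCTURE_BITS: Dict[str, int] = {
    "benzene_ring": 0,
    "hydroxyl": 1,
    "carbonyl": 2,
}


def make_fingerprint(present: Iterable[str]) -> int:
    """Set one bit for every substructure known to be present."""
    fingerprint = 0
    for name in present:
        fingerprint |= 1 << SUBSTRUCTURE_BITS[name]
    return fingerprint


def has_all(entry_fingerprint: int, query_fingerprint: int) -> bool:
    """True if every bit set in the query is also set in the entry."""
    return (entry_fingerprint & query_fingerprint) == query_fingerprint


phenol = make_fingerprint(["benzene_ring", "hydroxyl"])
acetone = make_fingerprint(["carbonyl"])
query = make_fingerprint(["benzene_ring"])

phenol_matches = has_all(phenol, query)
acetone_matches = has_all(acetone, query)
```

Because the key set is fixed and shared, a query for a listed substructure cannot miss an entry merely because the substructure was spelled differently; anything outside the key set is simply not searchable this way.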

On the other hand - I suppose we could make the same kind of canonicalization for the full SMILES formula? Not standardizing the full formula, but say that all substructures present in a list must be on the form in the list?

I am afraid this would in the end lead us to developing our own SMILES canonicalization algorithm and its implementation. This might be a feat on its own.

Well, given that the detection of substructures is subject to a possibly imperfect detection algorithm with a certain level of subjectivity in what is regarded as a bond, etc., I don't think it is technically possible to eliminate false-negatives. (But perhaps you mean that if I have identified substructure Y, then there should be no false-negative if you are also looking for that substructure.)

I meant that even for quite simple chemical structures the number of SMILES representations is quite large. If a provider decides to limit it, querying with any of the excluded representations will yield a false-negative. Thus we need canonical representations of the substructures.

On a slightly separate topic, it would be interesting to check the support of SMARTS. If these are supported by several cheminformatics packages, we may just introduce an OPTIMADE extension endpoint for querying chemical structures using SMARTS.

merkys commented 2 years ago

OK, I have checked the SMARTS support and it seems that at least Indigo Toolkit, Open Babel and RDKit support it. All of these packages are F/LOSS and have official APIs for both C and Python, so I guess most of the providers will be comfortable with using some of these packages.

Given that, I would propose abandoning SMILES_substructures property in favor of means to filter using SMARTS. There is more than one way to do so, but none of them seems quite elegant for me:

1. Introduce OPTIMADE extension endpoint, something like /optimade/extensions/smarts/<entry type>/<SMARTS> where <entry type> currently would be structures and <SMARTS> is the SMARTS query.
2. Introduce SMARTS operator in filter language to support queries like smiles SMARTS "<SMARTS>".
3. Redefine CONTAINS operator in filter language for SMILES-typed properties.
4. Introduce a URL query parameter smarts.

All of 2-4 would require substantial modifications in the specification (and they do not go in easy). 1 and 4 would not allow combination of search criteria (no AND/OR). So no silver bullet here.

Lastly, I wonder whether the syntax and interpretation of SMARTS is the same across these three packages. As I do not use SMARTS much, I cannot comment, however.

rartino commented 2 years ago

@merkys This is indeed a bit tricky because we didn't so far really consider how to support other query languages alongside our own.

At first, it seems preferable to embed these queries into our own query language (so, not your option (1) or (4)), so - as you conclude - it is possible to express things like "<SMARTS query> OR nelements=3".

However, I have some reluctance to smiles SMARTS "<SMARTS>", because strictly speaking, SMARTS is a query language for 'structures', it isn't inherently connected to the smiles field. One could imagine a database that does not populate the smiles field but still can be queried with SMARTS.

Furthermore, trying to think ahead, this issue is bound to come up again with other query languages; and I'm not sure we want to try to embed everything into our own language.

So, maybe the most uncomplicated solution is to see this as an alternative "filter". I.e., your option (4) but maybe naming the parameter filter_smarts. We can then say that supporting multiple different filter-type URL parameters is OPTIONAL, but if supported the construct MUST be interpreted as the intersection of the filter results. This at least supports an outermost "AND" combination (since the outermost "OR" can be done as consecutive queries).
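A minimal sketch of how the two parameters could sit side by side in a URL and how a server might combine their results, assuming each filter is evaluated separately and yields a list of entry IDs. The base URL and function names are hypothetical; `filter_smarts` is the parameter name proposed above, not something in the current specification.

```python
from typing import List
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint, used only to show the shape of the request.
BASE_URL = "https://example.org/optimade/v1/structures"


def build_url(optimade_filter: str, smarts: str) -> str:
    """Put the regular filter and the SMARTS filter in separate URL parameters."""
    query = urlencode({"filter": optimade_filter, "filter_smarts": smarts})
    return f"{BASE_URL}?{query}"


def intersect_by_id(filter_ids: List[str], smarts_ids: List[str]) -> List[str]:
    """Combine two filter results as an intersection (the outermost AND)."""
    smarts_set = set(smarts_ids)
    return [entry_id for entry_id in filter_ids if entry_id in smarts_set]


url = build_url("nelements=3", "c1ccccc1")
parsed = parse_qs(urlsplit(url).query)

combined = intersect_by_id(["a", "b", "c"], ["c", "a", "z"])
```

A backend that can run both conditions in one native query would not need the explicit intersection step; the sketch only shows the semantics the two parameters would have together.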

merkys commented 2 years ago

@rartino

However, I have some reluctance to smiles SMARTS "<SMARTS>", because strictly speaking, SMARTS is a query language for 'structures', it isn't inherently connected to the smiles field. One could imagine a database that does not populate the smiles field but still can be queried with SMARTS.

Agree, this does not look elegant.

Furthermore, trying to think ahead, this issue is bound to come up again with other query languages; and I'm not sure we want to try to embed everything into our own language.

Completely agree.

So, maybe the most uncomplicated solution is to see this as an alternative "filter". I.e., your option (4) but maybe naming the parameter filter_smarts. We can then say that supporting multiple different filter-type URL parameters is OPTIONAL, but if supported the construct MUST be interpreted as the intersection of the filter results. This at least supports an outermost "AND" combination (since the outermost "OR" can be done as consecutive queries).

Agree with every word here!

So it seems we are arriving at these properties (all OPTIONAL):

smiles: string representation of structure contents, without an attempt to canonicalize; SHOULD support all string query features;
smiles_substructures: a list of strings representing sensible substructures in the structure (not yet sure how to express them in canonical manner - to be discussed).

Plus filter_smarts URL parameter to select chemical structures matching SMARTS. My worry that SMARTS is not strictly defined language remains.

How about this? Still we have some homework to do regarding smiles_substructures and filter_smarts. Of course we may as well ignore the possible ambiguity and leave it for the future.

merkys commented 2 years ago

In today's Web meeting @JPBergsma advocated for specific handling of string comparisons on smiles property: the provider may optionally canonicalize the value queried with = or CONTAINS operator before performing the actual string comparison. @rartino, @ml-evs and I advocated against specific handling of queried values, as none of the string-valued properties currently in the spec mandates/suggests any specific queried value treatment.

I would be happy to include filter_smarts in #392. Could the advocates for smiles_substructures property provide a description for it, if they believe it is not superseded by filter_smarts URL parameter?

JPBergsma commented 2 years ago

I think it would be best to create a separate issue/PR for the filter_smarts. Adding it to the current PR could again lead to a lot of new discussion, which would postpone the acceptance of the SMILES field.

I would prefer it if SMARTS queries could be a part of the filter, just as any other condition. This would give the user the maximum freedom to create queries. And it would also be more efficient for the server. In the proposal of @rartino some structures would be returned twice because OR could only be executed by doing two separate queries. A structure that fulfils both conditions would thus be returned twice.

merkys commented 2 years ago

I think it would be best to create a separate issue/PR for the filter_smarts. Adding it to the current PR could again lead to a lot of new discussion, which would postpone the acceptance of the SMILES field.

Agree. It makes sense to have separate PRs. I will open a separate PR for filter_smarts.

I would prefer it if SMARTS queries could be a part of the filter, just as any other condition. This would give the user the maximum freedom to create queries. And it would also be more efficient for the server.

These are very valid arguments. However, @rartino's post and Friday's Web meeting convinced me otherwise.

In the proposal of @rartino some structures would be returned twice because OR could only be executed by doing two separate queries. A structure that fulfils both conditions would thus be returned twice.

I do not see this as a problem. All entries have IDs and they can be used to pick only the unique structures.
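A minimal sketch of that client-side step, assuming each query returns a list of entries that carry an `id` key. The function name and the sample data are hypothetical.

```python
from typing import Dict, List


def merge_unique(*result_sets: List[dict]) -> List[dict]:
    """Merge the results of several queries, keeping one entry per ID."""
    unique: Dict[str, dict] = {}
    for results in result_sets:
        for entry in results:
            # Keep the first occurrence; later duplicates are ignored.
            unique.setdefault(entry["id"], entry)
    return list(unique.values())


# Results of two consecutive queries that together emulate an outermost OR.
smarts_results = [{"id": "1"}, {"id": "3"}]
filter_results = [{"id": "3"}, {"id": "4"}]

merged = merge_unique(smarts_results, filter_results)
```

The entry with ID "3" satisfies both conditions and is returned by both queries, but it appears only once in `merged`.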

rartino commented 2 years ago

I would prefer it if SMARTS queries could be a part of the filter, just as any other condition. This would give the user the maximum freedom to create queries. And it would also be more efficient for the server.

These are very valid arguments. However, @rartino's post and Friday's Web meeting convinced me otherwise.

I agree that allowing intermixed queries gives more flexibility in what queries can be expressed. ~I do not agree about the efficiency.~ Edit: (from the discussion below I see that I misunderstood the efficiency part.)

There are two different possible solutions on the backend for backends that can handle SMILES:

1. A backend that actually supports executing efficiently an intermixed OPTIMADE + SMILES query. In that case, if we go with allowing intermixed queries, the query is just executed. If we instead go with filter_smarts the backend just combines the two queries to: ~<SMARTS QUERY> OR <OPTIMADE QUERY>~ <SMARTS QUERY> AND <OPTIMADE QUERY>. There is no loss in efficiency, but there is indeed loss in the flexibility of what can be expressed.
2. A backend that cannot do a mixed query. In that case it has to do some form of unwrapping of the mixed query to handle it - and in most cases reject it. However, if we go with filter_smarts then it will be trivial to support at least a SMILES-only query on a backend that supports that.

Now, the question is - will (1) or (2) be the more common one? My somewhat unfounded suspicion is that there is no backend today that can efficiently do (1).

In the proposal of @rartino some structures would be returned twice because OR could only be executed by doing two separate queries. A structure that fulfils both conditions would thus be returned twice.

~No, if a backend that actually supports intermixed queries just translates a two filter-argument query into (<SMARTS QUERY>) OR (<OPTIMADE QUERY>) there is no loss in efficiency and no risk for duplicates.~ Edit: @JPBergsma was right here, then I agree with @merkys that you'd have to remove duplicates by ids.

merkys commented 2 years ago

@rartino

There are two different possible solutions on the backend for backends that can handle SMILES:

A backend that actually supports executing efficiently an intermixed OPTIMADE + SMILES query. In that case, if we go with allowing intermixed queries, the query is just executed. If we instead go with filter_smarts the backend just combines the two queries to: <SMARTS QUERY> OR <OPTIMADE QUERY>. There is no loss in efficiency, but there is indeed loss in the flexibility of what can be expressed.

Didn't you suggest before that given filter and filter_smarts they should be joined as <SMARTS QUERY> AND <OPTIMADE QUERY>, or am I just misunderstanding the cited paragraph?

A backend that cannot do a mixed query. In that case it has to do some form of unwrapping of the mixed query to handle it - and in most cases reject it. However, if we go with filter_smarts then it will be trivial to support at least a SMILES-only query on a backend that supports that.

Now, the question is - will (1) or (2) be the more common one? My somewhat unfounded suspicion is that there is no backend today that can efficiently do (1).

I have the same feeling about (1).

By the way, I have opened PR #398 introducing filter_smarts.

rartino commented 2 years ago

@merkys

Didn't you suggest before that given filter and filter_smarts they should be joined as <SMARTS QUERY> AND <OPTIMADE QUERY>, or am I just misunderstanding the cited paragraph?

Right, sorry - this was just a mistype - replace every "OR" in that reply with "AND" ~other than that I stand by what I said.~

Edit: Eh - I see that my confusion runs deeper. @JPBergsma is right in that OR queries are less efficient in that you'd need to run two queries and will get duplicates; but @merkys is right that they can be matched by ID. Even so, I think this is a less important point than choosing the construct that the majority of backends can support without having to parse and unwrap the query string.

ml-evs commented 2 years ago

Just saw this blog post on Twitter and thought it would nicely complement the SMARTS discussion for those who don't already know what it is: Easy way to visualize SMARTS

merkys commented 2 years ago

Citing myself:

Lastly, I wonder whether the syntax and interpretation of SMARTS is the same across these three packages [Indigo Toolkit, Open Babel and RDKit]. As I do not use SMARTS much, I cannot comment, however.

Saubern et al., 2011 present some evidence that SMARTS are understood differently by different cheminformatics packages, the fact which I was almost sure about. Nevertheless, we will have to live with that - I hope the differences are minimal.

BobHanson commented 2 years ago

I'll weigh in here.

Canonicalization. This is a much misunderstood term. "Canonicalization" is a local database strategy that can be used to do a rapid string search for a specific compound. Databases should not/do not require that a user use any particular canonicalization. Maybe they use OpenSmiles v. 2.0.5; maybe they use something else. It doesn't matter in OPTIMADE context, because nobody cares what canonicalization the implementer used. What the database does is to convert the SMILES query to their specifically implemented canonicalization so that they can do a direct string match. That's all. Think of "canonicalization" as similar to "software name and version." It's just a given algorithm written at a specific time.

Point: Don't worry about canonicalization.

SMARTS. This is the real power of the SMILES business. The goal is to find substructures within a database - all the compounds that have six-membered aromatic rings with adjacent OH groups, for example a1aa(O[H])a(O[H])aa1. Some databases can do this; others cannot. Again, no relevance of canonicalization. This is a model search, not a string search. But not every database can do this sort of thing.

I agree completely that any SMILES needs to be its own property (perhaps as chemical_SMILES).

merkys commented 2 years ago

@BobHanson Thanks for your opinion here! Could you please also check out the related PR #392 and maybe approve if you agree?

ml-evs commented 2 years ago

Canonicalization. This is a much misunderstood term. "Canonicalization" is a local database strategy that can be used to do a rapid string search for a specific compound. Databases should not/do not require that a user use any particular canonicalization. Maybe they use OpenSmiles v. 2.0.5; maybe they use something else. It doesn't matter in OPTIMADE context, because nobody cares what canonicalization the implementer used. What the database does is to convert the SMILES query to their specifically implemented canonicalization so that they can do a direct string match. That's all. Think of "canonicalization" as similar to "software name and version." It's just a given algorithm written at a specific time.

Point: Don't worry about canonicalization.

Bit of a tangent, but this same reasoning also worried me a bit about the way we standardized chemical formulae to be alphabetical in elements, should we really return zero results for ?filter=chemical_formula_reduced="SiO2", or should we ask database to handle this themselves (provided the return formulae are in the canonical order)? Does adding this feature suggest we need to rethink how we handle our string fields (chemical_formula_reduced being the only really important one, I would argue)?

merkys commented 2 years ago

@ml-evs

Bit of a tangent, but this same reasoning also worried me a bit about the way we standardized chemical formulae to be alphabetical in elements, should we really return zero results for ?filter=chemical_formula_reduced="SiO2", or should we ask database to handle this themselves (provided the return formulae are in the canonical order)? Does adding this feature suggest we need to rethink how we handle our string fields (chemical_formula_reduced being the only really important one, I would argue)?

There is an ongoing discussion in #416 regarding symmetry properties which I believe may be related as well. I think that canonicalization may be delegated to providers, but if so, it has to be well-specified. Otherwise databases will differ in the way they do it, and we risk returning to pre-OPTIMADE state. Also, query canonicalization will put a strain on providers, not sure if negligible.

BobHanson commented 2 years ago

I really would not worry about canonicalization of SMILES. As long as the SMILES is valid (here specifying OpenSmiles is sufficient) everyone understands valid as generally acceptable. Note that one of the key aspects that distinguishes OpenSmiles is its treatment of aromaticity (which is not the organic chemist's typical view). [1] A key point there is that there is ample use of MAY and PREFERRED rather than MUST. So, for example, aromatic atoms MAY be represented with lower case letters but need not be.

I think the real question is on query. MUST a repository be able to process a SMILES query in a meaningful noncanonical sense, or MAY it treat it as an exact string?

Apologies if this has already been decided and I am repeating myself. Probably have missed a few clicks of this discussion.

[1] http://opensmiles.org/opensmiles.html

ml-evs commented 2 years ago

I think the real question is on query. MUST a repository be able to process a SMILES query in a meaningful noncanonical sense, or MAY it treat it as an exact string? Apologies if this has already been decided and I am repeating myself. Probably have missed a few clicks of this discussion. [1] http://opensmiles.org/opensmiles.html

Hi @BobHanson, I think this has been decided for SMILES, my comment is about whether we should adopt the same approach for simpler fields like chemical formula too.

BobHanson commented 2 years ago

Ah. That makes more sense. Shouldn't that be a different thread and PR? Not #368?

rartino commented 2 years ago

this same reasoning also worried me a bit about the way we standardized chemical formulae to be alphabetical in elements, should we really return zero results for ?filter=chemical_formula_reduced="SiO2", or should we ask database to handle this themselves (provided the return formulae are in the canonical order)? Does adding this feature suggest we need to rethink how we handle our string fields (chemical_formula_reduced being the only really important one, I would argue)?

My take on this as an implementer is that I really want fields to have clear data types with strict comparison operator semantics. So, if chemical_formula is a string, then I want = to always mean normal string comparison - no: "but for this field equality also holds if the string has the same elements in a different order". Early drafts of OPTIMADE headed in this direction with each field describing its own operator rules, and IMO that leads to madness (and highly non-interoperable implementations).

Nevertheless, chemical formulas are obviously a major thing for us. So, if unordered element-wise comparison is useful, I see no issue with redefining chemical_formula_reduced to be a new chemical formula data type with its own clear comparison semantics, i.e., with = meaning unordered comparison over elements. But are < and > allowed? What do they mean? Furthermore, if used also for chemical_formula_descriptive, we need to figure out how = works for constructs with parentheses, brackets, etc.
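A minimal sketch of what unordered element-wise equality could look like for simple formulas, assuming only element symbols followed by optional integer counts (no parentheses, brackets or charges, so it does not cover chemical_formula_descriptive). The function names are hypothetical.

```python
import re
from collections import Counter

_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula: str) -> Counter:
    """Parse a simple formula such as 'SiO2' into per-element counts."""
    if not _FORMULA.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts: Counter = Counter()
    for symbol, number in _TOKEN.findall(formula):
        counts[symbol] += int(number) if number else 1
    return counts


def formulas_equal(left: str, right: str) -> bool:
    """Unordered comparison: same elements with the same counts."""
    return element_counts(left) == element_counts(right)


def to_alphabetical(formula: str) -> str:
    """Rewrite a formula with its elements in alphabetical order."""
    counts = element_counts(formula)
    return "".join(
        symbol + (str(counts[symbol]) if counts[symbol] > 1 else "")
        for symbol in sorted(counts)
    )


same = formulas_equal("SiO2", "O2Si")
different = formulas_equal("SiO2", "SiO")
reordered = to_alphabetical("SiO2")
```

The same parsing step serves either design: a dedicated data type would use `formulas_equal` for =, while a provider that keeps plain string comparison could apply `to_alphabetical` to the queried value before matching. Neither says anything about what < and > should mean.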