Materials-Consortia / OPTIMADE

Specification of a common REST API for access to materials databases
https://optimade.org/specification
Creative Commons Attribution 4.0 International

Add SMILES property #368

Open JPBergsma opened 3 years ago

JPBergsma commented 3 years ago

Do we want to allow SMILES strings in the chemical_formula_descriptive field? The SMILES notation for molecular formulas uses '#' and '$' to indicate triple and quadruple bonds, '/' and '\' to indicate whether bonds are in the cis or trans orientation, and '@' and '@@' to differentiate enantiomers. Finally, ring numbers with more than one digit have to be preceded by a '%' sign.
It therefore seems reasonable to me to add these characters to those allowed in the chemical_formula_descriptive field.

Or do you think we should add a separate SMILES field instead?
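To make the question concrete, here is a minimal Python sketch that reports which of the characters listed above occur in a given SMILES string. The table and the function name `special_chars_used` are hypothetical and only restate the characters named in this comment; the sketch does not reproduce the character set currently allowed for chemical_formula_descriptive.

```python
# Characters named in the comment above; the descriptions paraphrase it.
SMILES_SPECIAL_CHARS = {
    "#": "triple bond",
    "$": "quadruple bond",
    "/": "bond orientation (cis/trans)",
    "\\": "bond orientation (cis/trans)",
    "@": "enantiomer marker ('@' or '@@')",
    "%": "prefix for ring numbers with more than one digit",
}


def special_chars_used(smiles: str) -> list:
    """Return the listed special characters found in `smiles`, sorted, no duplicates."""
    return sorted({ch for ch in smiles if ch in SMILES_SPECIAL_CHARS})


if __name__ == "__main__":
    for example in ("C#N", "F/C=C/F", "N[C@@H](C)C(=O)O"):
        print(example, special_chars_used(example))
```

Any character this reports for a real SMILES string would have to be permitted by the field for that string to be a valid value.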

merkys commented 2 years ago

Nevertheless, chemical formulas are obviously a major thing for us. So, if unordered element-wise comparison is useful, I see no issue with redefining chemical_formula_reduced to be a new chemical formula data type with its own clear comparison semantics, i.e., with = meaning unordered comparison over elements. But are < and > allowed, and what would they mean? Furthermore, if it is also used for chemical_formula_descriptive, we need to figure out how = works for constructs with parentheses, brackets, etc.

I agree with @rartino here, but I would really prefer keeping things simple. My main concern is that both defining the new semantics and implementing them (properly) would require much effort.

BobHanson commented 2 years ago

What an interesting quandary. It seems to me that "reduced" here is a fine qualifier that can specify "O2Si" and not "SiO2". No one will know what "reduced" means unless they read the information anyway, and that information can explicitly say, "for example, 'O2Si', not 'SiO2' " to make it absolutely clear what is required. I am totally with the idea that a string is a string. (Except in the case of SMILES, which I would argue is a special case.) Machines will not care.

Bob

On Wed, Jul 6, 2022 at 2:02 PM Rickard Armiento @.***> wrote:

this same reasoning also worried me a bit about the way we standardized chemical formulae to be alphabetical in elements. Should we really return zero results for ?filter=chemical_formula_reduced="SiO2", or should we ask databases to handle this themselves (provided the returned formulae are in the canonical order)? Does adding this feature suggest we need to rethink how we handle our string fields (chemical_formula_reduced being the only really important one, I would argue)?

My take on this as an implementer is that I really want fields to have clear data types with strict comparison operator semantics. So, if chemical_formula is a string, then I want = to always mean normal string comparison - no: "but for this field equality also holds if the string has the same elements in a different order". Early drafts of OPTIMADE headed in this direction with each field describing its own operator rules, and IMO that leads to madness (and highly non-interoperable implementations).

Nevertheless, chemical formulas are obviously a major thing for us. So, if unordered element-wise comparison is useful, I see no issue with redefining chemical_formula_reduced to be a new chemical formula data type with its own clear comparison semantics, i.e., with = meaning unordered comparison over elements. But are < and > allowed, and what would they mean? Furthermore, if it is also used for chemical_formula_descriptive, we need to figure out how = works for constructs with parentheses, brackets, etc.
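The difference between the two semantics discussed above can be sketched in a few lines of Python. This is only an illustration, not part of the specification: the helper names `parse_formula` and `formula_equal` are hypothetical, and the tokenizer assumes a flat formula made of element symbols with optional integer counts, so it deliberately leaves the open question about parentheses and brackets unanswered.

```python
import re

# Assumed token shape: one uppercase letter, an optional lowercase letter,
# then an optional integer count (a missing count means 1).
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> dict:
    """Map each element symbol in a flat formula to its total count."""
    counts = {}
    pos = 0
    while pos < len(formula):
        match = _TOKEN.match(formula, pos)
        if match is None:
            raise ValueError(f"cannot parse {formula!r} at position {pos}")
        symbol, number = match.groups()
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
        pos = match.end()
    return counts


def formula_equal(a: str, b: str) -> bool:
    """Unordered, element-wise comparison of two flat formulas."""
    return parse_formula(a) == parse_formula(b)


if __name__ == "__main__":
    print("SiO2" == "O2Si")               # plain string comparison
    print(formula_equal("SiO2", "O2Si"))  # element-wise comparison
```

With plain string semantics the two spellings differ; with the element-wise comparison they are equal. What < and > would mean for the parsed form is not obvious, which is the point raised above.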

-- Robert M. Hanson Professor of Chemistry St. Olaf College Northfield, MN http://www.stolaf.edu/people/hansonr

If nature does not answer first what we want, it is better to take what answer we get.

-- Josiah Willard Gibbs, Lecture XXX, Monday, February 5, 1900

We stand on the homelands of the Wahpekute Band of the Dakota Nation. We honor with gratitude the people who have stewarded the land throughout the generations and their ongoing contributions to this region. We acknowledge the ongoing injustices that we have committed against the Dakota Nation, and we wish to interrupt this legacy, beginning with acts of healing and honest storytelling about this place.

JPBergsma commented 2 years ago

this same reasoning also worried me a bit about the way we standardized chemical formulae to be alphabetical in elements. Should we really return zero results for ?filter=chemical_formula_reduced="SiO2", or should we ask databases to handle this themselves (provided the returned formulae are in the canonical order)? Does adding this feature suggest we need to rethink how we handle our string fields (chemical_formula_reduced being the only really important one, I would argue)?

I think we should return an error message in this case, stating that the value for the chemical elements should be in alphabetical order.

BobHanson commented 1 year ago

I suggest if a server is expected to return an error, then it could just as easily normalize any order and continue. The algorithm would be about the same.
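A rough Python sketch of such a normalization step, for illustration only: `normalize_formula` is a hypothetical helper, it assumes a flat formula of element symbols with optional integer counts, and it does not merge repeated elements or reduce the counts.

```python
import re

# Assumed token shape: element symbol followed by an optional integer count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def normalize_formula(formula: str) -> str:
    """Return `formula` with its element tokens sorted alphabetically by symbol."""
    tokens = _TOKEN.findall(formula)
    # Reject input that the tokenizer could not consume completely.
    if "".join(symbol + count for symbol, count in tokens) != formula:
        raise ValueError(f"not a flat chemical formula: {formula!r}")
    return "".join(symbol + count for symbol, count in sorted(tokens))


if __name__ == "__main__":
    print(normalize_formula("SiO2"))
    print(normalize_formula("O2Si"))
```

A formula that is already in alphabetical order passes through unchanged, so the same function could run on every incoming value.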

On Fri, Sep 23, 2022 at 2:14 PM Johan Bergsma @.***> wrote:

this same reasoning also worried me a bit about the way we standardized chemical formulae to be alphabetical in elements, should we really return zero results for ?filter=chemical_formula_reduced="SiO2", or should we ask database to handle this themselves (provided the return formulae are in the canonical order)? Does adding this feature suggest we need to rethink how we handle our string fields (chemical_formula_reduced being the only really important one, I would argue)?

I think we should return an error message in this case, stating that the value for the chemical elements should be in alphabetical order.

JPBergsma commented 1 year ago

The code for checking whether each value is smaller than the next is much simpler than the code for a sorting algorithm. However, high-level programming languages usually provide their own sorting routines, so in terms of programming work it may not make much difference.

I think it would be good if a server gives an error when a query is malformed. It is easy to make a typo, and this way we can, at least in some cases, inform the user about it. This applies not only to the chemical formula fields but to all other fields as well. So, it would be good for consistency to return an error when a user gives an invalid chemical formula. In this case, ~it may be easy and unambiguous to convert it to a valid query~, but this is not possible for many of the other query fields. If we do want to accept SiO2, we must, in my opinion, update the OPTIMADE specification.

P.S. If the user queries for SC, did they mean to search for CS or Sc?
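For illustration, the ordering check could look like the Python sketch below. The names `element_symbols` and `is_alphabetical` are hypothetical, and the tokenizer assumes that an element symbol is one uppercase letter optionally followed by one lowercase letter.

```python
import re

# Assumed symbol shape: one uppercase letter, optionally one lowercase letter.
_SYMBOL = re.compile(r"[A-Z][a-z]?")


def element_symbols(formula: str) -> list:
    """Return the element symbols in the order they appear in `formula`."""
    return _SYMBOL.findall(formula)


def is_alphabetical(formula: str) -> bool:
    """True if every symbol sorts strictly before the one that follows it."""
    symbols = element_symbols(formula)
    return all(a < b for a, b in zip(symbols, symbols[1:]))


if __name__ == "__main__":
    for query in ("O2Si", "SiO2", "SC", "Sc"):
        print(query, element_symbols(query), is_alphabetical(query))
```

The SC case shows why silent correction is risky: under this tokenizer SC reads as S followed by C, so sorting would turn it into CS, while a user who mistyped the capitalization of Sc would get results for a different compound without any warning.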

merkys commented 1 year ago

I completely agree with @JPBergsma on reporting malformed queries as errors and on the possibility of relaxing the specification in the future. I would not hurry with the latter, though.

rartino commented 1 year ago

it seems to me that "reduced" here is a fine qualifier that can specify "O2Si" and not "SiO2". No one will know what "reduced" means unless they read the information anyway, and that information can explicitly say, "for example, 'O2Si', not 'SiO2' " to make it absolutely clear what is required.

Maybe I misunderstand you, but as far as I know the word "reduced" in chemical formula is rather meant to refer to the following requirement (quoted from the specification): "For structures with no partial occupation, the chemical proportion numbers are the smallest integers for which the chemical proportion is exactly correct." I think this is a fairly standard use of "reduced"?

There is no word in the field name meant to state the need to order elements. That is "just" a part of the specification ("elements MUST be placed in alphabetical order, followed by their integer chemical proportion number.")
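As an illustration of the two requirements quoted above, the Python sketch below builds a reduced formula from element counts: the counts are divided by their greatest common divisor and the elements are emitted in alphabetical order. The function name `reduced_formula` is hypothetical, the input is assumed to be a non-empty mapping of integer counts (no partial occupation), and omitting a proportion number of 1 is an assumption based on the 'O2Si' example used in this thread.

```python
from functools import reduce
from math import gcd


def reduced_formula(counts: dict) -> str:
    """Build a reduced formula from a non-empty {symbol: integer count} mapping."""
    divisor = reduce(gcd, counts.values())  # smallest integers: divide by the GCD
    parts = []
    for symbol in sorted(counts):  # elements in alphabetical order
        number = counts[symbol] // divisor
        # Assumption: a proportion number of 1 is left out, as in 'O2Si'.
        parts.append(symbol if number == 1 else f"{symbol}{number}")
    return "".join(parts)


if __name__ == "__main__":
    print(reduced_formula({"Si": 2, "O": 4}))
    print(reduced_formula({"H": 4, "O": 2}))
```

Here "reduced" comes from the division step and the ordering comes from the sort, which matches the distinction drawn above between the field name and the separate ordering rule.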

I think one ends up with rather different viewpoints here depending on whether one views OPTIMADE as "the user interface" for materials data queries or as "just" an underlying standardized communication protocol. I see no problem with, e.g., Jmol sorting the elements on behalf of a user who queries an OPTIMADE database for a chemical formula, before sending the query to OPTIMADE.

Totally with the idea that a string is a string. (Except in the case of SMILES, which I would argue is a special case.) Machines will not care.

We probably need to pick up the discussion again in the SMILES thread on what semantics people who want to filter on SMILES want. I would argue that if they are different from strings, there should be a SMILES datatype.

merkys commented 1 year ago

I think one ends up with rather different viewpoints here depending on whether one views OPTIMADE as "the user interface" for materials data queries or as "just" an underlying standardized communication protocol. I see no problem with, e.g., Jmol sorting the elements on behalf of a user who queries an OPTIMADE database for a chemical formula, before sending the query to OPTIMADE.

Well put. I view OPTIMADE as "just" an underlying standardized communication protocol, hence my animosity towards some of the provider-intensive extensions.

We probably need to pick up the discussion again in the SMILES thread on what semantics people who want to filter on SMILES want. I would argue that if they are different from strings, there should be a SMILES datatype.

Maybe this is the right thread to do so? Or, probably even better, we could put together an alternative to PR #392 that defines SMILES as a datatype with its own query semantics. Admittedly, I am not a fan (the COD will not be able to handle such queries, and I cannot see a way to elegantly introduce a SMILES datatype at the grammar level, although we do have a timestamp datatype), but I can start a draft.

merkys commented 1 year ago

As promised, I have created PR #436 introducing a SMILES data type.