Closed mpagni12 closed 1 year ago
We might replace all "Short InChIKey" entries with their "real", stereochemically undefined counterparts, such as those ending in "-UHFFFAOYSA-N".
I spent some time thinking about this and how to implement it without having to change the whole architecture we have now. I ended up with this proposal: we add the IK2D as an attribute of annotation entities (together with the 2D SMILES), and we link these annotations to the corresponding IKs that share the same first part (IK2D). As a result, we only have ChemicalEntity objects with a full IK, stereochemically defined or not. Advantages:
I hope this is clear, let me know if it is ok for you 🚀
Yes, it makes sense, if I understand it correctly.
I kept ik2d as links between annotations and structures defined by full IK, but they are not ChemicalEntity anymore. 3a6ae8e
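To make the linking scheme discussed above concrete, here is a minimal Python sketch. It is not ENPKG code: the helper `ik2d`, the `structures` list, and the `annotation` dictionary are all hypothetical names, and the InChIKeys are only illustrative input. It assumes that an annotation carries just the first InChIKey block, and that structures are identified by their full key.

```python
from collections import defaultdict


def ik2d(inchikey: str) -> str:
    """Return the first (2D connectivity) block of a full InChIKey."""
    return inchikey.split("-", 1)[0]


# Hypothetical structures, each identified by its full InChIKey.
structures = [
    "QNAYBMKLOCPYGJ-REOHSZJFSA-N",  # stereo defined
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  # stereo defined, other isomer
    "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",  # same skeleton, stereo undefined
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",  # unrelated skeleton
]

# Index the full keys by their first block.
structures_by_ik2d = defaultdict(list)
for key in structures:
    structures_by_ik2d[ik2d(key)].append(key)

# Hypothetical annotation that only knows the 2D part.
annotation = {"id": "annotation_1", "ik2d": "QNAYBMKLOCPYGJ"}

# The annotation links to every full key sharing that first block.
linked = structures_by_ik2d[annotation["ik2d"]]
print(linked)
```

In this sketch the IK2D is only a lookup key between annotations and structures; it never becomes an entity of its own.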
I still completely disagree with the chemical compounds representation in ENPKG! I would use the full InChIKey as identifier for any chemical entity, in any case. And I would add a property like enpkg:has_InChI_2D to allow searching for common 2D structure.
Below is a SPARQL UPDATE example showing how to automatically split the three parts of an InChIKey (from a project in collaboration with a team at CEA, by the way):
The fundamental reason I insist on using the full InChIKey as identifier is the following: many additional sources of interesting molecules could be added to the graph, e.g. from MetaNetX or LOTUS. These will have no, partial, incomplete or complete stereo. Aggregating redundant molecules from different sources is easy to perform by using a UNIQUE canonicalisation scheme. The full InChIKey is the best we have at hand for organic chemistry. The first part of the InChIKey alone is far too application-specific.
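As a rough illustration of this argument, here is a minimal Python sketch (Python rather than SPARQL, and not taken from ENPKG or the CEA project). It assumes the standard InChIKey layout of a 14-letter block, a 10-letter block and a single final letter. The names `split_inchikey` and `merge_sources`, the source labels, and the example keys are all hypothetical.

```python
import re

# Assumed layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{10})-([A-Z])$")


def split_inchikey(key: str) -> tuple:
    """Split a full InChIKey into its three hyphen-separated parts."""
    match = INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a full InChIKey: {key!r}")
    return match.groups()


def merge_sources(*sources):
    """Aggregate molecules from several sources, keyed by full InChIKey."""
    merged = {}
    for source_name, keys in sources:
        for key in keys:
            split_inchikey(key)  # validate; raises on malformed keys
            merged.setdefault(key, set()).add(source_name)
    return merged


# Hypothetical sources with different levels of stereo information.
source_a = ("source_a", ["QNAYBMKLOCPYGJ-REOHSZJFSA-N",
                         "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"])
source_b = ("source_b", ["QNAYBMKLOCPYGJ-REOHSZJFSA-N",
                         "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"])

merged = merge_sources(source_a, source_b)
for key, origins in sorted(merged.items()):
    print(split_inchikey(key), sorted(origins))
```

The molecule present in both sources collapses into one entry, while the stereo-undefined key with the same first block stays a separate entity. Grouping on the first part alone would have merged the two.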
Secondly, instances of enpkg:InChIkey are endowed with the property enpkg:has_smiles: I am pretty sure that we are going to accumulate SMILES from different origins, so a more precise predicate would be better, e.g. enpkg:has_original_WD_SMILES
and