Chemical entity - Githubissues

mpagni12 commented 1 year ago

I still completely disagree with the chemical compound representation in ENPKG! I would use the full InChIKey as the identifier for any chemical entity, in every case. And I would add a property like enpkg:has_InChI_2D to allow searching for a common 2D structure.

Below is a SPARQL UPDATE example showing how to automatically split the three parts of an InChIKey (from a project in collaboration with a team at CEA, by the way):

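# Split each InChIKey into its three hyphen-separated blocks.
INSERT {
    GRAPH <cea:k1k2k3> {
        ?c  cea:has_prop_ik1 ?ik1 ;
            cea:has_prop_ik2 ?ik2 ;
            cea:has_prop_ik3 ?ik3
    }
}
WHERE {
    ?c a mnx:CHEM ;
       mnx:inchikey ?ik .
    # SUBSTR is 1-based; the 9 + offset presumably skips a leading "InChIKey=" prefix.
    BIND( SUBSTR( ?ik, 9 + 1, 14 )  AS ?ik1 )  # first block, 14 characters
    BIND( SUBSTR( ?ik, 9 + 16, 10 ) AS ?ik2 )  # second block, 10 characters
    BIND( SUBSTR( ?ik, 9 + 27, 1 )  AS ?ik3 )  # third block, 1 character
}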
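To check the offsets outside a triple store, the same split can be sketched in Python. This is only an illustration: the function name `split_inchikey` is hypothetical, and the handling of an `InChIKey=` prefix rests on the assumption that this is what the `9 +` offset in the query above skips.

```python
import re
from typing import Tuple

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
_INCHIKEY_RE = re.compile(r"([A-Z]{14})-([A-Z]{10})-([A-Z])")
_PREFIX = "InChIKey="  # 9 characters; assumed to match the SPARQL offset


def split_inchikey(value: str) -> Tuple[str, str, str]:
    """Return the three hyphen-separated blocks of an InChIKey."""
    # Tolerate, but do not require, the "InChIKey=" prefix.
    key = value[len(_PREFIX):] if value.startswith(_PREFIX) else value
    match = _INCHIKEY_RE.fullmatch(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {value!r}")
    return match.group(1), match.group(2), match.group(3)


# Caffeine's InChIKey, with the prefix the query seems to expect.
print(split_inchikey("InChIKey=RYYVLZVUVIJVGH-UHFFFAOYSA-N"))
# → ('RYYVLZVUVIJVGH', 'UHFFFAOYSA', 'N')
```

Unlike the fixed offsets in the query, the regular expression rejects malformed values instead of silently returning shifted substrings.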

The fundamental reason I insist on using the full InChIKey as identifier is the following: many additional sources of interesting molecules could be added to the graph, e.g. from MetaNetX or LOTUS. These will come with no stereo, or with partial, incomplete or complete stereo. Aggregating redundant molecules from different sources is easy when a UNIQUE canonicalisation scheme is used. The full InChIKey is the best we have at hand for organic chemistry. The first part of the InChIKey alone is far too application-specific.
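A minimal sketch of that aggregation argument, in Python: if every source reports a full InChIKey, merging reduces to grouping on that key, and the first block remains available as a secondary index for 2D searches. The records and source labels are made up for illustration. The keys are believed to be those of L-alanine and of alanine without defined stereochemistry.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def aggregate_by_inchikey(records: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each full InChIKey to the set of sources that reported it."""
    merged: Dict[str, Set[str]] = defaultdict(set)
    for source, inchikey in records:
        merged[inchikey].add(source)
    return dict(merged)


# Made-up records; the source labels are only illustrative.
records = [
    ("source_a", "QNAYBMKLOCPYGJ-REOHSDEBSA-N"),  # stereo defined
    ("source_b", "QNAYBMKLOCPYGJ-REOHSDEBSA-N"),  # same molecule, other source
    ("source_b", "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"),  # no stereo
]

merged = aggregate_by_inchikey(records)
for inchikey, sources in sorted(merged.items()):
    print(inchikey, sorted(sources))

# The first block still groups the stereo variants of one 2D structure.
first_blocks = {key.split("-")[0] for key in merged}
print(first_blocks)
```

The two stereo variants stay distinct entities, while the duplicate reported by both sources collapses into one entry.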

Secondly, instances of enpkg:InChIkey are endowed with the property enpkg:has_smiles. I am pretty sure that we are going to accumulate SMILES from different origins, so a more precise predicate would be better, e.g. enpkg:has_original_WD_SMILES, together with

enpkg:has_original_WD_SMILES rdfs:subPropertyOf enpkg:has_SMILES
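The point of the rdfs:subPropertyOf declaration is that a query on the general predicate still finds values stored under the more precise one, provided the store applies RDFS entailment. A minimal Python sketch of that inference rule follows. The subject IRI, the SMILES value and the helper names are hypothetical, and a real triple store would do this through its own reasoning support rather than through code like this.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def super_properties(prop: str, sub_property_of: Dict[str, Set[str]]) -> Set[str]:
    """Return every property reachable from prop via rdfs:subPropertyOf."""
    seen: Set[str] = set()
    stack = [prop]
    while stack:
        current = stack.pop()
        for parent in sub_property_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def apply_subproperty_rule(triples: Set[Triple],
                           sub_property_of: Dict[str, Set[str]]) -> Set[Triple]:
    """Add (s, parent, o) for each (s, p, o) where p is a sub-property of parent."""
    inferred = set(triples)
    for s, p, o in triples:
        for parent in super_properties(p, sub_property_of):
            inferred.add((s, parent, o))
    return inferred


# Declaration taken from the comment above.
hierarchy = {"enpkg:has_original_WD_SMILES": {"enpkg:has_SMILES"}}
# Hypothetical data triple.
data = {("ex:compound1", "enpkg:has_original_WD_SMILES", "CC(N)C(=O)O")}

for triple in sorted(apply_subproperty_rule(data, hierarchy)):
    print(triple)
```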

oolonek commented 1 year ago

We might replace every "short InChIKey" with its "real", stereochemically undefined counterpart, i.e. the full key ending in "-UHFFFAOYSA-N".
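That replacement could look roughly like the Python sketch below. The function name is hypothetical, and the fixed suffix assumes a standard InChI with no stereo or isotopic information and the neutral protonation flag `N`, so the result should be treated as a placeholder key rather than a recomputed one.

```python
import re

# Second and third blocks of a standard InChIKey without stereo or isotopic layers.
_UNDEFINED_SUFFIX = "UHFFFAOYSA-N"


def to_stereo_undefined_key(first_block: str) -> str:
    """Expand a 14-letter short InChIKey into a full, stereo-undefined key."""
    if re.fullmatch(r"[A-Z]{14}", first_block) is None:
        raise ValueError(f"not a 14-letter InChIKey first block: {first_block!r}")
    return f"{first_block}-{_UNDEFINED_SUFFIX}"


print(to_stereo_undefined_key("RYYVLZVUVIJVGH"))
# → RYYVLZVUVIJVGH-UHFFFAOYSA-N
```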

ArnaudGaudry commented 1 year ago

I spent some time thinking about this and how to implement it without changing the whole architecture we have now. I ended up with this proposal: we add the IK2D as an attribute of annotation entities (together with the 2D SMILES), and we link these annotations to the corresponding IKs that share the same first part (IK2D). As a result, we only have ChemicalEntity objects with a full IK, stereochemically defined or not. Advantages:

Few changes are needed.
Results from other annotation tools are easy to add (since all structural annotation tools will return at least the short IK).
Interconnection with other structural databases, since we have only one type of ChemicalEntity, defined by a full IK.
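The proposal can be pictured with a small Python data model. This is a sketch under assumptions, not the graph builder's actual code: the class layout, the field names, the tool label and `link_annotations` are all hypothetical. Only the idea comes from the thread, namely that annotations carry an IK2D and chemical entities always carry a full IK.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ChemicalEntity:
    inchikey: str  # always a full InChIKey, stereo defined or not

    @property
    def ik2d(self) -> str:
        """First block of the InChIKey, shared by all stereo variants."""
        return self.inchikey.split("-")[0]


@dataclass(frozen=True)
class Annotation:
    tool: str       # hypothetical label for the annotation tool
    ik2d: str       # short InChIKey returned by the tool
    smiles_2d: str  # 2D SMILES returned by the tool


def link_annotations(annotations: Iterable[Annotation],
                     entities: Iterable[ChemicalEntity]
                     ) -> Dict[Annotation, List[ChemicalEntity]]:
    """Link each annotation to every entity whose first block equals its IK2D."""
    by_ik2d: Dict[str, List[ChemicalEntity]] = {}
    for entity in entities:
        by_ik2d.setdefault(entity.ik2d, []).append(entity)
    return {a: by_ik2d.get(a.ik2d, []) for a in annotations}


entities = [
    ChemicalEntity("QNAYBMKLOCPYGJ-REOHSDEBSA-N"),
    ChemicalEntity("QNAYBMKLOCPYGJ-UHFFFAOYSA-N"),
    ChemicalEntity("RYYVLZVUVIJVGH-UHFFFAOYSA-N"),
]
annotation = Annotation(tool="tool_a", ik2d="QNAYBMKLOCPYGJ", smiles_2d="CC(N)C(=O)O")

links = link_annotations([annotation], entities)
print([e.inchikey for e in links[annotation]])
```

One annotation can thus point to several full-IK entities, while nothing in the model is identified by a short key alone.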

I hope this is clear, let me know if it is ok for you 🚀

mpagni12 commented 1 year ago

Yes, it makes sense, if I understand it correctly.

ArnaudGaudry commented 1 year ago

I kept ik2d as links between annotations and structures defined by a full IK, but they are no longer ChemicalEntity objects. 3a6ae8e

enpkg / enpkg_graph_builder

Chemical entity #6