Global-Chem / global-chem

A Knowledge Graph of Common Chemical Names to their Molecular Definition
https://globalchemistry.org/
Mozilla Public License 2.0

YL-1 Add Insect Sex Pheromones to GlobalChem #320

Open Lyq322 opened 1 week ago

Lyq322 commented 1 week ago

Add chemicals from Insect Sex Pheromones by Martin Jacobson to GlobalChem

Lyq322 commented 1 week ago

I could not find the R/S configuration for this molecule: 'd-10-acetoxy-cis-7-hexadecen-1-ol': 'OCCCCCC\C=C/CC(OC(=O)C)CCCCCC'. Also, should I change the (+)/(-) and d/l labels in the SMILES list to R/S so that it is more consistent and easier to understand?
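
A rough way to see what the SMILES above already encodes, without any cheminformatics toolkit: in SMILES, tetrahedral centres are marked with `@`/`@@` and double-bond geometry with `/` and `\`. The helper name `stereo_marks` is hypothetical, and the check is purely textual, so treat it as a sketch and not as a substitute for a real parser.

```python
def stereo_marks(smiles: str) -> dict:
    """Report which kinds of stereo notation appear in a SMILES string.

    Purely textual: it does not parse the molecule or locate stereocentres.
    """
    return {
        # '@' / '@@' annotate tetrahedral (R/S-type) centres
        "tetrahedral": "@" in smiles,
        # '/' and '\' annotate cis/trans geometry around double bonds
        "double_bond": "/" in smiles or "\\" in smiles,
    }


# Raw string so the backslash in the SMILES is kept literally
smiles = r"OCCCCCC\C=C/CC(OC(=O)C)CCCCCC"
print(stereo_marks(smiles))  # {'tetrahedral': False, 'double_bond': True}
```

Under this check, the string carries the cis double-bond geometry but no mark on the acetoxy-bearing carbon, so the R/S configuration is simply not stated in the SMILES itself.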

Lyq322 commented 4 days ago

I calculated the Tanimoto similarity scores between this list and one of the tranches in the ZINC database (AAAA): [screenshot: similarity plot] I found that most of the molecules in this list are not similar to any of the molecules in that ZINC tranche. The maximum Tanimoto score was 0.3137, between these two molecules: From the ZINC database: [screenshot: molecule] From the SMILES list: [image: molecule]
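
For readers unfamiliar with the score, here is a minimal sketch of Tanimoto similarity on fingerprints represented as sets of on-bit indices, plus the "maximum score against a reference set" comparison described above. The function names and the toy fingerprints are invented for illustration; the actual calculation in this thread was presumably done with a cheminformatics library on real fingerprints.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity: shared on-bits / total distinct on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def max_similarity(query: set, references: list) -> float:
    """Highest Tanimoto score of the query against any reference fingerprint."""
    return max(tanimoto(query, ref) for ref in references)


# Toy fingerprints (hypothetical on-bit indices, not real molecules)
query = {1, 4, 7, 9}
references = [{1, 4, 20, 21}, {7, 30}, {2, 3}]
print(max_similarity(query, references))
```

A low maximum, like the 0.3137 reported here, means that even the closest reference molecule shares only a small fraction of fingerprint bits with the query.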

Sulstice commented 4 days ago

Interesting, nice plot! It's probably because their combination algorithms are more drug-design based and less applicable to other chemical spaces. Tanimoto scoring is pretty strict:


    # Assumes fp (query fingerprint), ref_fps (list of reference fingerprints),
    # criteria (score threshold) and value (molecule label) are defined earlier.
    from rdkit import DataStructs

    # Score the query fingerprint against every reference with each metric
    tanimoto_scores = DataStructs.BulkTanimotoSimilarity(fp, ref_fps)
    dice_scores = DataStructs.BulkDiceSimilarity(fp, ref_fps)
    kulczynski_scores = DataStructs.BulkKulczynskiSimilarity(fp, ref_fps)
    mcconnaughey_scores = DataStructs.BulkMcConnaugheySimilarity(fp, ref_fps)
    onbit_scores = DataStructs.BulkOnBitSimilarity(fp, ref_fps)
    rogot_goldberg_scores = DataStructs.BulkRogotGoldbergSimilarity(fp, ref_fps)
    russel_scores = DataStructs.BulkRusselSimilarity(fp, ref_fps)
    sokal_scores = DataStructs.BulkSokalSimilarity(fp, ref_fps)

    # A metric accepts the molecule only if every score exceeds the threshold
    if all(x > criteria for x in tanimoto_scores):
        print('Tanimoto Accepted: %s' % value)
    if all(x > criteria for x in dice_scores):
        print('Dice Accepted: %s' % value)
    if all(x > criteria for x in kulczynski_scores):
        print('Kulczynski Accepted: %s' % value)
    if all(x > criteria for x in mcconnaughey_scores):
        print('McConnaughey Accepted: %s' % value)
    if all(x > criteria for x in onbit_scores):
        print('On Bit Accepted: %s' % value)
    if all(x > criteria for x in rogot_goldberg_scores):
        print('Rogot Goldberg Accepted: %s' % value)
    if all(x > criteria for x in russel_scores):
        print('Russel Accepted: %s' % value)
    if all(x > criteria for x in sokal_scores):
        print('Sokal Accepted: %s' % value)
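
To make "strict" concrete: for binary fingerprints, the Dice score D and the Tanimoto score T of the same pair are related by D = 2T / (1 + T), so Dice is never lower than Tanimoto. The sketch below (the function name is hypothetical) applies that identity to the maximum Tanimoto score reported earlier in the thread.

```python
def dice_from_tanimoto(t: float) -> float:
    """Dice score implied by a Tanimoto score for the same pair of bit vectors."""
    return 2 * t / (1 + t)


t = 0.3137  # maximum Tanimoto score reported earlier in this thread
print(round(dice_from_tanimoto(t), 4))  # 0.4776
```

The same pair of molecules therefore scores noticeably higher under Dice, which is why a single fixed threshold accepts more molecules with Dice than with Tanimoto.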

There are a bunch of other similarity metrics as well that could be useful, but at first glance they don't look great. Can we compare against a fragrance database? The problem is that the data is usually sold rather than available as open source:

https://github.com/Odeuropa

I found this. Is there anything in here we can use?

Sulstice commented 4 days ago

@ANUGAMAGE Review this PR and add it as a node in global-chem. This will bump the version as well, and we can do a new release.

Lyq322 commented 4 days ago

[image: molecule] Is this molecule incomplete? The -yl suffix makes me think it's an ester. When I google the molecule, I also get results saying that the cyclopropyl propanoate ester, not the cyclopropane, is the pheromone of the American cockroach.

Sulstice commented 4 days ago

Maybe I wrote it wrong? I will check on it again. There's a little arrow and a star that says "Maybe not work". Idk what I meant there.