Actelion / openchemlib

Open source Java-based chemistry library

Substructure search with wildcard #70

Open c-ruttkies opened 2 years ago

c-ruttkies commented 2 years ago

Hi,

I have a problem using SSSearcher with query features: I cannot explain why it behaves the way it does.

Consider the following code snippet:

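// Imports assumed for this snippet: OpenChemLib classes and JUnit 4
// (adjust the JUnit imports if you use JUnit 5).
import static org.junit.Assert.assertTrue;

import java.util.stream.IntStream;

import org.junit.Test;

import com.actelion.research.chem.MolfileParser;
import com.actelion.research.chem.Molecule;
import com.actelion.research.chem.SSSearcher;
import com.actelion.research.chem.StereoMolecule;

public class SubstructureSearcherTest {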

final private String query = "\n"
        + "  MJ210900                      \n"
        + "\n"
        + "  8  8  0  0  0  0  0  0  0  0999 V2000\n"
        + "   -0.4035    0.4148    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "   -1.1180    0.0023    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "   -1.1180   -0.8228    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "   -0.4035   -1.2354    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "    0.3109   -0.8228    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "    0.3109    0.0023    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "    1.1180    0.1738    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "   -0.3173    1.2354    0.0000 A   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "  1  2  2  0  0  0  0\n"
        + "  2  3  1  0  0  0  0\n"
        + "  3  4  2  0  0  0  0\n"
        + "  4  5  1  0  0  0  0\n"
        + "  5  6  2  0  0  0  0\n"
        + "  6  1  1  0  0  0  0\n"
        + "  6  7  1  0  0  0  0\n"
        + "  1  8  1  0  0  0  0\n"
        + "M  END\n"
        + "";

final private String target = "\n"
        + "  MJ210900                      \n"
        + "\n"
        + "  9  9  0  0  0  0  0  0  0  0999 V2000\n"
        + "   -0.4035    0.4148    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "   -1.1180    0.0023    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "   -1.1180   -0.8228    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "   -0.4035   -1.2354    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "    0.3109   -0.8228    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "    0.3109    0.0023    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "    1.1180    0.1738    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "   -0.3173    1.2354    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "   -0.4888    2.0423    0.0000 A   0  0  0  0  0  0  0  0  0  0  0  0\n"
        + "  1  2  2  0  0  0  0\n"
        + "  2  3  1  0  0  0  0\n"
        + "  3  4  2  0  0  0  0\n"
        + "  4  5  1  0  0  0  0\n"
        + "  6  1  1  0  0  0  0\n"
        + "  5  6  2  0  0  0  0\n"
        + "  1  8  1  0  0  0  0\n"
        + "  8  9  1  0  0  0  0\n"
        + "  6  7  1  0  0  0  0\n"
        + "M  END\n"
        + "";

@Test
public void checkSimpleSubstructure() {
    System.out.println(this.query);
    System.out.println(this.target);
    final MolfileParser parser = new MolfileParser();
    final StereoMolecule target = new StereoMolecule();
    parser.parse(target, this.target);
    final StereoMolecule query = new StereoMolecule();
    parser.parse(query, this.query);
    query.setFragment(true);
    this.addQueryFeatures(query);
    final SSSearcher matcher = new SSSearcher();
    matcher.setMolecule(target);
    matcher.setFragment(query);
    assertTrue(matcher.isFragmentInMolecule());
}

private void addQueryFeatures(StereoMolecule molecule) {
    IntStream.range(0, molecule.getAtoms()).boxed()
        .filter(idx -> !molecule.getAtomLabel(idx).equals("A"))
        .forEach(idx -> molecule.setAtomQueryFeature(idx, Molecule.cAtomQFNoMoreNeighbours, true));
}

}

I have two molfile strings, one query and one target. Both contain an 'A' for any atom. I add the query feature Molecule.cAtomQFNoMoreNeighbours to every atom of the query except the 'A' atom. In my opinion, matcher.isFragmentInMolecule() should return true.

I made some strange observations while investigating this. When I replace the 'A' with a 'C' atom in the target, matcher.isFragmentInMolecule() returns true. When I keep the 'A' atom in the target and don't add the query features (i.e. comment out this.addQueryFeatures(query);), matcher.isFragmentInMolecule() also returns true. Can you explain what's happening here and whether this is expected?

Thanks, Christoph

Might also be interesting for @lutzweber

thsa commented 2 years ago

Dear Christoph,

The behaviour is correct. When your molfiles are parsed, the A atom is converted into an 'any atom' query feature. For both molfiles the parser then concludes that the entity is a fragment rather than a molecule, because molecules cannot contain query features. The substructure search is therefore a fragment-in-fragment search. In that case atoms only match if the query atom matches all incarnations of the target atom. This means that any constraint on a query atom must also exist on the target atom. Here, the 'no more neighbours' condition on the sulfur does not exist in the target. The target sulfur therefore includes incarnations that are not covered by the more restricted query, so it is not a match.
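To illustrate the rule described above, here is a minimal, self-contained toy model. It is only a sketch and does not use OpenChemLib: the class FragmentMatchSketch, the enum Constraint and the method matchesAllIncarnations are hypothetical names invented for this example, and the real SSSearcher logic is certainly more involved. The sketch assumes each atom simply carries a set of constraints, and that a query atom matches a target atom only if every query constraint is also present on the target.

```java
import java.util.EnumSet;
import java.util.Set;

/** Toy model only: hypothetical names, not part of the OpenChemLib API. */
final class FragmentMatchSketch {

    /** Hypothetical stand-ins for atom query features. */
    enum Constraint { NO_MORE_NEIGHBOURS, AROMATIC }

    /**
     * Fragment-in-fragment rule as described in this thread: every constraint
     * on the query atom must also exist on the target atom. Otherwise the
     * target atom allows incarnations that the query atom excludes.
     */
    static boolean matchesAllIncarnations(Set<Constraint> query, Set<Constraint> target) {
        return target.containsAll(query);
    }

    public static void main(String[] args) {
        Set<Constraint> restrictedSulfur = EnumSet.of(Constraint.NO_MORE_NEIGHBOURS);
        Set<Constraint> plainSulfur = EnumSet.noneOf(Constraint.class);

        // Restricted query atom vs. unrestricted target atom: no match.
        System.out.println(matchesAllIncarnations(restrictedSulfur, plainSulfur));

        // Same restriction on both sides: match.
        System.out.println(matchesAllIncarnations(restrictedSulfur, restrictedSulfur));

        // Unrestricted query atom vs. unrestricted target atom: match.
        System.out.println(matchesAllIncarnations(plainSulfur, plainSulfur));
    }
}
```

Under these assumptions the first case corresponds to the failing test (query sulfur with 'no more neighbours', target sulfur without it), and the third case corresponds to the observation that the search succeeds when addQueryFeatures is not called.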

I hope I could explain it properly.

Best wishes,

Thomas


In reply to c-ruttkies, sent Monday, 9 May 2022 12:54, to Actelion/openchemlib; subject: [Actelion/openchemlib] Substructure search with wildcard (Issue #70)
