cdk / cdk

The Chemistry Development Kit
https://cdk.github.io/
GNU Lesser General Public License v2.1
485 stars 157 forks source link

breaking change in SeedGenerator? #914

Closed uli-f closed 12 months ago

uli-f commented 1 year ago

I updated my project to cdk28 and realized that the hash values for IAtomContainers changed.

It took me a while to track this down. I finally figured out that a change in org.openscience.cdk.hash.SeedGenerator.generate is responsible for this that puts a positive sign on the product of atomHash and seed:

for (int i = 0; i < n; i++) {
  int atomHash = encoder.encode(container.getAtom(i), container);
  hashes[i] = distribute((seed * atomHash) & 0x7fffffff);
}

I would consider this a breaking change as hash values calculated with earlier versions of CDK (in my case 271) changed with CDK 2.8. But I fully appreciate that this is my personal perspective.

Anyway, I added two test cases to SeedGeneratorTest in PR #913 to catch a similar change in the future and make us aware of the change so that it potentially can be included in the release notes.

johnmay commented 1 year ago

Oh it was almost certainly an unexpected change we should revert..

johnmay commented 1 year ago

As a rule of thumb though... these hashes are meant to be fleeting and tied to a specific version of CDK. Otherwise you get into an InChI situation where they will never change anything :-).

johnmay commented 1 year ago

Was this the change, https://github.com/cdk/cdk/commit/5b5fcd3d5d08d27eec60f1ebef699cf35c3247c2#diff-05553058975e75eb43cf0b94414f29f44510630a0a647610876b5446c4b55a93.

You can see it was part of a large change set from a SonarCloud alert. Please update to keep the old behaviour if you need it. I don't think it's unreasonable for hashes to change but this accidental.

uli-f commented 1 year ago

As a rule of thumb though... these hashes are meant to be fleeting and tied to a specific version of CDK. Otherwise you get into an InChI situation where they will never change anything :-).

I am happy to fully acknowledge that this is the contract for HashGenerator.

But I do think there is some middle grounds where you don't guarantee that the hashes stay the same, but still have tests in place to be aware if they change and put something in the release notes.

If this is something we can agree on for the HashGenerator I'd be happy to

Only downside I see is that the tests have to be changed as well if there actually are changes which is just is a bit of extra work.

uli-f commented 1 year ago

Was this the change, 5b5fcd3#diff-05553058975e75eb43cf0b94414f29f44510630a0a647610876b5446c4b55a93.

It seems to be this commit here where you added to put a positive sign on the long that is the product of atomHash and seed:

https://github.com/cdk/cdk/commit/e381f124dc7b0f0eb276f5b3681ecd042299c446

johnmay commented 1 year ago

Yeah I see it, for some reason a downstream test broke:

https://pubchem.ncbi.nlm.nih.gov/compound/44333798 https://pubchem.ncbi.nlm.nih.gov/compound/57170558

Initially without a stronger tie-break these (should) generated the same HASH. But with that int -> long they didn't. I will look deeper as may be there is something else wrong downstream. I am fully happy to revert back.

@Test
void testScenario() {

    IAtomContainer cid4433798 = cid44333798();
    IAtomContainer cid57170558 = cid57170558();

    MoleculeHashGenerator basic = new HashGeneratorMaker().depth(12).elemental().perturbed().molecular();
    // basic equivalence method can't tell these apart
    assertThat(basic.generate(cid4433798), is(basic.generate(cid57170558)));

    MoleculeHashGenerator cmplx = new HashGeneratorMaker().depth(12).elemental()
            .perturbWith(new MinimumEquivalentCyclicSetUnion()).molecular();

    // complex equivalence method can tell these apart
    assertThat(cmplx.generate(cid4433798), is(not(cmplx.generate(cid57170558))));
}

Out of interest what do you do for your "final check", my opinion on this has changed since I originally wrote this code such that I would now just generate a good Canonical SMILES and then add/remove the properties I want.

https://nextmovesoftware.github.io/molhash/introduction.html

https://github.com/nextmovesoftware/molhash/

uli-f commented 1 year ago

Yeah I see it, for some reason a downstream test broke:

https://pubchem.ncbi.nlm.nih.gov/compound/44333798 https://pubchem.ncbi.nlm.nih.gov/compound/57170558

Yes, I would expect any test to break as soon as any of its atomHashes is a negative number due to & 0x7fffffff operation.

I don't think the change of seed from int to long had any impact here on the actual outcome as

Initially without a stronger tie-break these (should) generated the same HASH. But with that int -> long they didn't. I will look deeper as may be there is something else wrong downstream. I am fully happy to revert back.

I got a fair bit of integration tests in place on my side, and for the same reaction canonical smiles did not change, but the hashes changed when switching from CDK271 to CDK28. So I am pretty sure I found the reason for that in the commit https://github.com/cdk/cdk/commit/e381f124dc7b0f0eb276f5b3681ecd042299c446 pointed out above. But I am very happy to stand corrected in case you come up with anything else :)

Out of interest what do you do for your "final check", my opinion on this has changed since I originally wrote this code such that I would now just generate a good Canonical SMILES and then add/remove the properties I want.

I am not quite sure if I understand your question... 🤷🏼 Basically, I have subgraphs that I would like to uniquely identify by assigning numbers to them. So in my use case, the hashes seem like an appropriate solution.

Can you elaborate on what you mean by generate a good Canonical SMILES and then add/remove the properties I want? I assume using this approach I would end up with strings instead of numbers as identifiers?

johnmay commented 1 year ago

I don't think the change of seed from int to long had any impact here on the actual outcome as

It does change the hash, but yes the masking also changes it, just testing locally. But that test failing is an artifact of better hashing.

I am not quite sure if I understand your question... 🤷🏼 Basically, I have subgraphs that I would like to uniquely identify by assigning numbers to them. So in my use case, the hashes seem like an appropriate solution.

Can you elaborate on what you mean by generate a good Canonical SMILES and then add/remove the properties I want? I assume using this approach I would end up with strings instead of numbers as identifiers?

This is a hash, it does not uniquely identify, it's only probabilistic as a quick first check. There can/will be false positives even if low.

The final check is really you need an isomorphism check or a variable length canonical identifier. The realisation I have now is the hash is essentially as expensive as just computing the canonical identifier (basically the same algorithm) so you may as well just do that. The CDK Canonical SMILES is not there yet as a suitable replacement but I can/intend to add some updates. It's the same issue with Canonical SMILES, they should not change between releases but only a major milestones.

To clarify I absolutely agree this should not have changed, it was a mistake.

uli-f commented 1 year ago

This is a hash, it does not uniquely identify, it's only probabilistic as a quick first check. There can/will be false positives even if low.

Thanks, that makes it clear. I am quite happy to accept a low probability of clashes.

The final check is really you need an isomorphism check or a variable length canonical identifier.

So the variable length canonical identifier would be a string?

To clarify I absolutely agree this should not have changed, it was a mistake.

Does that mean you are opting for reverting the commit https://github.com/cdk/cdk/commit/e381f124dc7b0f0eb276f5b3681ecd042299c446?

johnmay commented 1 year ago

Yep, need to play around a little,

johnmay commented 1 year ago

Merged in the test case, still have to option revert though although I believe your view is hopefully going forwards we should be able to detect the change and warn due to the new test?

uli-f commented 1 year ago

Thanks for merging.

I am happy to keep the implementation with the changed hash values, i.e. no need to revert.

To make it more likely that any (un)intential changes are picked up by tests I would like to add a few test cases to SeedGeneratorTest and assert that their hashes are equal to an expected value. The expected value would be calculated with the latest release of CDK (=2.8) so that we would be able to pick up any changes introduced before the next release.

I would also like to add a sentence to the javadoc of SeedGenerator: The hash values generated by this class are not guaranteed to stay the same over time. However, the intention is to mention any changes to the generated hash values in the release notes.

Happy to listen to any input/ideas/opinion.

johnmay commented 1 year ago

Sounds good

uli-f commented 1 year ago

Opened PR #940

After giving it some thought I added the test and the documentation further up the hash value generator call stack. More specifically, I added

My intention is to catch any changes to the computation of the hash values for molecules, so this seems the right place. Whether the change originates from an implementation of MoleculeHashGenerator or one of the components used to compute the molecule hash doesn't make a difference to the outcome as I look at it.

I hope that makes sense to you, too.

uli-f commented 12 months ago

Closed as PR #940 was merged.