Closed uli-f closed 1 year ago
Oh it was almost certainly an unexpected change we should revert..
As a rule of thumb though... these hashes are meant to be fleeting and tied to a specific version of CDK. Otherwise you get into an InChI situation where they will never change anything :-).
Was this the change, https://github.com/cdk/cdk/commit/5b5fcd3d5d08d27eec60f1ebef699cf35c3247c2#diff-05553058975e75eb43cf0b94414f29f44510630a0a647610876b5446c4b55a93.
You can see it was part of a large change set from a SonarCloud alert. Please update to keep the old behaviour if you need it. I don't think it's unreasonable for hashes to change but this accidental.
As a rule of thumb though... these hashes are meant to be fleeting and tied to a specific version of CDK. Otherwise you get into an InChI situation where they will never change anything :-).
I am happy to fully acknowledge that this is the contract for HashGenerator
.
But I do think there is some middle grounds where you don't guarantee that the hashes stay the same, but still have tests in place to be aware if they change and put something in the release notes.
If this is something we can agree on for the HashGenerator
I'd be happy to
Only downside I see is that the tests have to be changed as well if there actually are changes which is just is a bit of extra work.
Was this the change, 5b5fcd3#diff-05553058975e75eb43cf0b94414f29f44510630a0a647610876b5446c4b55a93.
It seems to be this commit here where you added to put a positive sign on the long
that is the product of atomHash
and seed
:
https://github.com/cdk/cdk/commit/e381f124dc7b0f0eb276f5b3681ecd042299c446
Yeah I see it, for some reason a downstream test broke:
https://pubchem.ncbi.nlm.nih.gov/compound/44333798 https://pubchem.ncbi.nlm.nih.gov/compound/57170558
Initially without a stronger tie-break these (should) generated the same HASH. But with that int -> long they didn't. I will look deeper as may be there is something else wrong downstream. I am fully happy to revert back.
@Test
void testScenario() {
IAtomContainer cid4433798 = cid44333798();
IAtomContainer cid57170558 = cid57170558();
MoleculeHashGenerator basic = new HashGeneratorMaker().depth(12).elemental().perturbed().molecular();
// basic equivalence method can't tell these apart
assertThat(basic.generate(cid4433798), is(basic.generate(cid57170558)));
MoleculeHashGenerator cmplx = new HashGeneratorMaker().depth(12).elemental()
.perturbWith(new MinimumEquivalentCyclicSetUnion()).molecular();
// complex equivalence method can tell these apart
assertThat(cmplx.generate(cid4433798), is(not(cmplx.generate(cid57170558))));
}
Out of interest what do you do for your "final check", my opinion on this has changed since I originally wrote this code such that I would now just generate a good Canonical SMILES and then add/remove the properties I want.
https://nextmovesoftware.github.io/molhash/introduction.html
Yeah I see it, for some reason a downstream test broke:
https://pubchem.ncbi.nlm.nih.gov/compound/44333798 https://pubchem.ncbi.nlm.nih.gov/compound/57170558
Yes, I would expect any test to break as soon as any of its atomHashes is a negative number due to & 0x7fffffff
operation.
I don't think the change of seed
from int
to long
had any impact here on the actual outcome as
seed
as an int
could potentially overflowlong
in this statement anyway: distribute((seed * atomHash) & 0x7fffffff)
Initially without a stronger tie-break these (should) generated the same HASH. But with that int -> long they didn't. I will look deeper as may be there is something else wrong downstream. I am fully happy to revert back.
I got a fair bit of integration tests in place on my side, and for the same reaction canonical smiles did not change, but the hashes changed when switching from CDK271 to CDK28. So I am pretty sure I found the reason for that in the commit https://github.com/cdk/cdk/commit/e381f124dc7b0f0eb276f5b3681ecd042299c446 pointed out above. But I am very happy to stand corrected in case you come up with anything else :)
Out of interest what do you do for your "final check", my opinion on this has changed since I originally wrote this code such that I would now just generate a good Canonical SMILES and then add/remove the properties I want.
I am not quite sure if I understand your question... 🤷🏼 Basically, I have subgraphs that I would like to uniquely identify by assigning numbers to them. So in my use case, the hashes seem like an appropriate solution.
Can you elaborate on what you mean by generate a good Canonical SMILES and then add/remove the properties I want? I assume using this approach I would end up with strings instead of numbers as identifiers?
I don't think the change of seed from int to long had any impact here on the actual outcome as
It does change the hash, but yes the masking also changes it, just testing locally. But that test failing is an artifact of better hashing.
I am not quite sure if I understand your question... 🤷🏼 Basically, I have subgraphs that I would like to uniquely identify by assigning numbers to them. So in my use case, the hashes seem like an appropriate solution.
Can you elaborate on what you mean by generate a good Canonical SMILES and then add/remove the properties I want? I assume using this approach I would end up with strings instead of numbers as identifiers?
This is a hash, it does not uniquely identify, it's only probabilistic as a quick first check. There can/will be false positives even if low.
The final check is really you need an isomorphism check or a variable length canonical identifier. The realisation I have now is the hash is essentially as expensive as just computing the canonical identifier (basically the same algorithm) so you may as well just do that. The CDK Canonical SMILES is not there yet as a suitable replacement but I can/intend to add some updates. It's the same issue with Canonical SMILES, they should not change between releases but only a major milestones.
To clarify I absolutely agree this should not have changed, it was a mistake.
This is a hash, it does not uniquely identify, it's only probabilistic as a quick first check. There can/will be false positives even if low.
Thanks, that makes it clear. I am quite happy to accept a low probability of clashes.
The final check is really you need an isomorphism check or a variable length canonical identifier.
So the variable length canonical identifier would be a string?
To clarify I absolutely agree this should not have changed, it was a mistake.
Does that mean you are opting for reverting the commit https://github.com/cdk/cdk/commit/e381f124dc7b0f0eb276f5b3681ecd042299c446?
Yep, need to play around a little,
Merged in the test case, still have to option revert though although I believe your view is hopefully going forwards we should be able to detect the change and warn due to the new test?
Thanks for merging.
I am happy to keep the implementation with the changed hash values, i.e. no need to revert.
To make it more likely that any (un)intential changes are picked up by tests I would like to add a few test cases to SeedGeneratorTest
and assert that their hashes are equal to an expected value. The expected value would be calculated with the latest release of CDK (=2.8) so that we would be able to pick up any changes introduced before the next release.
I would also like to add a sentence to the javadoc of SeedGenerator
:
The hash values generated by this class are not guaranteed to stay the same over time. However, the intention is to mention any changes to the generated hash values in the release notes.
Happy to listen to any input/ideas/opinion.
Sounds good
Opened PR #940
After giving it some thought I added the test and the documentation further up the hash value generator call stack. More specifically, I added
MoleculeHashGenerator
BasicMoleculeHashGeneratorTest
My intention is to catch any changes to the computation of the hash values for molecules, so this seems the right place. Whether the change originates from an implementation of MoleculeHashGenerator
or one of the components used to compute the molecule hash doesn't make a difference to the outcome as I look at it.
I hope that makes sense to you, too.
Closed as PR #940 was merged.
I updated my project to cdk28 and realized that the hash values for
IAtomContainer
s changed.It took me a while to track this down. I finally figured out that a change in
org.openscience.cdk.hash.SeedGenerator.generate
is responsible for this that puts a positive sign on the product ofatomHash
andseed
:I would consider this a breaking change as hash values calculated with earlier versions of CDK (in my case 271) changed with CDK 2.8. But I fully appreciate that this is my personal perspective.
Anyway, I added two test cases to
SeedGeneratorTest
in PR #913 to catch a similar change in the future and make us aware of the change so that it potentially can be included in the release notes.