OpenFreeEnergy / openfe

The Open Free Energy toolkit
https://docs.openfree.energy
MIT License
136 stars 19 forks source link

Graph generated by minimal_spanning_graph produces a different map every time #244

Closed djhuggins closed 1 year ago

djhuggins commented 1 year ago

I am generating a network (map) using minimal_spanning_graph:

network = minimal_spanning_graph(ligands=ligand_mols[0:], mappers=[LomapAtomMapper(threed=True, element_change=False),], scorer=scorer)

When I try it twice on the same input it produces two different maps.

dwhswenson commented 1 year ago

I thought that our minimal spanning graph was deterministic, so I'm a bit surprised to see this occur. Of course, the minimal spanning graph is not guaranteed to be unique, especially if the scorer assigns the same weight to multiple edges.

Can you check whether the total score of the network changes? You can do this with:

sum(e.annotations['score'] for e in network.edges)

How are you checking the equality of the networks? I'll note that the visualizations we use are not deterministic, and are expected to give a different display each time, although the actual network should be the same. Internally, a number of things are also stored in an order-independent way as well. However, if you compare two networks using the == operator, i.e., network1 == network2, that will tell you if the networks are the same.

If it is possible to provide a dataset that does this, that would be helpful. I've tried to reproduce with the benzene transformations dataset we use in testing, and have been unable to do so (repeated the calculation 100 times).

One possibility is LOMAP could sometimes fail to return valid mappings for some edges. This could happen if LOMAP reaches its timeout. So maybe one edge takes something close to the default timeout of 20 seconds, and depending on hardware fluctuations sometimes it finishes in time, sometimes not. (Doesn't seem terribly likely, but it is a potential source of different mappings.) You can change this by setting the time argument (timeout in seconds) when creating your LomapAtomMapper.

djhuggins commented 1 year ago

Hello. I noted the networks were different as I print out the components of each edge and they are different. The total score sometimes differs in the last digit (eg 0.48099076305907423 vs 0.4809907630590743). I will send you some input.

dwhswenson commented 1 year ago

Brief update: I looked into this yesterday; the TL;DR is that there's probably nothing to worry about. However, there is something that isn't behaving as we'd expect, and we probably still want to at least understand why, and if possible, fix it.

(Context for team: @djhuggins sent me a copy of his Python script for this and the ligand set, which is the TYK2 ligand set).

My guess is that the tiny difference in total score probably comes from rounding errors from adding the scores in a different order. If you're getting essentially the same total score, then you're almost certainly getting a valid minimal spanning graph, but just a different one.

I used the maximal network I started in #245 to check whether LOMAP was failing to map any edges (either due to timeout or other reason.) 16 nodes in the data set; got the expected 120 edges (16 * (16-1) /2) for a fully-connected network. So this makes a timeout failure very unlikely.

What's interesting is that if I redo the network creation in the same process, I get the same network. It's only if I switch Python sessions that I get a different one. This means that either (1) something is getting cached; or (2) the algorithm is using session-specific information.

My guess is (2), because I know that this algorithm uses sorting. I'm guessing is that somehow either id() or hash() are being used to generate sort keys. Neither of those are preserved across Python sessions. The question is whether that's happening on our end or in NetworkX, which I'll try to investigate today.

dwhswenson commented 1 year ago

Okay, got it figured out. This dives deep into Python internals. But the basic reason is that we store the nodes and edges of a network as frozensets. The order in which you iterate over a set or frozenset in Python is the order in which those objects appear in the internal hash table. (Until CPython 3.6, this was true of dicts as well.) Since Python 3.3, hashes are salted with a random value for each session (this is why I suspected a some ordering based on hash -- that's what it is, but it was coming neither from us nor from NetworkX, but from Python itself!)

So the issue is that when we create the NetworkX graph (which is then used by NetworkX to get the minimal spanning graph), the order differs between sessions, due to the Python hash seed. Since the minimal spanning algorithm is a greedy algorithm, it returns the first graph it finds, which depends on the order of the input.

I've confirmed this two ways:

  1. Setting the env variable PYTHONHASHSEED and running the script multiple times. With a fixed seed (or hashing turned off), I get the exact same output each time.
  2. Changing the code so that we sort our edges/nodes when we create the NetworkX object, which will, I believe, be the solution here. PR on the way to do this.

Note that when you loop over network.edges to print the edges (as in your script) the order of the edges will still be arbitrary (still a frozenset). However, the actual edges included will be the same.

richardjgowers commented 1 year ago

should be fixed in today's release