BIDS / datarray

Prototyping numpy arrays with named axes for data management.
http://bids.github.com/datarray
Other
87 stars 20 forks source link

Not-deterministic errors on Python 3 for graph example #75

Closed matthew-brett closed 8 years ago

matthew-brett commented 8 years ago

We are getting failures most of the time for the tests on examples/inference_algs.py - e.g. https://travis-ci.org/BIDS/datarray/jobs/136040496#L369 .

The test same test always passes on Python 2.

On Python 3, about 4 in 5 runs of python examples/inference_algs.py generates an error, with a few different wrong answers.

The problem is that:

margs3, lik3 = calc_marginals_sumproduct(cpts,evidence)

generates different results on different runs. Specifically, this call always generates the same value for lik3 within a single Python session, but a different value in different sessions.

Playing with the script, I suspect this is because the graph the script is operating on has a different sequence in different runs, and I believe the depth-first search will differ in that case. I suspect this is a dictionary ordering problem, but I haven't tracked it down.

@terhorst - I think this is your code - do you have any idea what is going on?

terhorst commented 8 years ago

No idea, sorry. (I can barely remember working on this).

matthew-brett commented 8 years ago

Ah - well - the algorithm does seem to depend on dictionary ordering. I think this is because networkx uses the dictionary node ordering for the dfs_tree algorithm when the start nodes are not specified, and the output differs with different ordering. So I believe this is a bug, in that it was coincidence that the code worked for Python 2.7, and the node order needs to be specified on the call to dfs_tree here.

And in fact I can reproduce the errors on Python 2.7 with something like PYTHONHASHSEED=42 python examples/inference_algs.py.

Unfortunately I don't understand the algorithm, so I don't know what the right nodes are for the dfs_tree search.

matthew-brett commented 8 years ago

Exploring further - I don't think it is the depth search, I think it is simply that this block:

    for node in G.nodes():
        potential = multiply_potentials(*[messages[(src,node)] for src in G.neighbors(node)])
        marginals[node] = normalize(potential)

    return marginals, potential.sum()

is iterating over the elements of an underlying dictionary (G.nodes()), so potential.sum() at the end returns some value effectively randomly chosen from all the nodes in the dictionary. I think that we actually want:

    for node in G.nodes():
        potential = multiply_potentials(*[messages[(src,node)] for src in G.neighbors(node)])
        marginals[node] = normalize(potential)
        potentials[node] = potential

    return marginals, potentials[target_node].sum()

where target_node = 'burglary'. But I'm happy to be corrected.

matthew-brett commented 8 years ago

PR to fix at #76

matthew-brett commented 8 years ago

Closed by #76.