digicatapult / fornax

Approximate fuzzy subgraph matching in polynomial time
Apache License 2.0

Enable 64bit node identifiers #15

Closed Dan-Staff closed 5 years ago

Dan-Staff commented 5 years ago

The key problem here is creating numpy arrays like this:

np.array([
    query_result.v,
    query_result.u,
    query_result.vv,
    query_result.uu,
    cost
])

numpy arrays are homogeneous, so all of the elements in this array will be upcast to a floating point representation.

If, however, any of the elements are int64 values (such as query_result.v, which is a list of node ids), there will be a loss of precision, because float64 represents every integer exactly only up to 2^53.

Until now node ids have been limited to int32 so this wasn't a problem.
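
A minimal sketch of the upcast described above. The ids and costs are made-up values for illustration, not data from fornax, and plain numpy arrays stand in for the query_result columns:

```python
import numpy as np

# Hypothetical node ids near the top of the int64 range (illustrative only).
ids = np.array([2**62 + 1, 2**62 + 3], dtype=np.int64)
cost = np.array([0.5, 0.25], dtype=np.float64)

# Stacking an int64 row with a float64 row gives one homogeneous float64 array.
stacked = np.array([ids, cost])
print(stacked.dtype)

# float64 has a 53-bit significand, so the large ids do not round-trip.
round_tripped = stacked[0].astype(np.int64)
ids_survive = bool(np.array_equal(round_tripped, ids))
print(ids_survive)

# Ids that fit in int32 are far below 2**53 and survive the same upcast.
small_ids = np.array([2**31 - 1, 2**31 - 2], dtype=np.int64)
small_round_tripped = np.array([small_ids, cost])[0].astype(np.int64)
small_survive = bool(np.array_equal(small_round_tripped, small_ids))
print(small_survive)
```

Both large ids collapse onto the same float64 value here, so distinct nodes would become indistinguishable after the upcast.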

To solve this issue I construct the arrays directly as record arrays (np.recarray), which allow mixed datatypes. E.g.:

np.rec.fromarrays(
    [grouped.v, grouped.u, grouped.vv, grouped.cost],
    # pair each column name with its dtype, e.g. ('v', np.int64)
    dtype=list(zip(
        PartialMatchingCosts.columns,
        PartialMatchingCosts.types
    ))
).view(PartialMatchingCosts)

Then I create a view with the correct subclass of recarray (we could remove these subclasses, but that's a discussion for another pull request).
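
A self-contained sketch of the same pattern, for anyone who wants to try it outside the codebase. The `Costs` class is a hypothetical stand-in for PartialMatchingCosts, and its column names, dtypes and sample values are assumptions, not the real fornax definitions:

```python
import numpy as np


class Costs(np.recarray):
    """Hypothetical stand-in for PartialMatchingCosts (assumed layout)."""
    columns = ["v", "u", "vv", "cost"]
    types = [np.int64, np.int64, np.int64, np.float64]


# Made-up columns; the ids are deliberately larger than 2**53.
v = np.array([2**62 + 1, 2**62 + 3], dtype=np.int64)
u = np.array([1, 2], dtype=np.int64)
vv = np.array([3, 4], dtype=np.int64)
cost = np.array([0.5, 0.25], dtype=np.float64)

# Each field keeps its own dtype, so the ids are never converted to float.
records = np.rec.fromarrays(
    [v, u, vv, cost],
    dtype=list(zip(Costs.columns, Costs.types)),
).view(Costs)

print(records.dtype)
print(records.v)
```

Because every field has its own dtype, `records.v` stays int64 and the ids compare equal to the originals, while `records.cost` remains float64.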

If you check out just the first commit of this pull request (the one that allows ids to be hashed into the full range of int64), you will see the tests fail due to loss of precision.