CWTSLeiden / networkanalysis

Java package that provides data structures and algorithms for network analysis.
MIT License
145 stars 33 forks source link

Added 64-bits array capabilities #12

Closed vtraag closed 1 year ago

vtraag commented 4 years ago

This PR adds 64-bits capabilities to the networkanalysis package because natives arrays in Java are unfortunately limited to 32-bits.

There are four new classes added to the nl.cwts.util package:

These classes enable 64-bits indices (i.e. longs) which are translated to two ints (using bitwise operators) that index an internal array of arrays. The first array index is called the segment and the second array index is called the offset. The value for a provided long index is then located in values[getSegment(index)][getOffset(index)].

Additionally, these classes replace the old DynamicIntArray and DynamicDoubleArray, since these also need 64-bits support. Hence, all the 64-bits arrays are now also dynamic and can be resized (see ensureCapacity, shrink and resize) and values can be added/removed dynamically (see append, push and pop).

Additionally, these classes needed support for some other functionalities, which are also all implemented. There is one functionality (sorting) for which I thought it would be more straightforward to use an existing implementation (it.unimi.dsi.fastutil.BigArrays). Essentially this uses a similar setup in the background (array of arrays). The newly created classes are essentially similar to the BigArrays implementation in fastutil but provide more convenience functions, so that it is easier to translate the rest of the package to be able to use these 64-bits capabilities. I added the fastutil dependency to the build.gradle file, and it is now automatically bundled in the resulting jar file.

All necessary modifications to all other classes are also made. This essentially entails changing array[i] notations into array.get(i) or array.set(i, ...) notations.

Sometimes there is a more convenient form of iteration possible that using a for-each idiom, for(double x : array) which can be substantially easier to understand. Additionally, it is faster than iterating over long indices that have to be separated into two separate ints each time. This allows to change loops of the kind

for (k = network.firstNeighborIndices[i]; k < network.firstNeighborIndices[i + 1]; k++)
{
    v = network.neighbor.get(k)
    ...
}

by the simpler and more elegant construct

for (int v : network.neighbor(i))

without a performance hit, and even a performance increase.

Unfortunately, in many places these constructions are not immediately useful, because we need to iterate over both the neighbors and the edgeWeights simultaneously, which is not possible in this idiom in Java (at least not in an elegant manner). Hence, except for one construct I now left all for loops as they were.

In addition, I also added some similar convenience iterators in the Network class, allowing to iterate over neighbors, edgeWeights or incidentEdges easily. There is the possibility to replace many loops of the kind of

for (k = network.firstNeighborIndices[i]; k < network.firstNeighborIndices[i + 1]; k++)

by

for (long k : network.incidentEdges(i))

which is substantially easier to understand. This will come at a slight performance hit probably, but I haven't tested this change explicitly. At any rate, I left these for loops as they were originally, and we might look into these possibilities at some later time.

Finally, I included unit tests for all these new classes, including tests for all methods. Additionally, I added a simply unit test for the LeidenAlgorithm and a check to see if the aggregation and quality calculations are consistent. These are automatically run when calling ./gradlew build. I have also verified the new classes with sizes of twice the maximum size of individual arrays (which is a total size of 2 147 483 648, exceeding the maximum length of a 32-bits arrays by a few elements). This revealed one remaining problem in the tests themselves (i.e. not in the code, but some test was incorrect), which I am re-testing now, but I do not expect any problems anymore.

I tested this implementation against an example graph . The original implemetation took 40s, whereas the new implementation took 51s on my laptop. This performance decrease is of course a pity, but without investing substantially more time in rewriting aspects of the code, it is unlikely that we are able to improve the performance substantially. But suggestions are of course welcome.

Comments are welcome!

vtraag commented 3 years ago

I went through all new classes to conform to the code style used throughout this package.

vtraag commented 3 years ago

I rebased the PR against the master branch so that there are no merge conflicts and force-pushed to this PR. Apparently the GitHub actions do not run in the case of merge conflicts.

neesjanvaneck commented 1 year ago

See 15cf64289afd05ecc2da744cefcdf4bbb63f48f5, 94c72d274685c0d2a83446ff8b9bc52885fddaab, 472d10b274fc9124e812d1773bdbf945fc7588c5, and f83fe56df4359e16e9c26e5588b81ba389a4b513 for some bug fixes and enhancements.