alan-turing-institute / network-comparison

An R package implementing the NetEMD and NetDis network comparison measures
MIT License
14 stars 3 forks source link

Add option to map discrete distribution to continuous distribution #20

Closed martintoreilly closed 7 years ago

martintoreilly commented 7 years ago

Goal

Add option to map orbit counts to a continuous distribution as the existing Python code used to generate the paper data does. This will also require amending how the EMD is calculated using the difference of cumulative distribution methods.

Issue arose out of investigation of issue #12.

Acceptance criteria

Notes

Will require amendments to the following functions

Solution chosen is to create a new set of functions to allow:

We have actually chosen to create interpolating Empirical Cumulative Mass Functions. These are more generic as EMD is defined for arbitrary histograms, not just ones with normalised mass, and the cumulative function trick requires mass to be accumulated rather than density.

martintoreilly commented 7 years ago

Trying alternative approach to handle fixed width histograms (i.e. nearest neighbour smoothing). Idea is to refactor dhist class to create an interpolating ecdf using "approxfun" with "constant" (no smoothing) or "linear" with knots at x +/- bin half-width (nearest-neighbour smoothing)

martintoreilly commented 7 years ago

TODO: Amend net_emd and godd to allow the user to choose whether to smooth histograms and set the smoothing bin width

martintoreilly commented 7 years ago

The new method for calculating the area between ECMFs (area_between_dhist_ecmfs) gives very incorrect output (NetEMDs in the tens of millions rather than between 0 and 1).

martintoreilly commented 7 years ago

However, the smoothed NetEMDs from the R-library still do not match those from the pre-existing Python code (tested with orbits of graphlets up to 4 nodes).

martintoreilly commented 7 years ago

On further investigation, integrating the piece-wise linear function abs(ecmf1 - ecmf2) at the combined set of knots is insufficient. For this approach to give the correct answer for "bowtie" segments where the original ecmf1 and ecmf2 functions cross, we need to consider the value of abs(ecmf1-ecmf2) at the "bowtie" cross-over points

martintoreilly commented 7 years ago

I am now pretty confident that the R-code is giving the correct answer.

martintoreilly commented 7 years ago

Closed following confirmation that R-code calculated areas for both smoothed and unsmoothed ECMFs closely match hand-measured areas for both integer and non-integer examples (see earlier comment for details). Further investigation of the differences between the Python and R code outputs will be carried out under issue #21.