Add option to map discrete distribution to continuous distribution

alan-turing-institute / network-comparison

An R package implementing the NetEMD and NetDis network comparison measures

MIT License

Add option to map discrete distribution to continuous distribution #20

Closed martintoreilly closed 7 years ago

martintoreilly commented 7 years ago

Goal

Add an option to map orbit counts to a continuous distribution, as the existing Python code used to generate the paper data does. This will also require amending how the EMD is calculated using the difference-of-cumulative-distributions method.

Issue arose out of investigation of issue #12.

Acceptance criteria

[x] Allow user to specify a fixed width to use for "top hat" smoothing that redistributes the mass at each discrete location into a fixed width bin centred on the location.
[ ] Correctly calculate the NetEMD for smoothed histograms (see issue #12 for comparison)

Notes

Will require amendments to the following functions

[x] net_emd
[x] emd_cs

Solution chosen is to create a new set of functions to allow:

[x] The creation of Empirical Cumulative Distribution Functions (ECDFs) that interpolate either piecewise constantly between locations (no smoothing) or piecewise linearly between smoothing bin edges (with smoothing)
[x] The creation of a method to calculate the area between smoothed ECDFs

We have actually chosen to create interpolating Empirical Cumulative Mass Functions. These are more generic as EMD is defined for arbitrary histograms, not just ones with normalised mass, and the cumulative function trick requires mass to be accumulated rather than density.
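As a rough illustration of the two interpolation modes, here is a minimal Python sketch of an interpolating ECMF. It is not the package's R implementation: the function name `make_ecmf` and its parameters are hypothetical. The smoothed case is written as a sum of per-location ramps, which gives the same piecewise linear function with knots at the bin edges.

```python
import numpy as np


def make_ecmf(locations, masses, smoothing_width=0.0):
    """Return a function x -> cumulative mass at x (hypothetical helper).

    smoothing_width == 0: piecewise constant steps at each location.
    smoothing_width > 0:  each mass is spread uniformly over a bin of that
                          width centred on its location ("top hat"), so the
                          cumulative mass ramps linearly across each bin.
    """
    locs = np.asarray(locations, dtype=float)
    mass = np.asarray(masses, dtype=float)

    if smoothing_width == 0:
        def ecmf(x):
            x = np.asarray(x, dtype=float)
            # A location contributes its full mass once x has reached it.
            reached = (x[..., None] >= locs).astype(float)
            return reached @ mass
        return ecmf

    half = smoothing_width / 2.0

    def ecmf(x):
        x = np.asarray(x, dtype=float)
        # Fraction of each bin lying to the left of x, clipped to [0, 1].
        frac = np.clip((x[..., None] - (locs - half)) / smoothing_width, 0.0, 1.0)
        return frac @ mass
    return ecmf


step = make_ecmf([0, 1, 2], [1, 2, 1])
ramp = make_ecmf([0, 1, 2], [1, 2, 1], smoothing_width=0.5)
print(step(0.5), ramp(0.0))
```

Because the mass is accumulated rather than normalised, the final value of the function is the total mass of the histogram, not necessarily 1.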

martintoreilly commented 7 years ago

Trying an alternative approach to handling fixed-width histograms (i.e. nearest-neighbour smoothing). The idea is to refactor the dhist class to create an interpolating ECDF using "approxfun", either with "constant" interpolation (no smoothing) or with "linear" interpolation and knots at x +/- bin half-width (nearest-neighbour smoothing).

martintoreilly commented 7 years ago

Added method to build interpolating empirical cumulative mass functions (ECMFs) for both discrete histograms and histograms smoothed with a fixed width bin (commit 57195af). This function takes parameters to normalise the histogram's mass (making an ECDF) and its variance (required for the NetEMD calculation).
Added method to calculate the area between two discrete histogram ECMFs (commit 2c87015)
Amended EMD method to use new ECMF functions and take a parameter determining the bin width for smoothing, allowing EMD to be calculated for smoothed histograms (commit fedf104)
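The mass and variance normalisation mentioned above can be sketched in Python as follows. The function name `normalise_dhist` is hypothetical. Two assumptions are made here that the thread does not confirm: the variance is computed from the discrete locations only, and the smoothing width is rescaled by the same factor as the locations.

```python
import numpy as np


def normalise_dhist(locations, masses, smoothing_width=0.0):
    """Scale a discrete histogram to unit mass and unit variance (sketch)."""
    locs = np.asarray(locations, dtype=float)
    mass = np.asarray(masses, dtype=float)

    # Unit mass: the masses now sum to 1, so the ECMF becomes an ECDF.
    mass = mass / mass.sum()

    # Unit variance: divide locations by the mass-weighted standard deviation.
    mean = np.sum(mass * locs)
    sd = np.sqrt(np.sum(mass * (locs - mean) ** 2))
    if sd == 0:
        raise ValueError("cannot normalise a histogram with zero variance")

    # Assumption: the smoothing bin shrinks or grows with the locations.
    return locs / sd, mass, smoothing_width / sd


new_locs, new_mass, new_width = normalise_dhist([0, 4], [3, 1], smoothing_width=1.0)
print(new_locs, new_mass, new_width)
```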

TODO: Amend net_emd and godd to allow the user to choose whether to smooth histograms and set the smoothing bin width

martintoreilly commented 7 years ago

The new method for calculating the area between ECMFs (area_between_dhist_ecmfs) gives wildly incorrect output (NetEMDs in the tens of millions rather than between 0 and 1).

martintoreilly commented 7 years ago

Fixed error in how the dhist_ecmf method applied the smoothing window. This is now scaled correctly when the histogram is normalised to unit variance (see commit 0657320)
Completely rewrote the area_between_dhist_ecmfs method to use a conceptually and mathematically much simpler approach to calculating areas between smoothed ECMFs (see commit a5a9308). We used to try to calculate the absolute area between the two ECMFs directly for each segment between the combined set of knots. How this was done depended on whether the ECMF line segments formed a "trapezium", "bowtie" or "triangle", and we were either suffering numerical instability in the maths or making errors in it. Now we simply calculate the area under each segment of the piece-wise linear function defined by abs(ecmf1 - ecmf2) at the combined set of knots, which only requires applying the same trapezium formula to every segment. This code is much less likely to give incorrect outputs. The area calculation also gives the right answer for some test cases comparing histograms shifted against themselves.
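The knots-only trapezium calculation can be sketched in Python as below. This is an illustration rather than the code from commit a5a9308: the function name `area_between_at_knots` is hypothetical, and each smoothed ECMF is assumed to be given as arrays of knot positions and cumulative masses. The result is exact only when the two ECMFs do not cross between consecutive knots. The example reproduces the "histogram shifted against itself" check, where the expected area is the shift times the total mass.

```python
import numpy as np


def area_between_at_knots(x1, y1, x2, y2):
    """Trapezium rule on abs(ecmf1 - ecmf2) at the combined knots (sketch)."""
    knots = np.union1d(x1, x2)
    # np.interp holds the first/last value outside the knot range, which
    # matches an ECMF being 0 on the left and the total mass on the right.
    diff = np.abs(np.interp(knots, x1, y1) - np.interp(knots, x2, y2))
    widths = np.diff(knots)
    return float(np.sum(widths * (diff[:-1] + diff[1:]) / 2.0))


# Histogram with unit masses at 0, 1, 2 and smoothing width 0.5,
# compared against a copy of itself shifted right by 1.
x = np.array([-0.25, 0.25, 0.75, 1.25, 1.75, 2.25])
y = np.array([0.0, 1.0, 1.0, 2.0, 2.0, 3.0])
shifted_area = area_between_at_knots(x, y, x + 1.0, y)
print(shifted_area)  # shift (1) times total mass (3)
```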

However, the smoothed NetEMDs from the R-library still do not match those from the pre-existing Python code (tested with orbits of graphlets up to 4 nodes).

martintoreilly commented 7 years ago

On further investigation, integrating the piece-wise linear function abs(ecmf1 - ecmf2) at the combined set of knots is insufficient. For this approach to give the correct answer for "bowtie" segments, where the original ecmf1 and ecmf2 functions cross, we also need to consider the value of abs(ecmf1 - ecmf2) at the "bowtie" cross-over points.
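A small worked case shows the size of the error. Take a single segment between consecutive knots at x = 0 and x = 1 on which ecmf1 rises linearly from 0 to 1 while ecmf2 stays flat at 0.5 (values chosen purely for illustration). The two functions cross at x = 0.5, so the knots-only trapezium overestimates the area by a factor of two:

```latex
\begin{aligned}
d(x) &= \mathrm{ecmf}_1(x) - \mathrm{ecmf}_2(x) = x - \tfrac12, \qquad 0 \le x \le 1 \\
\text{knots only:}\quad & \tfrac12\,\bigl(|d(0)| + |d(1)|\bigr)\cdot 1
  = \tfrac12\left(\tfrac12 + \tfrac12\right) = \tfrac12 \\
\text{true area:}\quad & \int_0^1 \left|x - \tfrac12\right|\,dx
  = 2 \cdot \tfrac12 \cdot \tfrac12 \cdot \tfrac12 = \tfrac14
\end{aligned}
```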

martintoreilly commented 7 years ago

ECMF area calculation now uses an implementation of the segment intersection point algorithm from "Computational Geometry in C", J. O'Rourke, 1994, pp 220-226 (http://crtl-i.com/PDF/comp_c.pdf). This is used to identify bowtie segments and their intersection points to feed into a bowtie-specific area calculation (area of the two bowtie triangles). [commit 82f41b8]
Added test with simple hand-crafted histograms with hand-calculated expected areas between ECMFs for smoothed and unsmoothed cases (all locations are integers, all bin edges and masses are on 1/2 integer grid and all intersection points are on 1/4 integer grid). Test histograms include all segment types between ECMFs (triangle, symmetric + asymmetric bowtie, trapezium). Test passes for both smoothed and unsmoothed histograms. [commit 82f41b8]
Added more complex test based on normalising the simple integer test histograms to unit mass and variance. This results in locations and masses at a range of floating point values and two ECMFs that are not on the same "grid" at all. The overlaid ECMFs still have all types of segment (trapeziums, bowties and triangles). Measurements for the expected area between ECMFs were made by hand on printed copies of the normalised ECMFs with a grid having x-spacing of 0.02 and y-spacing of 0.01. The computed area between ECMFs is within 1% of the manually-measured expected area for both smoothed and unsmoothed cases. [commit 9c92d71]
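For smoothed (piecewise linear) ECMFs, the bowtie handling described above can be sketched in Python as follows. This is a simplification and not the code from commit 82f41b8: instead of O'Rourke's general segment intersection algorithm, it detects a crossing from a sign change in ecmf1 - ecmf2 across a segment and sums the two bowtie triangles. The function names are hypothetical, and unsmoothed step ECMFs would need separate handling.

```python
import numpy as np


def segment_area(width, d0, d1):
    """Area between two line segments with end differences d0 and d1."""
    a0, a1 = abs(d0), abs(d1)
    if d0 * d1 >= 0:
        # Trapezium (or triangle when one end difference is zero).
        return width * (a0 + a1) / 2.0
    # Bowtie: the lines cross at fraction t of the way along the segment,
    # splitting it into two triangles with heights a0 and a1.
    t = a0 / (a0 + a1)
    return width * (t * a0 + (1.0 - t) * a1) / 2.0


def area_between_linear_ecmfs(x1, y1, x2, y2):
    """Area between two piecewise linear ECMFs given as knot arrays."""
    knots = np.union1d(x1, x2)
    d = np.interp(knots, x1, y1) - np.interp(knots, x2, y2)
    return float(sum(segment_area(w, a, b)
                     for w, a, b in zip(np.diff(knots), d[:-1], d[1:])))


# ecmf1 ramps 0 -> 1 over [0, 1]; ecmf2 is flat at 0.5 over the same span,
# so the segment [0, 1] is a symmetric bowtie.
bowtie_area = area_between_linear_ecmfs(
    [0.0, 1.0], [0.0, 1.0],
    [-1.0, 0.0, 1.0, 2.0], [0.0, 0.5, 0.5, 1.0],
)
print(bowtie_area)
```

In this example the three segments each contribute 0.25, whereas applying the plain trapezium formula to the middle segment would contribute 0.5.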

I am now pretty confident that the R-code is giving the correct answer.

martintoreilly commented 7 years ago

Closed following confirmation that R-code calculated areas for both smoothed and unsmoothed ECMFs closely match hand-measured areas for both integer and non-integer examples (see earlier comment for details). Further investigation of the differences between the Python and R code outputs will be carried out under issue #21.