Closed martintoreilly closed 7 years ago
Trying alternative approach to handle fixed width histograms (i.e. nearest neighbour smoothing). Idea is to refactor dhist
class to create an interpolating ecdf using "approxfun" with "constant" (no smoothing) or "linear" with knots at x +/- bin half-width (nearest-neighbour smoothing)
TODO: Amend net_emd
and godd
to allow the user to choose whether to smooth histograms and set the smoothing bin width
The new method for calculating the area between ECMFs (area_between_dhist_ecmfs
) gives very incorrect output (NetEMDs in the tens of millions rather than between 0 and 1).
dhist_ecmf
method applied the smoothing window. This is now scaled correctly when the histogram is normalised to unit variance (see commit 0657320)area_between_dhist_ecmfs
method to use a much more conceptually and mathematically simpler approach to calculating areas between smoothed ECMFs. (see commit a5a9308). We used to try and calculate the absolute area between the two ECMFs directly for each segment between the combined set of knots. The approach to do this depended on whether the ECMF line segments formed a "trapezium", "bowtie" or "triangle", and we were suffering numerical instability issues with the maths and/or making errors in it. Now we simply calculate the area under each segment of the piece-wise linear function defined by abs(ecmf1 - ecmf2)
at the combined set of knots. This simply requires using the same trapezium formula for each segment. It's much less likely this code is giving incorrect outputs. The area calculation is also giving the right answer for some test cases comparing histograms shifted against themselves.However, the smoothed NetEMDs from the R-library still do not match those from the pre-existing Python code (tested with orbits of graphlets up to 4 nodes).
On further investigation, integrating the piece-wise linear function abs(ecmf1 - ecmf2)
at the combined set of knots is insufficient. For this approach to give the correct answer for "bowtie" segments where the original ecmf1
and ecmf2
functions cross, we need to consider the value of abs(ecmf1-ecmf2)
at the "bowtie" cross-over points
I am now pretty confident that the R-code is giving the correct answer.
Closed following confirmation that R-code calculated areas for both smoothed and unsmoothed ECMFs closely match hand-measured areas for both integer and non-integer examples (see earlier comment for details). Further investigation of the differences between the Python and R code outputs will be carried out under issue #21.
Goal
Add option to map orbit counts to a continuous distribution as the existing Python code used to generate the paper data does. This will also require amending how the EMD is calculated using the difference of cumulative distribution methods.
Issue arose out of investigation of issue #12.
Acceptance criteria
Notes
Will require amendments to the following functions
net_emd
emd_cs
Solution chosen is to create a new set of functions to allow:
We have actually chosen to create interpolating Empirical Cumulative Mass Functions. These are more generic as EMD is defined for arbitrary histograms, not just ones with normalised mass, and the cumulative function trick requires mass to be accumulated rather than density.