Verify that correct NetEMD is generated for example networks (e.g. Random and PPI)

martintoreilly commented 7 years ago

Goal

To have confidence that the NetEMD implementation in the whole is producing the same results as the code the research team have been using to generate the NetEMD paper results.

Acceptance criteria

[ ] NetEMDs generated by R package are equal to the NetEMDs generated by existing C/C++ based Python code to within acceptable tolerances (e.g. 1e-12).
- [x] For graphlets up to 4 nodes
- [ ] For graphlets up to 5 nodes
[ ] Any initial discrepancies are fully understood and changes made to the incorrect implementation.
- [ ] For graphlets up to 4 nodes
- [ ] For graphlets up to 5 nodes

martintoreilly commented 7 years ago

I used both the existing Python code and the current R code (commit df8a9f6, snapshotted as branch r-v-python-comparison-1) to generate NetEMDs between all pairings of 5,000 node / 20 density graphs in the package inst/extdata/random folder at this commit (i.e. matching filename pattern .*5000.*). These are the graphs from the RG1-3 dataset with filename pattern Type_5000_20_n (with n in range 1-3), and these have approximately 5,000 nodes and 50,000 edges.

The NetEMDs produced by the two codebases are very different for some graph pairs. See r-v-python-comparison-1.xlsx for the 4 and 5 node graphlet NetEMD scores from both code bases.

I've started investigating further using the following approach. However, I am on leave from Fri 03 - Mon 06 Feb inclusive, so will not have completed this by the time Gesine and I meet on Tuesday 07 Feb.

Check that the Python and R uses of ORCA give the same orbit counts
Amend each code base to export not just the NetEMD for each pair of networks, but also the list of minimum EMDs and offsets for each graphlet orbit degree distribution.
Use this additional data to better understand where the source of the differences is. See below for my starter for 10 questions.
- Are the per-orbit minimum EMDs the same?
- If the minimum EMDs are the same, look at the way the NetEMD is calculated from the set of per-orbit minimum EMDs
- Are the minimum EMD offsets the same?
- If the offsets are the same, look at the way per-orbit EMD is calculated in each package
- If the offsets are different, do the packages agree on the EMD at both offsets?
- If the packages don't agree on the EMD at any particular offset, look at the way EMD is calculated in each package
- If the packages agree on the EMD at any particular offset, look at the optimisation method used by each package

martintoreilly commented 7 years ago

Amended R code to also permit export of the individual minimal EMDs and offsets for each graphlet orbit degree distribution alongside the NetEMD for each pair of graphs. See below for outputs of code for graphlet orbits up to 4 and 5 nodes. _table_ files allow easy comparison with output of Python code, _list_ files also include the individual graphlet orbit minimal EMD and offset data.

net_emds_list_node5_optimise.txt net_emds_table_node5_optimise.txt net_emds_list_node4_optimise.txt net_emds_table_node4_optimise.txt

See below for code used to generate this data.

library("netdist")
godd_type = "node5" # 'node4' or 'node5'
net_emd_method = "optimise"
net_emd_return_details <- TRUE
random_edges <- read_all_graphs_as_orca_edge_lists(
  system.file(package = "netdist", "extdata", "random"),
  format = "ncol", pattern = ".*5000.*")
num_networks <- length(random_edges)
num_features <- switch(EXPR = godd_type, 
       node4 = 15,
       node5 = 73,
       stop("godd_type not recognised"))
print(length(random_edges))
print("ORCA")
system.time(random_godd <- purrr::map(random_edges, godd, type = godd_type))
comp_spec <- graph_cross_comparison_spec(random_edges)
num_specs <- dim(comp_spec)[1]
print(num_specs)
print("NetEMD")
system.time(out <- purrr::simplify(
  purrr::map2(comp_spec$index_a, comp_spec$index_b, function(index_a, index_b) {
  net_emd(random_godd[[index_a]], random_godd[[index_b]], method = net_emd_method, return_details = net_emd_return_details)
})))
if(net_emd_return_details) {
  net_emds <- purrr::simplify(purrr::map(out, ~.$net_emd))
  min_emds <- matrix(purrr::simplify(purrr::map(out, ~.$min_emds)), ncol = num_features, byrow = TRUE)
  colnames(min_emds) <- purrr::simplify(purrr::map(1:num_features, ~paste("MinEMD_O", .-1, sep = "")))
  min_offsets <- matrix(purrr::simplify(purrr::map(out, ~.$min_offsets)), ncol = num_features, byrow = TRUE)
  colnames(min_offsets) <- purrr::simplify(purrr::map(1:num_features, ~paste("MinOffsets_O", .-1, sep = "")))
} else {
  net_emds <- out
}

# Output results
# net_emd cross-comparison table
net_emd_matrix <- matrix(data = 0, nrow = num_networks, ncol = num_networks);
net_emd_matrix[cbind(comp_spec$index_a, comp_spec$index_b)] <- net_emds
net_emd_matrix[cbind(comp_spec$index_b, comp_spec$index_a)] <- net_emds
rownames(net_emd_matrix) <- attr(random_edges, "names")
colnames(net_emd_matrix) <- attr(random_edges, "names")
fpath <- paste("~/net_emds_table_",godd_type, "_", net_emd_method, ".txt", sep = "")
write.table(net_emd_matrix, file = fpath, sep = "\t", row.names = TRUE, col.names = NA)
# net_emds list (with min_emd and offset details if calculated)
net_emd_list <- cbind(comp_spec, net_emds)
if(net_emd_return_details) {
  net_emd_list <- cbind(net_emd_list, min_emds, min_offsets)
}
fpath <- paste("~/net_emds_list_",godd_type, "_", net_emd_method, ".txt", sep = "")
write.table(net_emd_list, file = fpath, sep = "\t", row.names = FALSE, col.names = TRUE)
print("Done")

martintoreilly commented 7 years ago

Differences between Python code and R code

Conversion from orbit counts at each node to distribution of node density at each unique count

Python:
- Centres distribution locations around mean location and makes each location have a "bin" width of 1, rather than setting it as a point location (i.e. "bin" edges set to count - mean_count and count+1 - mean_count)
- Normalises bin heights to density rather than counts (i.e. "bin" heights set to proportion of nodes with each orbit count rather than count of nodes with each orbit count)
R:
- Uses point locations (i.e. "bins" have no width) and does not centre locations on mean
- Does not normalise counts to density (i.e. "bin" heights set to number of nodes with each orbit count). However, normalisation of distributions to reflect density rather than counts occurs later in the code that finds minimum EMD between pairs of distributions.

Calculating variance

Python: Uses a representation of the E[X^2] – E[X]^2 formulation, multiplying the density in each bin (i[2]) by the integral of X^2 over the bin ((i[0]**2+i[1]**2+i[0]*i[1])/3.0). The integral formulation is due to the treatment of the distribution as continuous after mapping onto a normalised histogram having bin widths equal to 1. Distributions have already been normalised to unit mass, so the i[2]density incorporates the 1/N terms in the expected values. return sum([(i[0]**2+i[1]**2+i[0]*i[1])*i[2]/3.0 for i in d])-mean(d)**2
R: Uses a representation of the sum((x-mean(x))^2)/N formulation. Treats the distribution as discrete, so assigns all mass to a single location rather than across a bin of width 1. Also assumes distribution is not normalised to unit mass (i.e. represents density), so includes division by total mass of distribution to incorporate the 1/N term in the mean and variance calculations.
```
histogram_variance <- function(bin_masses, bin_centres) {
mean_centre <- sum(bin_masses * bin_centres) / sum(bin_masses)
variance <- sum(bin_masses * (bin_centres - mean_centre)^2) / sum(bin_masses)
return(variance)
}
```
The E[X^2] – E[X]^2 formulation is known to be numerically unstable when the two elements are of similar magnitude. However, in this case, replacing the variance calculation with the E[(x-E[X])^2) formulation does not change any of the NetEMDs calculated (they are identical to 8 decimal places). Python E[(x-E[X])^2) formulation is:
```
m = mean(d)
return sum([i[2]*((i[1]-m)**2+(i[0]-m)**2+(i[1]-m)*(i[0]-m))/3.0 for i in d])
```

martintoreilly commented 7 years ago

After reviewing Python code with Anatol, the expectation is that the discrete vs continuous treatment is likely to be the cause of the differences in NetEMD output. Next step is to add option to R code to do same mapping to a continuous distribution as the Python code does and re-compare the outputs. Opened issue #20 to cover this work.

A useful comparison point is orbit O3 when comparing BA_5000_20_1 and DD1_5000_20_1 networks. Rescaled distributions agree in location and mass to 4 significant figures, but R code gives minimum EMD of 0.1652 at offset 0.1268, while Python code gives minimum EMD of 0.8196 at offset 1.805.

Python code

import pyximport
pyximport.install(reload_support=True)
import cyemdORB4 as cy
from scipy.optimize import minimize_scalar

orb = 3
counts1 = cy.readcount("/Users/moreilly/random_n=5000_d=20/BA_5000_20_1.countsO4",orb)
dis1 = cy.dis(counts1)
dis1_rescaled = cy.rescale(dis1)
dis1_cum = cy.cum(dis1_rescaled)

counts2 = cy.readcount("/Users/moreilly/random_n=5000_d=20/DD1_5000_20_1.countsO4",orb)
dis2 = cy.dis(counts2)
dis2_rescaled = cy.rescale(dis2)
dis2_cum = cy.cum(dis2_rescaled)

k,o = cy.KSAPF(dis1_cum, dis2_cum)
k,o

R code

# Helper functions
dhist_cumulative_mass <- function(dhist) {
  # Ensure histogram locations and masses are in ascending order of location
  sorted_indexes <- sort(dhist$locations, decreasing = FALSE, index.return = TRUE)$ix
  dhist$masses <- dhist$masses[sorted_indexes]
  dhist$locations <- dhist$locations[sorted_indexes]
  # Generate cumulative histogram masses
  dhist$cum_mass <- cumsum(dhist$masses)
  return(dhist)
}
# Load networks of interest
library("netdist")
godd_type = "node4" # 'node4' or 'node5'
base_dir <- system.file(package = "netdist", "extdata", "random")
edges1 <- read_orca_edge_list(file.path(base_dir, "BA_5000_20_1"), format = "ncol")
edges2 <- read_orca_edge_list(file.path(base_dir, "DD1_5000_20_1"), format = "ncol")
print("ORCA")
dhist1 <- godd(edges1, type = godd_type)
dhist2 <- godd(edges2, type = godd_type)

# Mean centre histograms
dis1 <- purrr::map(dhist1, ~normalise_dhist_mass(mean_centre_dhist(.)))
dis2 <- purrr::map(dhist2, ~normalise_dhist_mass(mean_centre_dhist(.)))
# Rescale locations so histogram has unit variance
dis1_rescaled <- purrr::map(dis1, ~normalise_dhist_variance(.))
dis2_rescaled <- purrr::map(dis2, ~normalise_dhist_variance(.))
# Generate cumulative histograms
dis1_cum <- purrr::map(dis1_rescaled, ~dhist_cumulative_mass(.))
dis2_cum <- purrr::map(dis2_rescaled, ~dhist_cumulative_mass(.))

# Optimise EMD
emd_offset <- function(offset) {
    emd(shift_dhist(dhist1, offset), dhist2)
  }
net_emd(dis1_cum$O3, dis2_cum$O3, return_details = TRUE)

martintoreilly commented 7 years ago

The initial investigation into the differences in output between the Python and R codes began in this issue as the outputs of both codes were checked as a first step to verifying that the R NetEMD code gives the correct output. In this initial investigation, it was identified that the Python code does "nearest-neighbour" smoothing by smoothing the mass out across a bin of width 1 around each discrete histogram location prior to rescaling.

However, this smoothing functionality has been added to the R-code and a reasonable set of initial tests passed. Therefore this issue has been opened to further investigate the remaining differences between the outputs of the Python and R codes.

Further investigation of the differences between the Python and R code outputs will be carried out under issue #21. However, this issue will remain open until we are confident that the R code gives the correct NetEMD output for the example networks.

alan-turing-institute / network-comparison