alan-turing-institute / network-comparison

An R package implementing the NetEMD and NetDis network comparison measures
MIT License
14 stars 3 forks source link

Verify that correct NetEMD is generated for example networks (e.g. Random and PPI) #12

Open martintoreilly opened 7 years ago

martintoreilly commented 7 years ago

Goal

To have confidence that the NetEMD implementation in the whole is producing the same results as the code the research team have been using to generate the NetEMD paper results.

Acceptance criteria

martintoreilly commented 7 years ago

I used both the existing Python code and the current R code (commit df8a9f6, snapshotted as branch r-v-python-comparison-1) to generate NetEMDs between all pairings of 5,000 node / 20 density graphs in the package inst/extdata/random folder at this commit (i.e. matching filename pattern .*5000.*). These are the graphs from the RG1-3 dataset with filename pattern Type_5000_20_n (with n in range 1-3), and these have approximately 5,000 nodes and 50,000 edges.

The NetEMDs produced by the two codebases are very different for some graph pairs. See r-v-python-comparison-1.xlsx for the 4 and 5 node graphlet NetEMD scores from both code bases.

I've started investigating further using the following approach. However, I am on leave from Fri 03 - Mon 06 Feb inclusive, so will not have completed this by the time Gesine and I meet on Tuesday 07 Feb.

martintoreilly commented 7 years ago

Amended R code to also permit export of the individual minimal EMDs and offsets for each graphlet orbit degree distribution alongside the NetEMD for each pair of graphs. See below for outputs of code for graphlet orbits up to 4 and 5 nodes. _table_ files allow easy comparison with output of Python code, _list_ files also include the individual graphlet orbit minimal EMD and offset data.

net_emds_list_node5_optimise.txt net_emds_table_node5_optimise.txt net_emds_list_node4_optimise.txt net_emds_table_node4_optimise.txt

See below for code used to generate this data.

library("netdist")
godd_type = "node5" # 'node4' or 'node5'
net_emd_method = "optimise"
net_emd_return_details <- TRUE
random_edges <- read_all_graphs_as_orca_edge_lists(
  system.file(package = "netdist", "extdata", "random"),
  format = "ncol", pattern = ".*5000.*")
num_networks <- length(random_edges)
num_features <- switch(EXPR = godd_type, 
       node4 = 15,
       node5 = 73,
       stop("godd_type not recognised"))
print(length(random_edges))
print("ORCA")
system.time(random_godd <- purrr::map(random_edges, godd, type = godd_type))
comp_spec <- graph_cross_comparison_spec(random_edges)
num_specs <- dim(comp_spec)[1]
print(num_specs)
print("NetEMD")
system.time(out <- purrr::simplify(
  purrr::map2(comp_spec$index_a, comp_spec$index_b, function(index_a, index_b) {
  net_emd(random_godd[[index_a]], random_godd[[index_b]], method = net_emd_method, return_details = net_emd_return_details)
})))
if(net_emd_return_details) {
  net_emds <- purrr::simplify(purrr::map(out, ~.$net_emd))
  min_emds <- matrix(purrr::simplify(purrr::map(out, ~.$min_emds)), ncol = num_features, byrow = TRUE)
  colnames(min_emds) <- purrr::simplify(purrr::map(1:num_features, ~paste("MinEMD_O", .-1, sep = "")))
  min_offsets <- matrix(purrr::simplify(purrr::map(out, ~.$min_offsets)), ncol = num_features, byrow = TRUE)
  colnames(min_offsets) <- purrr::simplify(purrr::map(1:num_features, ~paste("MinOffsets_O", .-1, sep = "")))
} else {
  net_emds <- out
}

# Output results
# net_emd cross-comparison table
net_emd_matrix <- matrix(data = 0, nrow = num_networks, ncol = num_networks);
net_emd_matrix[cbind(comp_spec$index_a, comp_spec$index_b)] <- net_emds
net_emd_matrix[cbind(comp_spec$index_b, comp_spec$index_a)] <- net_emds
rownames(net_emd_matrix) <- attr(random_edges, "names")
colnames(net_emd_matrix) <- attr(random_edges, "names")
fpath <- paste("~/net_emds_table_",godd_type, "_", net_emd_method, ".txt", sep = "")
write.table(net_emd_matrix, file = fpath, sep = "\t", row.names = TRUE, col.names = NA)
# net_emds list (with min_emd and offset details if calculated)
net_emd_list <- cbind(comp_spec, net_emds)
if(net_emd_return_details) {
  net_emd_list <- cbind(net_emd_list, min_emds, min_offsets)
}
fpath <- paste("~/net_emds_list_",godd_type, "_", net_emd_method, ".txt", sep = "")
write.table(net_emd_list, file = fpath, sep = "\t", row.names = FALSE, col.names = TRUE)
print("Done")
martintoreilly commented 7 years ago

Differences between Python code and R code

Conversion from orbit counts at each node to distribution of node density at each unique count

Calculating variance

martintoreilly commented 7 years ago

After reviewing Python code with Anatol, the expectation is that the discrete vs continuous treatment is likely to be the cause of the differences in NetEMD output. Next step is to add option to R code to do same mapping to a continuous distribution as the Python code does and re-compare the outputs. Opened issue #20 to cover this work.

A useful comparison point is orbit O3 when comparing BA_5000_20_1 and DD1_5000_20_1 networks. Rescaled distributions agree in location and mass to 4 significant figures, but R code gives minimum EMD of 0.1652 at offset 0.1268, while Python code gives minimum EMD of 0.8196 at offset 1.805.

Python code

import pyximport
pyximport.install(reload_support=True)
import cyemdORB4 as cy
from scipy.optimize import minimize_scalar

orb = 3
counts1 = cy.readcount("/Users/moreilly/random_n=5000_d=20/BA_5000_20_1.countsO4",orb)
dis1 = cy.dis(counts1)
dis1_rescaled = cy.rescale(dis1)
dis1_cum = cy.cum(dis1_rescaled)

counts2 = cy.readcount("/Users/moreilly/random_n=5000_d=20/DD1_5000_20_1.countsO4",orb)
dis2 = cy.dis(counts2)
dis2_rescaled = cy.rescale(dis2)
dis2_cum = cy.cum(dis2_rescaled)

k,o = cy.KSAPF(dis1_cum, dis2_cum)
k,o

R code

# Helper functions
dhist_cumulative_mass <- function(dhist) {
  # Ensure histogram locations and masses are in ascending order of location
  sorted_indexes <- sort(dhist$locations, decreasing = FALSE, index.return = TRUE)$ix
  dhist$masses <- dhist$masses[sorted_indexes]
  dhist$locations <- dhist$locations[sorted_indexes]
  # Generate cumulative histogram masses
  dhist$cum_mass <- cumsum(dhist$masses)
  return(dhist)
}
# Load networks of interest
library("netdist")
godd_type = "node4" # 'node4' or 'node5'
base_dir <- system.file(package = "netdist", "extdata", "random")
edges1 <- read_orca_edge_list(file.path(base_dir, "BA_5000_20_1"), format = "ncol")
edges2 <- read_orca_edge_list(file.path(base_dir, "DD1_5000_20_1"), format = "ncol")
print("ORCA")
dhist1 <- godd(edges1, type = godd_type)
dhist2 <- godd(edges2, type = godd_type)

# Mean centre histograms
dis1 <- purrr::map(dhist1, ~normalise_dhist_mass(mean_centre_dhist(.)))
dis2 <- purrr::map(dhist2, ~normalise_dhist_mass(mean_centre_dhist(.)))
# Rescale locations so histogram has unit variance
dis1_rescaled <- purrr::map(dis1, ~normalise_dhist_variance(.))
dis2_rescaled <- purrr::map(dis2, ~normalise_dhist_variance(.))
# Generate cumulative histograms
dis1_cum <- purrr::map(dis1_rescaled, ~dhist_cumulative_mass(.))
dis2_cum <- purrr::map(dis2_rescaled, ~dhist_cumulative_mass(.))

# Optimise EMD
emd_offset <- function(offset) {
    emd(shift_dhist(dhist1, offset), dhist2)
  }
net_emd(dis1_cum$O3, dis2_cum$O3, return_details = TRUE)
martintoreilly commented 7 years ago

The initial investigation into the differences in output between the Python and R codes began in this issue as the outputs of both codes were checked as a first step to verifying that the R NetEMD code gives the correct output. In this initial investigation, it was identified that the Python code does "nearest-neighbour" smoothing by smoothing the mass out across a bin of width 1 around each discrete histogram location prior to rescaling.

However, this smoothing functionality has been added to the R-code and a reasonable set of initial tests passed. Therefore this issue has been opened to further investigate the remaining differences between the outputs of the Python and R codes.

Further investigation of the differences between the Python and R code outputs will be carried out under issue #21. However, this issue will remain open until we are confident that the R code gives the correct NetEMD output for the example networks.