Open martintoreilly opened 7 years ago
I used both the existing Python code and the current R code (commit df8a9f6, snapshotted as branch r-v-python-comparison-1
) to generate NetEMDs between all pairings of 5,000 node / 20 density graphs in the package inst/extdata/random
folder at this commit (i.e. matching filename pattern .*5000.*
). These are the graphs from the RG1-3
dataset with filename pattern Type_5000_20_n
(with n
in range 1-3), and these have approximately 5,000 nodes and 50,000 edges.
The NetEMDs produced by the two codebases are very different for some graph pairs. See r-v-python-comparison-1.xlsx for the 4 and 5 node graphlet NetEMD scores from both code bases.
I've started investigating further using the following approach. However, I am on leave from Fri 03 - Mon 06 Feb inclusive, so will not have completed this by the time Gesine and I meet on Tuesday 07 Feb.
Amended R code to also permit export of the individual minimal EMDs and offsets for each graphlet orbit degree distribution alongside the NetEMD for each pair of graphs. See below for outputs of code for graphlet orbits up to 4 and 5 nodes. _table_
files allow easy comparison with output of Python code, _list_
files also include the individual graphlet orbit minimal EMD and offset data.
net_emds_list_node5_optimise.txt net_emds_table_node5_optimise.txt net_emds_list_node4_optimise.txt net_emds_table_node4_optimise.txt
See below for code used to generate this data.
library("netdist")
godd_type = "node5" # 'node4' or 'node5'
net_emd_method = "optimise"
net_emd_return_details <- TRUE
random_edges <- read_all_graphs_as_orca_edge_lists(
system.file(package = "netdist", "extdata", "random"),
format = "ncol", pattern = ".*5000.*")
num_networks <- length(random_edges)
num_features <- switch(EXPR = godd_type,
node4 = 15,
node5 = 73,
stop("godd_type not recognised"))
print(length(random_edges))
print("ORCA")
system.time(random_godd <- purrr::map(random_edges, godd, type = godd_type))
comp_spec <- graph_cross_comparison_spec(random_edges)
num_specs <- dim(comp_spec)[1]
print(num_specs)
print("NetEMD")
system.time(out <- purrr::simplify(
purrr::map2(comp_spec$index_a, comp_spec$index_b, function(index_a, index_b) {
net_emd(random_godd[[index_a]], random_godd[[index_b]], method = net_emd_method, return_details = net_emd_return_details)
})))
if(net_emd_return_details) {
net_emds <- purrr::simplify(purrr::map(out, ~.$net_emd))
min_emds <- matrix(purrr::simplify(purrr::map(out, ~.$min_emds)), ncol = num_features, byrow = TRUE)
colnames(min_emds) <- purrr::simplify(purrr::map(1:num_features, ~paste("MinEMD_O", .-1, sep = "")))
min_offsets <- matrix(purrr::simplify(purrr::map(out, ~.$min_offsets)), ncol = num_features, byrow = TRUE)
colnames(min_offsets) <- purrr::simplify(purrr::map(1:num_features, ~paste("MinOffsets_O", .-1, sep = "")))
} else {
net_emds <- out
}
# Output results
# net_emd cross-comparison table
net_emd_matrix <- matrix(data = 0, nrow = num_networks, ncol = num_networks);
net_emd_matrix[cbind(comp_spec$index_a, comp_spec$index_b)] <- net_emds
net_emd_matrix[cbind(comp_spec$index_b, comp_spec$index_a)] <- net_emds
rownames(net_emd_matrix) <- attr(random_edges, "names")
colnames(net_emd_matrix) <- attr(random_edges, "names")
fpath <- paste("~/net_emds_table_",godd_type, "_", net_emd_method, ".txt", sep = "")
write.table(net_emd_matrix, file = fpath, sep = "\t", row.names = TRUE, col.names = NA)
# net_emds list (with min_emd and offset details if calculated)
net_emd_list <- cbind(comp_spec, net_emds)
if(net_emd_return_details) {
net_emd_list <- cbind(net_emd_list, min_emds, min_offsets)
}
fpath <- paste("~/net_emds_list_",godd_type, "_", net_emd_method, ".txt", sep = "")
write.table(net_emd_list, file = fpath, sep = "\t", row.names = FALSE, col.names = TRUE)
print("Done")
count - mean_count
and count+1 - mean_count
)i[2]
) by the integral of X^2 over the bin ((i[0]**2+i[1]**2+i[0]*i[1])/3.0
). The integral formulation is due to the treatment of the distribution as continuous after mapping onto a normalised histogram having bin widths equal to 1. Distributions have already been normalised to unit mass, so the i[2]
density incorporates the 1/N
terms in the expected values.
return sum([(i[0]**2+i[1]**2+i[0]*i[1])*i[2]/3.0 for i in d])-mean(d)**2
1/N
term in the mean and variance calculations.
histogram_variance <- function(bin_masses, bin_centres) {
mean_centre <- sum(bin_masses * bin_centres) / sum(bin_masses)
variance <- sum(bin_masses * (bin_centres - mean_centre)^2) / sum(bin_masses)
return(variance)
}
The E[X^2] – E[X]^2 formulation is known to be numerically unstable when the two elements are of similar magnitude. However, in this case, replacing the variance calculation with the E[(x-E[X])^2) formulation does not change any of the NetEMDs calculated (they are identical to 8 decimal places). Python E[(x-E[X])^2) formulation is:
m = mean(d)
return sum([i[2]*((i[1]-m)**2+(i[0]-m)**2+(i[1]-m)*(i[0]-m))/3.0 for i in d])
After reviewing Python code with Anatol, the expectation is that the discrete vs continuous treatment is likely to be the cause of the differences in NetEMD output. Next step is to add option to R code to do same mapping to a continuous distribution as the Python code does and re-compare the outputs. Opened issue #20 to cover this work.
A useful comparison point is orbit O3
when comparing BA_5000_20_1
and DD1_5000_20_1
networks. Rescaled distributions agree in location and mass to 4 significant figures, but R code gives minimum EMD of 0.1652 at offset 0.1268, while Python code gives minimum EMD of 0.8196 at offset 1.805.
import pyximport
pyximport.install(reload_support=True)
import cyemdORB4 as cy
from scipy.optimize import minimize_scalar
orb = 3
counts1 = cy.readcount("/Users/moreilly/random_n=5000_d=20/BA_5000_20_1.countsO4",orb)
dis1 = cy.dis(counts1)
dis1_rescaled = cy.rescale(dis1)
dis1_cum = cy.cum(dis1_rescaled)
counts2 = cy.readcount("/Users/moreilly/random_n=5000_d=20/DD1_5000_20_1.countsO4",orb)
dis2 = cy.dis(counts2)
dis2_rescaled = cy.rescale(dis2)
dis2_cum = cy.cum(dis2_rescaled)
k,o = cy.KSAPF(dis1_cum, dis2_cum)
k,o
# Helper functions
dhist_cumulative_mass <- function(dhist) {
# Ensure histogram locations and masses are in ascending order of location
sorted_indexes <- sort(dhist$locations, decreasing = FALSE, index.return = TRUE)$ix
dhist$masses <- dhist$masses[sorted_indexes]
dhist$locations <- dhist$locations[sorted_indexes]
# Generate cumulative histogram masses
dhist$cum_mass <- cumsum(dhist$masses)
return(dhist)
}
# Load networks of interest
library("netdist")
godd_type = "node4" # 'node4' or 'node5'
base_dir <- system.file(package = "netdist", "extdata", "random")
edges1 <- read_orca_edge_list(file.path(base_dir, "BA_5000_20_1"), format = "ncol")
edges2 <- read_orca_edge_list(file.path(base_dir, "DD1_5000_20_1"), format = "ncol")
print("ORCA")
dhist1 <- godd(edges1, type = godd_type)
dhist2 <- godd(edges2, type = godd_type)
# Mean centre histograms
dis1 <- purrr::map(dhist1, ~normalise_dhist_mass(mean_centre_dhist(.)))
dis2 <- purrr::map(dhist2, ~normalise_dhist_mass(mean_centre_dhist(.)))
# Rescale locations so histogram has unit variance
dis1_rescaled <- purrr::map(dis1, ~normalise_dhist_variance(.))
dis2_rescaled <- purrr::map(dis2, ~normalise_dhist_variance(.))
# Generate cumulative histograms
dis1_cum <- purrr::map(dis1_rescaled, ~dhist_cumulative_mass(.))
dis2_cum <- purrr::map(dis2_rescaled, ~dhist_cumulative_mass(.))
# Optimise EMD
emd_offset <- function(offset) {
emd(shift_dhist(dhist1, offset), dhist2)
}
net_emd(dis1_cum$O3, dis2_cum$O3, return_details = TRUE)
The initial investigation into the differences in output between the Python and R codes began in this issue as the outputs of both codes were checked as a first step to verifying that the R NetEMD code gives the correct output. In this initial investigation, it was identified that the Python code does "nearest-neighbour" smoothing by smoothing the mass out across a bin of width 1 around each discrete histogram location prior to rescaling.
However, this smoothing functionality has been added to the R-code and a reasonable set of initial tests passed. Therefore this issue has been opened to further investigate the remaining differences between the outputs of the Python and R codes.
Further investigation of the differences between the Python and R code outputs will be carried out under issue #21. However, this issue will remain open until we are confident that the R code gives the correct NetEMD output for the example networks.
Goal
To have confidence that the NetEMD implementation in the whole is producing the same results as the code the research team have been using to generate the NetEMD paper results.
Acceptance criteria