Total time is the time spent in `min_emd_exhaustive`.
Run | Total | Create ECMF | ECMF area | All dist. | Positive dist. |
---|---|---|---|---|---|
1 | 19,220ms | 6,930ms (36.1%) | 3,020ms (15.7%) | 4,560ms (23.7%) | 4,110ms (21.4%) |
2 | 12,280ms | 4,480ms (36.5%) | 1,970ms (16.0%) | 2,920ms (23.7%) | 2,420ms (19.7%) |
3 | 15,740ms | 5,710ms (36.3%) | 2,610ms (16.6%) | 3,670ms (23.3%) | 3,210ms (20.4%) |
4 | 18,450ms | 6,690ms (36.2%) | 3,170ms (17.1%) | 4,230ms (22.9%) | 3,690ms (20.0%) |
5 | 10,580ms | 3,640ms (34.4%) | 1,890ms (17.9%) | 2,290ms (21.6%) | 2,400ms (22.7%) |
Amended code to keep a copy of the matrix of distances between all pairs of locations and the last shift. This allows us to update the distance matrix for the next shift simply by subtracting the change in shift from all elements [commit 0d6c008].
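As a stand-alone sketch of the idea (hypothetical helper names, not the code in commit 0d6c008):

```r
# Sketch of the caching idea: keep the signed location-difference matrix and the
# last shift, and update by subtracting only the change in shift.
make_shift_dist_updater <- function(locs1, locs2) {
  diff_matrix <- outer(locs1, locs2, "-")  # all pairwise differences at zero shift
  last_shift <- 0
  function(shift) {
    # Shifting histogram 2 by `shift` changes every difference by the same amount,
    # so only the change since the last shift needs subtracting
    diff_matrix <<- diff_matrix - (shift - last_shift)
    last_shift <<- shift
    abs(diff_matrix)  # pairwise distances at the current shift
  }
}

# Usage: successive shifts reuse the cached matrix instead of recomputing outer()
update_dists <- make_shift_dist_updater(c(0, 1.5, 3), c(0.2, 2.7))
d1 <- update_dists(0.5)
d2 <- update_dists(1.0)
```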
Total time is the time spent in `min_emd_exhaustive`.
Run | Total | Create ECMF | ECMF area | All dist. | Positive dist. |
---|---|---|---|---|---|
1 | 1,310ms | 670ms (51.1%) | 300ms (22.9%) | 160ms (12.2%) | 130ms (9.9%) |
The above timings look suspicious. The code changes were restricted to potentially speeding up the "All dist." computation. It is conceivable that there may have been some effect on the "Positive dist." computation, but the apparent speedup in the "Create ECMF" and "ECMF area" computations is basically impossible. The code to calculate these has not changed, nor has the number of times these calculations need to be performed.
Reverted to the slow code [commit de1100d] to validate my initial pre-speedup timings and found an odd impact on execution speed depending on whether I call `net_emd` (around 15s) vs calling the underlying `min_emd` function directly (around 0.6s). Something funny is going on here.
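A rough sketch of the comparison (the exact `min_emd()` argument list is an assumption here; `dhist1`/`dhist2` are a single pair of "dhist" histograms):

```r
# Timing the two call paths on the same pair of histograms
system.time(net_emd(dhist1, dhist2))  # observed at roughly 15s
system.time(min_emd(dhist1, dhist2))  # observed at roughly 0.6s
```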
Capturing the `net_emd` function here for reference.
```r
net_emd <- function(dhists1, dhists2, method = "optimise",
                    return_details = FALSE, smoothing_window_width = 0) {
  # Require either a pair of "dhist" discrete histograms or two lists of "dhist"
  # discrete histograms
  pair_of_dhist_lists <- all(purrr::map_lgl(dhists1, is_dhist)) && all(purrr::map_lgl(dhists2, is_dhist))
  # If input is two lists of "dhist" discrete histograms, determine the minimum
  # EMD and associated offset for pairs of histograms taken from the same
  # position in each list
  if (pair_of_dhist_lists) {
    details <- purrr::map2(dhists1, dhists2, function(dhist1, dhist2) {
      net_emd_single_pair(dhist1, dhist2, method = method,
                          smoothing_window_width = smoothing_window_width)
    })
    # Collect the minimum EMDs and associated offsets for all histogram pairs
    min_emds <- purrr::simplify(purrr::transpose(details)$min_emd)
    min_offsets <- purrr::simplify(purrr::transpose(details)$min_offset)
    # The NetEMD is the arithmetic mean of the minimum EMDs for each pair of
    # histograms
    arithmetic_mean <- sum(min_emds) / length(min_emds)
    net_emd <- arithmetic_mean
    # Return just the NetEMD or a list including the NetEMD plus the details of
    # the minimum EMD and associated offsets for the individual histograms
    # Note that the offsets represent shifts after the histograms have been
    # scaled to unit variance
    if (return_details) {
      return(list(net_emd = net_emd, min_emds = min_emds, min_offsets = min_offsets))
    } else {
      return(arithmetic_mean)
    }
  } else {
    # Wrap each member of a single pair of histograms in a list and recursively
    # call this net_emd function. This ensures they are treated the same.
    return(net_emd(list(dhists1), list(dhists2), method = method,
                   return_details = return_details,
                   smoothing_window_width = smoothing_window_width))
  }
}
```
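For reference, a minimal usage sketch of the two dispatch paths; the `dhist()` constructor arguments shown are assumptions, not confirmed against the package documentation:

```r
# Two toy discrete histograms (the dhist() locations/masses arguments are assumed)
dhist1 <- dhist(locations = c(0, 1, 2), masses = c(1, 2, 1))
dhist2 <- dhist(locations = c(0.5, 1.5, 2.5), masses = c(1, 2, 1))

# Single pair: wrapped in single-element lists and dispatched recursively
net_emd(dhist1, dhist2)

# Two lists of histograms: returns the mean of the per-pair minimum EMDs;
# return_details = TRUE also gives the per-pair EMDs and offsets
net_emd(list(dhist1, dhist1), list(dhist2, dhist2), return_details = TRUE)
```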
All time spent in `net_emd` appears to be fully accounted for by the time spent calling `min_emd_exhaustive` in the underlying call to `min_emd`. See profiler screenshots below.
`net_emd` spends 17,520ms calling `net_emd_single_pair` ...

... which spends 17,520ms calling `min_emd` ...

... which spends 17,520ms calling `min_emd_exhaustive`.
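For reference, the same breakdown can be reproduced with base R's profiler (a sketch; inputs as above):

```r
# Profile the slow call path and summarise by total time per function
Rprof("net_emd_profile.out", interval = 0.01)
net_emd(dhist1, dhist2)
Rprof(NULL)
head(summaryRprof("net_emd_profile.out")$by.total)
```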
Hmmm. The difference between calling `net_emd` and `min_emd` has now basically disappeared, without me doing anything explicit. This coincides with the speed-up code changes showing up in git as unstaged changes. I suspect something funny is going on when switching branches while the changed files are open in RStudio.
Edit: That was not the issue. The odd speed difference between calling `net_emd` and calling `min_emd` persists even after confirming that the code executed is the old "slow" code. The same speed difference persists between calling `net_emd_single_pair` and `min_emd`, which completely rules out anything odd occurring with wrapping a single pair of histograms into two single-member lists of histograms in `net_emd`.
Isolated the speed difference to the fact that, when called through `net_emd_single_pair`, the discrete histograms are normalised to unit mass and unit variance via `normalise_dhist_variance(normalise_dhist_mass(dhist))`. This operation itself does not take any appreciable time. However, exhaustively computing the minimal EMD appears to be an order of magnitude slower for histograms that have been normalised.
This can't be an integer-versus-floating-point speed difference, as the histograms are created with random floating-point locations and masses.
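A hedged sketch of the isolating experiment; the `dhist()` constructor and the `min_emd()` call signature are assumptions, while the normalisation helpers are those named above:

```r
set.seed(42)
# Histograms with random floating-point locations and masses (constructor assumed)
dhist1 <- dhist(locations = runif(50, 0, 100), masses = runif(50))
dhist2 <- dhist(locations = runif(50, 0, 100), masses = runif(50))

normalise <- function(dh) normalise_dhist_variance(normalise_dhist_mass(dh))

system.time(min_emd(dhist1, dhist2))                        # raw histograms
system.time(min_emd(normalise(dhist1), normalise(dhist2)))  # ~10x slower observed
```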
Even with the speed-up, the exhaustive calculation is prohibitively slow. Running the 276 NetEMD comparisons for our random-graph test subset did not complete even after running for at least 4 solid days on a late 2015 Retina iMac (4-core 4GHz Intel Core i7, 32GB RAM).
All pairings of the 5,000-node / density-20 graphs in the R package's `inst/extdata/random` folder at this R-code commit (i.e. matching filename pattern `.5000.`). These are the graphs from the RG1-3 dataset with filename pattern `Type_5000_20_n` (with n in range 1-3), and they have approximately 5,000 nodes and 50,000 edges.
The new exhaustive EMD minimisation method is much slower than both the optimise-based method and the previous fixed-step method. Identify bottlenecks and any potential speed-ups.
Likely potential candidates are:
- Creating a new `dhist_ecmf` (`approxfun`) object after each shift (see the micro-benchmark sketch below)
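To illustrate why this is a plausible bottleneck, a small stand-alone micro-benchmark (not package code) comparing re-building an `approxfun` object for every shift against building it once and evaluating it at shifted points:

```r
# Illustrative only: a step-function "ECMF" over 10,000 knots
x <- sort(runif(1e4))
y <- cumsum(runif(1e4)); y <- y / max(y)
shifts <- runif(200)

rebuild_each_shift <- function(shifts) {
  for (s in shifts) {
    ecmf <- stats::approxfun(x + s, y, method = "constant", rule = 2)
    ecmf(0.5)
  }
}

shift_evaluation_only <- function(shifts) {
  ecmf <- stats::approxfun(x, y, method = "constant", rule = 2)
  for (s in shifts) {
    ecmf(0.5 - s)  # shifting the function right by s == evaluating at (t - s)
  }
}

system.time(rebuild_each_shift(shifts))     # pays approxfun() construction 200 times
system.time(shift_evaluation_only(shifts))  # constructs once, evaluates 200 times
```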