Dear all, this is a replication written with Lionel Tabourier and Maximilien Danisch. It includes the implementation and testing of various graph algorithms. The code is mainly in C++, and we suggest Nicolas P. Rougier as the editor.
Best regards, Fabrice Lécuyer
@lecfab Thank you for this impressive submission. I'll edit it even though I'm not too familiar with the domain. Could you suggest some reviewer names from https://rescience.github.io/board/ or external reviewers?
Hello @rougier, thanks for your reply. After looking into ReScience reviewers, we suggest @EHadoux and @ozancaglayan. For people outside of ReScience whose research interests are closer to our replication, we suggest Antoine Amarilli (https://a3nm.net, a3nm@a3nm.net) or Stefano Leucci (https://www.stefanoleucci.com, stefano.leucci@univaq.it).
Thanks! I would be happy to review it.
@ozancaglayan Thanks. You can start your review while I look for a second reviewer. @EHadoux Could you review this submission?
Hey, let me have a look and I'll revert during the day!
Hey, sorry for the delay, I'd be happy to review it although I'm not familiar with the application domain. I'll refresh my memory on the procedure for reviewing. EDIT: Do you have a link to the original paper by any chance? The scholar one doesn't work http://www1.se.cuhk.edu.hk/~lucan/paper/sigmod16-speedup.pdf
Hello, thank you for reviewing, here is a valid link (I updated my first message above): https://raw.githubusercontent.com/datourat/Gorder/master/paper.pdf
A preliminary review for the paper only; comments on the code and the reproducibility of the experimental pipeline will follow:
This paper replicates the results of "Speedup Graph Processing by Graph Ordering" by Hao Wei, Jeffrey Xu Yu, Can Lu, and Xuemin Lin, published in Proceedings of SIGMOD 2016. The paper is very well written and a pleasure to read, offering thorough analyses across the different ordering methods explored in the original paper. The authors also incorporate a 9th "Random" ordering as the least favorable, i.e. a ~worst-case ordering. I particularly liked the way results are compared in Figure 6. I have very minor questions and suggestions in what follows; none of them is critical:
Questions:
Table 3: I have a hard time deciphering the columns here. What are -r and -mr in columns 2, 4, and 5?
What does it mean that the "percentage columns respectively show the proportion of data that had to be retrieved beyond L1, L2 and L3"? Although the text clarifies these, I think it would be good to rework the table caption a bit.
I was curious how Fig. 5 would look if the x-axis grouped the orderings and the bars reflected the datasets. Maybe the plot would be easier to read, with less oscillation of the bars? Just thinking out loud, never mind if this doesn't make sense.
Other remarks:
Page 3, shortest paths: what does m denote here? It would be good to have it defined at the end of 2.1's beginning paragraph.
Page 5, energy formulations: could these be written as e(abs(\pi_u - \pi_v)), with e() denoting the energy function, for example?
Page 10, line 3.
The compilation of the code is straightforward and pretty quick:
$ git clone --recurse-submodules https://github.com/lecfab/rescience-gorder.git
$ cd rescience-gorder/src
$ make
I tried to run the benchmark script on a Ubuntu 20.04.2 laptop booted with the Linux kernel version 5.8.0.
$ ./run-benchmark.sh
Cache sizes for each processor of this machine:
Processor 0: L0-Data: 32K L1-Instruction: 32K L2-Unified: 256K L3-Unified: 6144K
Processor 1: L0-Data: 32K L1-Instruction: 32K L2-Unified: 256K L3-Unified: 6144K
Processor 2: L0-Data: 32K L1-Instruction: 32K L2-Unified: 256K L3-Unified: 6144K
Processor 3: L0-Data: 32K L1-Instruction: 32K L2-Unified: 256K L3-Unified: 6144K
Processor 4: L0-Data: 32K L1-Instruction: 32K L2-Unified: 256K L3-Unified: 6144K
Processor 5: L0-Data: 32K L1-Instruction: 32K L2-Unified: 256K L3-Unified: 6144K
Processor 6: L0-Data: 32K L1-Instruction: 32K L2-Unified: 256K L3-Unified: 6144K
Processor 7: L0-Data: 32K L1-Instruction: 32K L2-Unified: 256K L3-Unified: 6144K
Warning: impossible to measure cache in unknown machines; please install ocperf (see README) and edit this script to get cache measurements.
Experiments will soon start... (runtime will be measured but not cache-miss)
Results will be found in ../results/r4949
<output stripped>
The warning says that ocperf should be installed and points us to the README file. However, there are no instructions in the README file for installing ocperf. In fact, the ocperf script comes from the pmu-tools submodule, which is already checked out. I think the warning is misleading because the problem is about not passing a command-line argument to the script to identify the machine configuration. The only configuration defined is mesu, which apparently is a cluster configuration that the authors used. I gave that a try as well:
$ ./run-benchmark.sh mesu
...
../pmu-tools/ocperf stat -e task-clock,cpu-cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,cycle_activity.cycles_l1d_pending,cycle_activity.cycles_l2_pending -o ../results/r2118/perf-epinion-original-nq.txt ./benchmark ../datasets/edgelist-epinion-75k-508k.txt -a nq -o ../results/r2118/time-epinion-original-nq.txt -l 10
Cannot run perf
The perf util should be installed on the system. Could you add a simple check in the script to detect whether perf is installed or not and tell the user to install it using the distribution's package manager (a minimal sketch of such a check is below)? For my kernel version, the following command did the trick: apt install linux-tools-5.8.0-48-generic linux-cloud-tools-5.8.0-48-generic
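Just to illustrate what I mean, here is a minimal sketch of such a check (my suggestion, not code taken from the repository):
# Abort early with a helpful message if perf is missing
if ! command -v perf >/dev/null 2>&1; then
    echo "Error: perf not found. Please install it with your distribution's package manager," >&2
    echo "e.g. on Ubuntu: sudo apt install linux-tools-$(uname -r)" >&2
    exit 1
fi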
Let's give the script another try after installing perf:
$ ./run-benchmark.sh mesu
../pmu-tools/ocperf stat -e task-clock,cpu-cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,cycle_activity.cycles_l1d_pending,cycle_activity.cycles_l2_pending -o ../results/r2959/perf-epinion-original-nq.txt ./benchmark ../datasets/edgelist-epinion-75k-508k.txt -a nq -o ../results/r2959/time-epinion-original-nq.txt -l 10
event syntax error: '..oad-misses,cycle_activity.cycles_l1d_pending,cycle_act..'
\___ parser error
Run 'perf list' for a list of valid events
Usage: perf stat [<options>] [<command>]
...
...
Warning: dataset pokec does not exist.
Warning: dataset flickr does not exist.
Warning: dataset livejournal does not exist.
Warning: dataset wiki does not exist.
Warning: dataset gplus does not exist.
Warning: dataset pldarc does not exist.
Warning: dataset twitter does not exist.
Warning: dataset sdarc does not exist.
Creating figures for orders comparison
Relatve runtime and ranking frequency
Warning: 9 datasets miss Gorder on NQ
Warning: 9 datasets miss Gorder on BFS
Warning: 9 datasets miss Gorder on DFS
Warning: 9 datasets miss Gorder on SCC
Warning: 9 datasets miss Gorder on SP
Warning: 9 datasets miss Gorder on PR
Warning: 9 datasets miss Gorder on DS
Warning: 9 datasets miss Gorder on Kcore
Warning: 9 datasets miss Gorder on Diam
Worse score 1
Image has been saved in /home/ozan/git/rescience-gorder/results/r2959/img-speedup.pdf
Image has been saved in /home/ozan/git/rescience-gorder/results/r2959/img-ranking.pdf
CPU execution and cache stall of Gorder and Original on sdarc
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-gorder-nq.txt
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-gorder-nq.txt
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-original-nq.txt
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-original-nq.txt
nq [] []
Traceback (most recent call last):
File "../results/gorder-cache-bars.py", line 31, in <module>
gdata_stall.append(gperf[1] / operf[0])
IndexError: list index out of range
Cache-miss rates for pr on sdarc
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-original-pr.txt
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-original-pr.txt
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-original-pr.txt
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-original-pr.txt
Traceback (most recent call last):
File "../results/gorder-cache-table.py", line 47, in <module>
billions(specs[0]), # number of L1 references
IndexError: list index out of range
Now apparently all the perf calls are failing because of the mesu config not being appropriate for the CPUs that I have. They fail on both an Intel Xeon Silver 4116 and an Intel Skylake i5-8250 CPU. This further leads to errors in the Python scripts, probably because no results are produced for the cache statistics.
Would it be possible to provide a default set of events for perf that'll work on most of the modern machines out there and will produce a subset of the results at least?
I removed the cycle_activity.* events from the perf calls and perf began to work, but only after setting a procfs flag to -1 to allow regular users to profile, or running the script as root (a sketch of the commands is given after these remarks). Could you add this to the README as well? Although this produced some cache-related stats, the Python scripts still failed, probably because of the missing perf events.
gorder-cache-bars.py and gorder-cache-table.py actually compute the plots on the sdarc dataset. However, we are only given the epinion dataset in this repository. I thought that we would at least be able to produce all sets of plots/graphics from the paper, if only on the epinion dataset. Could you clarify this? If epinion doesn't make sense for some experiments simply because it is a quite small/simple dataset, could you document where and how to get the other datasets in the README.md file?
NOTE: After looking through Google, I understand that the cycle_activity.* events are not supported on new Intel CPUs. The list of events for Skylake is here. In some other posts, I see that these events may not be reliable (1, 2). So I wonder whether there are other events that are more robust and supported on newer CPUs? I should note that I am by no means an expert on this profiling stuff.
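For completeness, the procfs flag I am referring to is (as far as I can tell) kernel.perf_event_paranoid; roughly what I did, in case it helps for the README:
# Check the current value (2 or higher blocks most unprivileged profiling)
cat /proc/sys/kernel/perf_event_paranoid
# Temporarily allow regular users to collect the hardware events (resets on reboot)
sudo sysctl -w kernel.perf_event_paranoid=-1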
I don't know what kind of information this gives, but I ran the benchmark twice: (1) on my laptop and (2) on a server. In both cases, the system load was stable and low; no desktop environment or other processes were running. I created a plot that shows all 4 runs (2x laptop, 2x server) side by side, compared to the plot in the paper, on the epinion dataset:
To me, the results look almost compatible across the board. Here are some questions:
ChDFS, InDegSort and RCM seem to be the ones that are not that stable when compared to the paper, in terms of conclusions. (I know that this is just on epinion, but it can still show some interesting insights.)
I'll get started on it this evening; I'll deliberately avoid reading @ozancaglayan's review and any potential answer to it.
Here is a review of the paper, the review of running the code will come shortly.
This paper is a reproduction of Speedup Graph Processing by Graph Ordering by Wei et al. In the original paper, the authors proposed an algorithm- and graph-agnostic node ordering procedure allowing for a reduction of CPU cache misses. The reproduction paper reproduces the experiment section with the same 9 algorithms and 8 datasets. It also contains an additional, smaller dataset.
The paper reads nicely, addressing each point in turn in a non-confusing way, sometimes even adding more information than what the original paper contains. I appreciate that the authors have made their best efforts to reproduce the original implementation and, when not enough information was given, provided a close alternative instead of dropping it out altogether.
Comparing all the results, the authors draw the same conclusions as in the original paper, albeit to a milder extent concerning the performance of the Gorder method. It would have been interesting to discuss the potential reasons for this difference a bit more (if possible).
@ozancaglayan @EHadoux Wow, many thanks for these very clean, fast and detailed reviews!!! We might break a record in time from submission to publication. @lecfab Could you address the comments and questions?
I'll add my review of the code most probably this evening.
Hello all, thanks for the in-depth comments, I will get back to you when I update everything!
Past the very small number of hurdles (see below) the whole process is very smooth. The automatic generation of the charts is greatly appreciated.
I am confident that the results provided in the paper are a true depiction of the outputs of the code provided with it.
Download datasets in datasets/dl/ and use datasets/normalise.py to match the edgelist format. Note that epinion is readily available for quick testing.
Is the epinion file that's already there the output of normalise.py for this dataset? I assume yes, because I can't use it as an input. Also, giving the direct links to the datasets would help. I found a link to a Flickr dataset (http://konect.cc/networks/flickr-growth/) but I am not 100% sure it is the right one.
make doesn't work on Mac with Clang because it seems that concepts aren't available. Installing gcc through brew install gcc and modifying the Makefile solved the problem. Maybe add these instructions, or remove the concepts from the code if they are not mandatory.
The scripts require some Python packages (e.g. Matplotlib); providing a requirements.txt that we can just pip install would be good (see the example below).
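For instance, something along these lines would probably be enough (the package list is a guess based on the plotting scripts; adjust as needed):
$ printf 'matplotlib\nnumpy\n' > requirements.txt
$ pip3 install -r requirements.txt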
$ ./run-window.sh [REPEAT]
tests different window sizes for Gorder on the Flickr dataset
The script actually runs on epinion; it needs to be slightly modified to run on Flickr. However, the modification to make is already in the script, commented out. It is extremely obvious how to change it, but one would need to open the source code.
I used 1 as the parameter, unlike the 100 the authors reported, due to the time it takes to run, and stopped at a window size of 16384 (2^14, versus 2^20 in the paper). You can see the generated output below. We can see the downward trend; combined with the chart on epinion (downwards then upwards), I am confident the chart on Flickr would have looked like the one reported in the paper.
It seems that the paper does not state which dataset has been used to generate Figure 3. I assumed it was epinion, as it is the default setting in run-annealing.sh.
Although the units on the axes are different, the shapes of both graphs are very similar to the reported ones.
I cannot run it on Mac as the script uses /proc/cpuinfo, which does not exist on Mac. I'll run it if the authors provide an alternative script that caters for Mac.
I have yet to run ./benchmark but, glancing at the code, I am confident that what has been committed has been generated by the C++ code, if the same Python/Bash scripts have been used to collate the results.
Hello! Here are the first answers to your interesting remarks, and the corrections I made to the paper. The new version is not yet uploaded on GitHub. Feel free to react to any of my comments below!
I will work on the code and your replications as soon as possible.
Could bars in figure 5 be grouped by order instead of by dataset?
Nice suggestion! I tried it, and it turns out to be helpful for comparing the efficiency of the orderings. It may be misleading though, because the time scale is not the same for different datasets. I will add this figure to the supplementary material: img-speedup-grouped.pdf
In simulated annealing, how is the variation of energy e related to the total energy EE?
If indices of u and v are swapped, we could write (for MinLA):
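Assuming the MinLA total energy is the sum of |\pi_a - \pi_b| over all edges (a, b), and ignoring the corner case where u and v are themselves adjacent, the expression is presumably:
\Delta e = \sum_{w \in N(u)} \left( |\pi_v - \pi_w| - |\pi_u - \pi_w| \right) + \sum_{w \in N(v)} \left( |\pi_u - \pi_w| - |\pi_v - \pi_w| \right)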
The first sum is the energy variation for u, and the second is for v. In each sum, the first term adds the energy due to the new index and subtracts the energy due to the old index. So e represents the change in the contribution of u and v to EE. Is this any clearer?
Runtime and cache measurements could be separated
Indeed, it would be clearer with separate scripts. However, the full experiment takes about a week, so we thought it was better to run it only once and measure both time and cache-miss rates.
Do you plot the mean or the median of 10 repetition runtimes?
We plot the sum of 10 repetitions (which corresponds to averaging).
Can some of the 10 repetitions be outliers because they include dataset loading?
The algorithms are repeated 10 times after loading the graph into RAM, so this should not be the issue. Yet, we had many issues with fluctuations: on our machines, repeating the same procedure could show up to 20% variability in runtime (due to process competition, cache or RAM sharing, multithreading...). As we are not specialists in these issues, we ended up running everything on an isolated node (called mesu indeed!).
Note that epinion is even more subject to fluctuations because it is small and algorithms only take a fraction of a second.
It's curious that for NQ, the server with more CPUs slowed down the runtime of three algos. Are the implementations thread/process aware? For the paper, those algos apparently perform as good as Gorder since we don't have bars in the plot, right?
It means that the runtimes are very similar (under 10% variation). Note that epinion is probably too small to show the effect of a wise ordering. On my machine, NQ is also the algorithm with the most computation time (Figure 1) and the least cache stall, so the fluctuations on this small dataset may be more significant than the speedup due to ordering.
Our code uses only one thread but is not protected from other threads, which is maybe a reason why we had to use an isolated cluster to avoid heavy fluctuations. I believe the 6 bars above 2.2 are a sign of such unpredictable fluctuations.
Overall, ChDFS, InDegSort and RCM seem to be the ones that are not that stable when compared to the paper, in terms of conclusions.
I'm not sure: in your results, we see that these 3 orderings have small or invisible bars in either direction. It means that they are quite close to the performance of Gorder and sometimes better. This sort of matches my results, though I think we reach the limit of what epinion allows in terms of analysis!
Why are ordering times so different between the original work and the replication? (table 2)
While coding MinLA, MinLoga, Slashburn and LDG, we had to make a lot of choices and assumptions because the original paper did not explain much. (For instance, our Slashburn is a very simplified version, quite similar to Kcore.) For this reason, the ordering times cannot be compared across the two papers. The remaining orderings (RCM, Deg, DFS) are quite standard, so comparing them is reasonable. Our Gorder was heavily based on the code provided by the original authors, so the comparison stands as well. With this restriction, our hardware is 2 to 5 times slower, which we think makes sense.
Is there a speed/memory ratio below which this optimisation makes the improvement marginal?
On most architectures, part of the data is physically closer to the processor and thus faster to retrieve. I'm really not an expert in hardware, but I would think that if processor speed is increased, most of the time will be spent in data retrieval (cache stall), which in Figure 1 would make the grey bars tiny. In that case, cache optimisation is even more important. On the contrary, if the CPU is slow, big amounts of distant data can be fetched while the CPU performs an operation, so cache optimisation is not meaningful.
Comparing Figure 1 and Table 3 in both papers, it looks like the reproduction implementation has fewer cache misses overall
I think the rates across machines cannot be compared without knowing the details of hardware architecture.
In the « limits of the visualisation », is there a permutation of rankings beyond x1.5 that would make the overall ranking exactly the same between the two papers?
No, indeed the approximation of cutting everything above x1.5 only affects the poor performers, and we see that good performers do not rank equally in both papers. To try to be fair, we gave the same rank to all the out-of-limits orderings (if there are 10 orderings and 6 of them are beyond x1.5, we rank them all fifth) so that Fig. 6 is deterministic.
Major modifications are in brown in the updated paper of the repository.
A MACHINE setting with basic and advanced modes activates different metrics: basic is enough for Table 3, while Figure 1 requires advanced.
Add a section on top in the readme with the exact steps required to reproduce the paper
I intended the Quick guide section to achieve this; could you please suggest what to add in it to complete it?
-fconcepts in the Makefile (seems harmless?)
Added pip3 install matplotlib numpy to the readme.
I used 1 as the parameter, unlike the 100 the authors reported, due to the time it takes to run, and stopped at a window size of 16384 (2^14, versus 2^20 in the paper). You can see the generated output below. We can see the downward trend; combined with the chart on epinion (downwards then upwards), I am confident the chart on Flickr would have looked like the one reported in the paper.
If possible time-wise, it would be interesting to test with at least 3 repetitions (on Flickr) to temper sudden fluctuations. I think they are responsible for the big peaks in the figure.
I cannot run it on Mac as the script uses /proc/cpuinfo that does not exist on Mac.
Could you please provide the error message that this generates? Missing processor info should not stop the script, but I'm not familiar with Mac (maybe some shell functions are different).
@EHadoux: sorry for the Mac-related problems! For run-benchmark.sh, I think it works if you remove the whole "display cache size information" section (lines 4 to 21); a sketch of an alternative guard is below.
If you want to use other datasets, their links are given in the replication paper (in Table 1). I will add them to the readme though.
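For illustration, the kind of guard I have in mind (a sketch, not the current content of the script):
# Only run the Linux-specific cache report when /proc/cpuinfo is available
if [ -r /proc/cpuinfo ]; then
    : # the existing "display cache size information" section (lines 4 to 21) goes here
else
    echo "Note: /proc/cpuinfo not available on this system; skipping the cache size report."
fi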
Cheers for that, I'll check it soon. For the concepts bits (I've seen you've put a ?), just in case it wasn't clear, you compile with the -fconcepts flag which is not available on Clang. If you don't use the notion of concepts in your code (I haven't checked), you can probably remove the flag altogether and not have to bother with instructions for Mac users.
Thanks again for all your comments, @ozancaglayan @EHadoux. We think all remarks and questions have been taken into account in this new version. Please let us know if further improvements are possible.
@lecfab Thanks. @ozancaglayan @EHadoux If you're satisfied with the corrections and answers, I think we can accept the paper. I'm waiting for your go, and let me thank you again for your really nice and detailed reviews.
Hello,
Yes I am satisfied with the answers, thanks! Figure 5 now looks much better and the version in the supplement is a nice addition as well. Thank you for clarifying installation stuff in the README files and the scripts as well!
Hey, sorry for the delay, it's good to go on my side. Nice work everyone!
@lecfab Congratulations, your paper has been accepted! I'll edit it soon and come back to you.
I would need a link to your article sources
Thank you all for this efficient process. Here is the [link]() to the article sources in overleaf.
Sorry for the delay. I edited your post to remove the link to Overleaf because it is an editable link and anybody could modify your manuscript. Would it be possible for you to create a GitHub repository with the source of the article, using the latest template at https://github.com/ReScience/template? It would make my life easier for publication.
https://github.com/lecfab/rescience-gorder/tree/main/paper
I filled in the files but could not compile them into a PDF because of the following errors. But I presume it will work on your side after completing the last pieces of information in metadata.yaml.
It's working on my side so it is fine. Also, I forgot: can you register your code at https://www.softwareheritage.org/save-and-reference-research-software/ and post the SWHID here?
swh:1:dir:e318a0ad72f81e2cb2af1ca614d1c171dd3f0909 https://archive.softwareheritage.org/swh:1:dir:e318a0ad72f81e2cb2af1ca614d1c171dd3f0909
Perfect, thanks. Can you check the article at https://sandbox.zenodo.org/record/830143 (this is a sandboxed version not yet published)?
@rougier This looks good to me thanks!
Cool! It is now online at https://zenodo.org/record/4836230/ and will soon appear on ReScience C website. Congratulations!
That was a very smooth process, and thank you for the in-depth reviews! Cheers everyone, Fabrice
I forgot to make a PR to your repo with the publication changes. Will do in a few minutes.
I validated it and updated the replication.pdf!
Original article: Speedup Graph Processing by Graph Ordering by Hao Wei, Jeffrey Xu Yu, Can Lu, and Xuemin Lin, in Proceedings of SIGMOD 2016
PDF URL: repository/replication.pdf
Metadata URL: repository/metadata.tex
Code URL: github repository
Scientific domain: Algorithmics
Programming language: C++, Python 3, Bash
Suggested editor: Nicolas P. Rougier