ReScience C submissions

[Re] Speedup Graph Processing by Graph Ordering #52

Closed: lecfab closed this issue 3 years ago

lecfab commented 3 years ago

Original article: Speedup Graph Processing by Graph Ordering by Hao Wei, Jeffrey Xu Yu, Can Lu, and Xuemin Lin, in Proceedings of SIGMOD 2016

PDF URL: repository/replication.pdf
Metadata URL: repository/metadata.tex
Code URL: github repository

Scientific domain: Algorithmics
Programming language: C++, Python 3, Bash
Suggested editor: Nicolas P. Rougier

lecfab commented 3 years ago

Dear all, This is a replication written with Lionel Tabourier and Maximilien Danisch. It includes the implementation and testing of various graph algorithms. The code is mainly in C++, and we suggest Nicolas P. Rougier as the editor.

Best regards, Fabrice Lécuyer

rougier commented 3 years ago

@lecfab Thank you for this impressive submission. I'll edit it even though I'm not too familiar with the domain. Could you suggest some reviewer names from https://rescience.github.io/board/ or external reviewers?

lecfab commented 3 years ago

Hello @rougier, thanks for your reply. After looking into ReScience reviewers, we suggest @EHadoux and @ozancaglayan. For people outside of ReScience whose research interests are closer to our replication, we suggest Antoine Amarilli (https://a3nm.net, a3nm@a3nm.net) or Stefano Leucci (https://www.stefanoleucci.com, stefano.leucci@univaq.it).

ozancaglayan commented 3 years ago

Thanks! I would be happy to review it.

rougier commented 3 years ago

@ozancaglayan Thanks. You can start your review while I look for a second reviewer. @EHadoux Could you review this submission?

EHadoux commented 3 years ago

Hey, let me have a look and I'll get back to you during the day!

EHadoux commented 3 years ago

Hey, sorry for the delay, I'd be happy to review it although I'm not familiar with the application domain. I'll refresh my memory on the procedure for reviewing. EDIT: Do you have a link to the original paper by any chance? The Scholar one doesn't work: http://www1.se.cuhk.edu.hk/~lucan/paper/sigmod16-speedup.pdf

lecfab commented 3 years ago

Hello, thank you for reviewing, here is a valid link (I updated my first message above): https://raw.githubusercontent.com/datourat/Gorder/master/paper.pdf

ozancaglayan commented 3 years ago

A preliminary review of the paper only; comments on the code and the reproducibility of the experimental pipeline will follow:

This paper replicates the results of "Speedup Graph Processing by Graph Ordering" by Hao Wei, Jeffrey Xu Yu, Can Lu, and Xuemin Lin, published in the Proceedings of SIGMOD 2016. The paper is very well written and is a pleasure to read, offering thorough analyses across the different ordering methods explored in the original paper. The authors also incorporate a 9th "Random" ordering as the least favorable, i.e. a roughly worst-case ordering. I particularly liked the way the results are compared in Figure 6. I have very minor questions and suggestions in what follows; none of them is critical:

Questions:

Other remarks:

Page 3. shortest paths

Page 5. Energy formulations

Page 10 - line 3.

ozancaglayan commented 3 years ago

Running the code

Steps followed

The compilation of the code is straightforward and pretty quick:

$ git clone --recurse-submodules https://github.com/lecfab/rescience-gorder.git
$ cd rescience-gorder/src
$ make

I tried to run the benchmark script on a Ubuntu 20.04.2 laptop booted with the Linux kernel version 5.8.0.

$ ./run-benchmark.sh
Cache sizes for each processor of this machine:
  Processor 0:  L0-Data: 32K    L1-Instruction: 32K     L2-Unified: 256K    L3-Unified: 6144K   
  Processor 1:  L0-Data: 32K    L1-Instruction: 32K     L2-Unified: 256K    L3-Unified: 6144K   
  Processor 2:  L0-Data: 32K    L1-Instruction: 32K     L2-Unified: 256K    L3-Unified: 6144K   
  Processor 3:  L0-Data: 32K    L1-Instruction: 32K     L2-Unified: 256K    L3-Unified: 6144K   
  Processor 4:  L0-Data: 32K    L1-Instruction: 32K     L2-Unified: 256K    L3-Unified: 6144K   
  Processor 5:  L0-Data: 32K    L1-Instruction: 32K     L2-Unified: 256K    L3-Unified: 6144K   
  Processor 6:  L0-Data: 32K    L1-Instruction: 32K     L2-Unified: 256K    L3-Unified: 6144K   
  Processor 7:  L0-Data: 32K    L1-Instruction: 32K     L2-Unified: 256K    L3-Unified: 6144K   
Warning: impossible to measure cache in unknown machines; please install ocperf (see README) and edit this script to get cache measurements.
Experiments will soon start... (runtime will be measured but not cache-miss)
Results will be found in ../results/r4949 
<output stripped>

Remarks

The only configuration defined is mesu, which apparently is a cluster configuration that the authors used. I gave that a try as well:

$ ./run-benchmark.sh mesu
...
../pmu-tools/ocperf stat -e task-clock,cpu-cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,cycle_activity.cycles_l1d_pending,cycle_activity.cycles_l2_pending -o ../results/r2118/perf-epinion-original-nq.txt ./benchmark ../datasets/edgelist-epinion-75k-508k.txt -a nq -o ../results/r2118/time-epinion-original-nq.txt -l 10
Cannot run perf

Remarks

Let's give the script another try after installing perf:

$ ./run-benchmark.sh mesu
../pmu-tools/ocperf stat -e task-clock,cpu-cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,cycle_activity.cycles_l1d_pending,cycle_activity.cycles_l2_pending -o ../results/r2959/perf-epinion-original-nq.txt ./benchmark ../datasets/edgelist-epinion-75k-508k.txt -a nq -o ../results/r2959/time-epinion-original-nq.txt -l 10
event syntax error: '..oad-misses,cycle_activity.cycles_l1d_pending,cycle_act..'
                                  \___ parser error
Run 'perf list' for a list of valid events

 Usage: perf stat [<options>] [<command>]
...
...
Warning: dataset pokec does not exist.
Warning: dataset flickr does not exist.
Warning: dataset livejournal does not exist.
Warning: dataset wiki does not exist.
Warning: dataset gplus does not exist.
Warning: dataset pldarc does not exist.
Warning: dataset twitter does not exist.
Warning: dataset sdarc does not exist.

Creating figures for orders comparison
Relatve runtime and ranking frequency
Warning: 9 datasets miss Gorder on NQ
Warning: 9 datasets miss Gorder on BFS
Warning: 9 datasets miss Gorder on DFS
Warning: 9 datasets miss Gorder on SCC
Warning: 9 datasets miss Gorder on SP
Warning: 9 datasets miss Gorder on PR
Warning: 9 datasets miss Gorder on DS
Warning: 9 datasets miss Gorder on Kcore
Warning: 9 datasets miss Gorder on Diam
Worse score  1
Image has been saved in /home/ozan/git/rescience-gorder/results/r2959/img-speedup.pdf
Image has been saved in /home/ozan/git/rescience-gorder/results/r2959/img-ranking.pdf
CPU execution and cache stall of Gorder and Original on sdarc
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-gorder-nq.txt
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-gorder-nq.txt
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-original-nq.txt
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-original-nq.txt
nq [] []
Traceback (most recent call last):
  File "../results/gorder-cache-bars.py", line 31, in <module>
    gdata_stall.append(gperf[1] / operf[0])
IndexError: list index out of range
Cache-miss rates for pr on sdarc
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-original-pr.txt
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-original-pr.txt
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-original-pr.txt
Error with /home/ozan/git/rescience-gorder/results/r2959/perf-sdarc-original-pr.txt
Traceback (most recent call last):
  File "../results/gorder-cache-table.py", line 47, in <module>
    billions(specs[0]),         # number of L1 references
IndexError: list index out of range

Remarks

Now apparently all the perf calls are failing because the mesu config is not appropriate for the CPUs that I have. They fail on both Intel Xeon Silver 4116 and Intel Skylake i5-8250 CPUs. This in turn leads to errors in the Python scripts, probably because no results are produced for the cache statistics.

NOTE: After looking through Google, I understand that the cycle_activity.* events are not supported on newer Intel CPUs. The list of events for Skylake is here. In some other posts, I see that these events may not be reliable, i.e. 1, 2. So I wonder whether there are other events that are more robust and supported on newer CPUs? I should note that I am by no means an expert on this profiling stuff.

Comparing the results

I don't know how much information this gives, but I ran the benchmark twice: (1) on my laptop and (2) on a server. In both cases, the system load was stable and low, and no desktop environment or other processes were running. I created a plot that shows all 4 runs (2x laptop, 2x server) side by side, compared to the plot in the paper, on the epinion dataset:

gorder

To me, the results look almost compatible across the board. Here are some questions:

(I know this is just on epinion, but it can still give us some interesting insights.)

EHadoux commented 3 years ago

I'll get started on it this evening. On purpose, I won't look at @ozancaglayan's review or any answers to it.

EHadoux commented 3 years ago

Here is a review of the paper; the review of running the code will come shortly.


This paper is a reproduction of "Speedup Graph Processing by Graph Ordering" by Wei et al. In the original paper, the authors propose an algorithm- and graph-agnostic node ordering procedure that reduces CPU cache misses. The reproduction paper reproduces the experimental section on the same 9 algorithms and 8 datasets. It also includes an additional, smaller dataset.

Review

The paper reads nicely, addressing each point in turn in a non-confusing way, sometimes even adding more information than the original paper contains. I appreciate that the authors have made their best efforts to reproduce the original implementation and, when not enough information was given, provided a close alternative instead of dropping it altogether.

Comparing all the results, the authors draw the same conclusions as in the original paper, albeit to a milder extent concerning the performance of the Gorder method. It would have been interesting to discuss a bit more (if possible) the potential reasons for this difference.

Remarks

Questions

rougier commented 3 years ago

@ozancaglayan @EHadoux Wow, many thanks for these very clean, fast and detailed reviews!!! We might break a record in time from submission to publication. @lecfab Could you address the comments and questions?

EHadoux commented 3 years ago

I'll add my review of the code most probably this evening.

lecfab commented 3 years ago

Hello all, thanks for the in-depth comments, I will get back to you when I have updated everything!

EHadoux commented 3 years ago

Review of the code and the instructions

General review

Past the very small number of hurdles (see below), the whole process is very smooth. The automatic generation of the charts is greatly appreciated.

I am confident that the results provided in the paper are a true depiction of the outputs of the code provided with it.

Specific remark

Download datasets in datasets/dl/ and use datasets/normalise.py to match the edgelist format. Note that epinion is readily available for quick testing.

Is the epinion file that's already there the output of normalise.py for this dataset? I assume yes, because I can't use it as an input. Also, giving direct links to the datasets would help. I found a link to a Flickr dataset (http://konect.cc/networks/flickr-growth/) but I am not 100% sure it is the right one.

$ ./run-window.sh [REPEAT] tests different window sizes for Gorder on Flickr dataset

The script actually runs on epinion; it needs to be slightly modified to run on Flickr. However, the modification to make is already in the script, commented out. It is extremely obvious how to change it, but one would need to open the source code.

Using the provided scripts only

run-window

I used 1 as the repetition parameter instead of the 100 the authors report, due to the time it takes to run, and stopped at a window size of 16384 (2^14, vs 2^20 in the paper). You can see below the output generated. We can see the downward trend; combined with the chart on epinion (downward then upward), I am confident the chart on Flickr would have looked like the one reported in the paper.

tune-w-flickr-pr tune-w-epinion-pr

run-annealing

It seems that the paper does not state which dataset has been used to generate Figure 3. I assumed it was epinion as it is the default setting in run-annealing.sh.

Although the units on the axes are different, the shapes of both graphs are very similar to the reported ones.

Capture d’écran 2021-04-17 à 16 49 37

run-benchmark

I cannot run it on a Mac as the script uses /proc/cpuinfo, which does not exist on Mac. I'll run it if the authors provide an alternative script that caters for Mac.

I have yet to run ./benchmark but, glancing at the code, I am confident that what has been committed was generated by the C++ code, provided the same Python/Bash scripts were used to collate the results.

lecfab commented 3 years ago

Hello! Here are the first answers to your interesting remarks, and the corrections I made to the paper. The new version is not yet uploaded to GitHub. Feel free to react to any of my comments below!

I will work on the code and your replications as soon as possible.

First updates after review

Remarks

Ozan Caglayan's comments

  1. Could bars in figure 5 be grouped by order instead of by dataset?

    Nice suggestion, I tried it and it turns out to be helpful to compare the efficiency of the orderings. It may be misleading though, because the time scale is not the same for different datasets. I will add this figure in supplementary material: img-speedup-grouped.pdf

  2. In simulated annealing, how is the variation of energy e related to the total energy E?

    If the indices of u and v are swapped, we can write (for MinLA):
    $$e=\sum_{w\in N_u} (|\pi_v-\pi_w| - |\pi_u-\pi_w|)  +  \sum_{w\in N_v} (|\pi_u-\pi_w| - |\pi_v-\pi_w|)$$
    The first sum is the energy variation for u, and the second is for v. In each sum, the first term adds the energy due to the new index and subtracts the energy due to the old index. So e represents the change in the contribution of u and v to E; a small Python sketch is given after this list. Is this any clearer?

  3. Runtime and cache measurements could be separated

    Indeed, it would be clearer with separate scripts. However, the full experiment takes about a week, so we thought it was better to run them only once and measure both time and cache-miss rates.

  4. Do you plot the mean or the median of 10 repetition runtimes?

    We plot the sum of 10 repetitions (which corresponds to averaging).

  5. Can some of the 10 repetitions be outliers because they include dataset loading?

    The algorithms are repeated 10 times after loading the graph in RAM, so this should not be the issue. Yet, we had many issues with fluctuations: on our machines, repeating the same procedure could show up to 20% of runtime variability (due to process competition, cache or RAM sharing, multithreading...). As we are not specialists in these issues, we ended up running everything on an isolated node (called mesu indeed!).

    Note that epinion is even more subject to fluctuations because it is small and algorithms only take a fraction of a second.

  6. It's curious that for NQ, the server with more CPUs slowed down three algos. Are the implementations thread/process aware? For the paper, those algos apparently perform as well as Gorder since we don't have bars in the plot, right?

    It means that the runtimes are very similar (under 10% of variation). Note that epinion is probably too small to show the effect of a wise ordering. On my machine, NQ is also the algorithm with the most computation time (figure 1) and the least cache stall, so the fluctuations on this small dataset may be more significant than the speedup due to ordering.

    Our code uses only one thread but is not protected from other threads, and that may be one reason why we had to use an isolated cluster to avoid heavy fluctuations. I believe the 6 bars above 2.2 are a sign of such unpredictable fluctuations.

  7. Overall, ChDFS, InDegSort and RCM seem to be the ones that are not that stable compared to the paper, in terms of conclusions.

    I'm not sure: in your results, we see that these 3 orderings have small or invisible bars in either direction. It means that they are quite close to the performance of Gorder and sometimes better. This roughly matches my results, though I think we have reached the limit of what epinion allows in terms of analysis!
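
To make the swap accounting of point 2 concrete, here is a minimal Python sketch; it is not the code of our repository, and adj, pi, u, v are hypothetical names. It computes the MinLA variation e for swapping the indices of u and v; the u-v edge, if it exists, is skipped since its length is unchanged by the swap:

def minla_swap_delta(adj, pi, u, v):
    # adj: dict mapping each node to the list of its neighbours
    # pi: dict mapping each node to its current index in the ordering
    # Returns e, the change of the total energy E if pi[u] and pi[v] are swapped.
    e = 0
    for w in adj[u]:
        if w != v:  # the length of the u-v edge is not affected by the swap
            e += abs(pi[v] - pi[w]) - abs(pi[u] - pi[w])
    for w in adj[v]:
        if w != u:
            e += abs(pi[u] - pi[w]) - abs(pi[v] - pi[w])
    return e

In an annealing step, e is all that is needed to accept or reject the swap, so E never has to be recomputed from scratch.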

Emmanuel Hadoux's comments

  1. Why are ordering times so different between the original work and the replication? (table 2)

    While coding MinLA, MinLoga, Slashburn and LDG, we had to make a lot of choices and assumptions because the original paper did not explain much. (For instance, our Slashburn is a very simplified version, quite similar to Kcore.) For this reason, the ordering time cannot be compared across the two papers. The remaining orderings (RCM, Deg, DFS) are quite standard, so comparing them is reasonable. Our Gorder was heavily based on the code provided by the original authors, so the comparison stands as well. With this restriction, our hardware is 2 to 5 times slower, which we think makes sense.

  2. Is there a speed/memory ratio below which this optimisation makes the improvement marginal?

    On most architectures, part of the data is physically closer to the processor and thus faster to retrieve. I'm really not an expert in hardware, but I would think that if processor speed is increased, most of the time will be spent in data retrieval (cache stall), which in figure 1 would make the grey bars tiny. In that case, cache optimisation is even more important. On the contrary, if the CPU is slow, big amounts of distant data can be fetched while the CPU makes an operation, so cache optimisation is not as meaningful.

  3. Comparing Figure 1 and Table 3 in both papers, it looks like the reproduction implementation has fewer cache misses overall

    I think the rates across machines cannot be compared without knowing the details of hardware architecture.

  4. In the « limits of the visualisation », is there a permutation of rankings beyond x1.5 that would make the overall ranking exactly the same between the two papers?

    No, indeed the approximation of cutting everything above x1.5 only affects the poor performers, and we see that the good performers do not rank equally in both papers. To try and be fair, we gave the same rank to all the out-of-limit orderings (if there are 10 orderings and 6 of them are beyond x1.5, we rank them all fifth) so that Fig. 6 is deterministic; a small sketch of this tie rule is given after this list.
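
For clarity, here is a minimal Python sketch of that tie rule; it is hypothetical code, not the exact script we use. Orderings whose relative runtime stays within the x1.5 limit get their normal rank, and every out-of-limit ordering shares the next rank:

def rank_with_cutoff(relative_times, cutoff=1.5):
    # relative_times: dict mapping an ordering name to its runtime relative to the best ordering
    within = sorted((t, name) for name, t in relative_times.items() if t <= cutoff)
    ranks = {name: i + 1 for i, (t, name) in enumerate(within)}
    shared_rank = len(within) + 1  # e.g. 4 orderings within the limit: the 6 others all rank 5th
    for name, t in relative_times.items():
        if t > cutoff:
            ranks[name] = shared_rank
    return ranks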

Paper

Major modifications are in brown in the updated paper of the repository.

Code

Ozan Caglayan's comments

  1. Add a section on top in the readme with the exact steps required to reproduce the paper

    I intended the Quick guide section to achieve this; could you please suggest what to add to make it complete?

Emmanuel Hadoux's comments

  1. I used 1 as the repetition parameter instead of the 100 the authors report, due to the time it takes to run, and stopped at a window size of 16384 (2^14, vs 2^20 in the paper). You can see below the output generated. We can see the downward trend; combined with the chart on epinion (downward then upward), I am confident the chart on Flickr would have looked like the one reported in the paper.

    If possible time-wise, it would be interesting to test with at least 3 repetitions (on flickr), to temper sudden fluctuations. I think they are responsible for the big peaks in the figure.

  2. I cannot run it on a Mac as the script uses /proc/cpuinfo, which does not exist on Mac.

    Could you please provide the error message that this generates? Missing processor info should not stop the script, but I'm not familiar with Mac (maybe some shell functions are different).

lecfab commented 3 years ago

@EHadoux: sorry for the Mac-related problems! For run-benchmark.sh, I think it works if you remove the whole "display cache size information" section (lines 4 to 21). If you want to use other datasets, their links are given in the replication paper (in Table 1). I will add them to the readme though.

EHadoux commented 3 years ago

@EHadoux: sorry for the Mac-related problems! For run-benchmark.sh, I think it works if you remove the whole "display cache size information" section (lines 4 to 21).

If you want to use other datasets, their links are given in the replication paper (in Table 1). I will add them to the readme though.

Cheers for that, I'll check it soon. For the concepts bit (I've seen you've put a ?): just in case it wasn't clear, you compile with the -fconcepts flag, which is not available on Clang. If you don't use the notion of concepts in your code (I haven't checked), you can probably remove the flag altogether and not have to bother with instructions for Mac users.

lecfab commented 3 years ago

Thanks again for all your comments, @ozancaglayan @EHadoux. We think all remarks and questions have been taken into account in this new version. Please let us know if further improvements are possible.

rougier commented 3 years ago

@lecfab Thanks. @ozancaglayan @EHadoux If you're satisfied with the corrections and answers, I think we can accept the paper. I'm waiting for your go, and let me thank you again for your really nice and detailed reviews.

ozancaglayan commented 3 years ago

Hello,

Yes, I am satisfied with the answers, thanks! Figure 5 now looks much better and the version in the supplement is a nice addition as well. Thank you for clarifying the installation steps in the README files and the scripts as well!

EHadoux commented 3 years ago

Hey, sorry for the delay, it's good to go on my side. Nice work everyone!

rougier commented 3 years ago

@lecfab Congratulations, your paper has been accepted! I'll edit it soon and come back to you.

rougier commented 3 years ago

I would need a link to your article sources.

lecfab commented 3 years ago

Thank you all for this efficient process. Here is the [link]() to the article sources on Overleaf.

rougier commented 3 years ago

Sorry for the delay. I edited your post to remove the link to Overleaf because it is an editable link and anybody could modify your manuscript. Would it be possible for you to create a GitHub repository with the source of the article, using the latest template at https://github.com/ReScience/template? It would make my life easier for publication.

lecfab commented 3 years ago

https://github.com/lecfab/rescience-gorder/tree/main/paper

I filled in the files, but could not compile them into a PDF because of the following error. I presume it will work on your side after completing the last pieces of information in metadata.yaml.

$ make
latexmk -pdf -pdflatex="xelatex -interaction=nonstopmode" -use-make article.tex
Latexmk: This is Latexmk, John Collins, 26 Dec. 2019, version: 4.67.
Latexmk: applying rule 'pdflatex'...
Rule 'pdflatex': The following rules & subrules became out-of-date: 'pdflatex'
------------
Run number 1 of rule 'pdflatex'
------------
Running 'xelatex -interaction=nonstopmode -recorder "article.tex"'
------------
sh: 1: xelatex: not found
Latexmk: fls file doesn't appear to have been made.
Latexmk: Errors, so I did not complete making targets
Collected error summary (may duplicate other messages):
  pdflatex: Command for 'pdflatex' gave return code 127
      Refer to 'article.log' for details
----------------------
This message may duplicate earlier message.
Latexmk: Failure in processing file 'article.tex':
   (Pdf)LaTeX didn't generate the expected log file 'article.log'
----------------------
Latexmk: Use the -f option to force complete processing, unless error was exceeding maximum runs, or warnings treated as errors.
make: *** [Makefile:30 : article.pdf] Erreur 12

rougier commented 3 years ago

It's working on my side so it is fine. Also, I forgot, but can you register your code at https://www.softwareheritage.org/save-and-reference-research-software/ and post the SWID here?

lecfab commented 3 years ago

swh:1:dir:e318a0ad72f81e2cb2af1ca614d1c171dd3f0909 https://archive.softwareheritage.org/swh:1:dir:e318a0ad72f81e2cb2af1ca614d1c171dd3f0909

rougier commented 3 years ago

Perfect, thanks. Can you check the article at https://sandbox.zenodo.org/record/830143 (this is a sandboxed version not yet published)?

lecfab commented 3 years ago

@rougier This looks good to me thanks!

rougier commented 3 years ago

Cool! It is now online at https://zenodo.org/record/4836230/ and will soon appear on ReScience C website. Congratulations!

lecfab commented 3 years ago

That was a very smooth process, and thank you for the in-depth reviews! Cheers everyone, Fabrice

rougier commented 3 years ago

I forgot to make a PR to your repo with the publication changes. Will do in a few minutes.

lecfab commented 3 years ago

I validated it and updated the replication.pdf!