proywm opened this issue 7 years ago
Hello,
Thank you for reporting the issue. I just pushed a commit that should fix the problem. In brief, the latencies of hardware contexts were incorrectly clustered together with the 0 self-latencies, so mctop did not create separate groups for a single hardware context versus the two hardware contexts that share a core.
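For illustration, here is a minimal sketch of gap-based 1-D latency clustering. This is not mctop's actual algorithm; the gap threshold and the sample latencies are made-up values that only roughly match the cluster ranges reported in this thread. It shows how sibling latencies recorded as 0 collapse the self and same-core clusters into one:

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp(const void *a, const void *b)
    {
        return (*(const int *)a > *(const int *)b) -
               (*(const int *)a < *(const int *)b);
    }

    /* Sort the samples and start a new cluster whenever the gap between
     * consecutive latencies exceeds `gap` cycles. */
    static void cluster(int *lat, int n, int gap)
    {
        qsort(lat, n, sizeof *lat, cmp);
        int c = 0;
        printf("cluster %d: %d", c, lat[0]);
        for (int i = 1; i < n; i++) {
            if (lat[i] - lat[i - 1] > gap)
                printf("\ncluster %d: %d", ++c, lat[i]);
            else
                printf(" %d", lat[i]);
        }
        printf("\n");
    }

    int main(void)
    {
        /* Expected: self (0), SMT siblings (~16-18), remote (~460+). */
        int good[]  = { 0, 16, 17, 18, 460, 465, 470 };
        /* With the bug, sibling latencies were recorded as 0, so the
         * first two clusters merge and the core-level group is lost. */
        int buggy[] = { 0, 0, 0, 0, 460, 465, 470 };
        puts("correct sibling latencies:");
        cluster(good, 7, 10);
        puts("buggy sibling latencies:");
        cluster(buggy, 7, 10);
        return 0;
    }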
Please let me know if mctop now works properly on this machine.
Vasilis.
By the way, there is huge variance across sockets:
#2 : size 162 / range 456 - 972 / median: 690
and even the minimum 1-hop latency is 456 cycles, which is much higher than what I have ever seen.
I would be very interested to see the output of mctop (if successful and you would like to share) :-)
Thanks for the quick reply. Now it's getting a segmentation fault. Sometimes it causes a "Floating point exception".
############################################################## Segmentation fault
From what I see, it crashed on "Calculating cache latencies / sizes," which is not an essential part of the topology creation. I disabled this step for now (through a commit) and I will try to investigate when I have the time.
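For what it's worth, a common way to implement such a step (sketched below under my own assumptions, not necessarily how mctop's disabled step works) is to pointer-chase through buffers of increasing size and watch the per-access latency jump at each cache boundary:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Average latency of a dependent-load chain over a buffer of
     * n_elems size_t slots.  A single-cycle random permutation
     * (Sattolo's algorithm) defeats the hardware prefetcher. */
    static double chase_ns(size_t n_elems)
    {
        size_t *next = malloc(n_elems * sizeof *next);
        for (size_t i = 0; i < n_elems; i++)
            next[i] = i;
        for (size_t i = n_elems - 1; i > 0; i--) {
            size_t j = rand() % i;            /* j in [0, i-1] */
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        volatile size_t idx = 0;
        const size_t steps = 5 * 1000 * 1000;
        struct timespec s, e;
        clock_gettime(CLOCK_MONOTONIC, &s);
        for (size_t i = 0; i < steps; i++)
            idx = next[idx];                  /* serialized loads */
        clock_gettime(CLOCK_MONOTONIC, &e);
        free(next);
        double ns = (e.tv_sec - s.tv_sec) * 1e9 + (e.tv_nsec - s.tv_nsec);
        return ns / steps;
    }

    int main(void)
    {
        /* Latency plateaus mark the L1/L2/L3/DRAM boundaries. */
        for (size_t kb = 16; kb <= 64 * 1024; kb *= 2)
            printf("%6zu KB: %5.1f ns/access\n",
                   kb, chase_ns(kb * 1024 / sizeof(size_t)));
        return 0;
    }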
I hope it will work this time :-)
Thanks for your help.
Unfortunately it's still getting a segfault. I have attached the generated mct file.
########################################################################## Segmentation fault
Thanks, Probir
I really don't like the numbers: they shouldn't be that high on a four-socket machine. (If you replace 908 with 600 in the mct file, loading the topology works.)
Can you send me the output of ./mctop -m0 -r5000 -f2 -v ?
Thanks, it seems that there are a couple of measurements that are off. Even if they were not off, the current implementation of mctop would not work on such an asymmetric topology.
If you are not bored, one more test that you can run is with the manually fixed topology that I described before. In the server.mct file that you shared earlier, replace 908 with 600 and leave the file in the desc folder. Then, you can execute ./mctop -a to get memory latencies and bandwidths. These measurements will show us whether the asymmetry that we see truly exists.
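(For convenience, the replacement can be done with something like sed -i 's/\b908\b/600/g' desc/server.mct, assuming GNU sed and that 908 appears in the file only as that latency value.)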
Thanks again and sorry for all these complications!
One of the ideas that I want to implement at some point (so far I haven't had the need) is a backup plan for when topology creation fails: read the topology from the OS and augment it with measurements.
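As a sketch of what that fallback could look like (my own illustration using the standard Linux sysfs topology files, not existing mctop code; the loop bound of 8 CPUs is a placeholder):

    #include <stdio.h>

    /* Read one integer field from the standard sysfs topology
     * interface, e.g. physical_package_id (socket) or core_id. */
    static int topo_field(int cpu, const char *field)
    {
        char path[128];
        int val = -1;
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, field);
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%d", &val) != 1)
                val = -1;
            fclose(f);
        }
        return val;
    }

    int main(void)
    {
        /* 8 hw contexts as a demo; a real tool would first query the
         * number of online CPUs. */
        for (int cpu = 0; cpu < 8; cpu++)
            printf("cpu%d: socket %d, core %d\n", cpu,
                   topo_field(cpu, "physical_package_id"),
                   topo_field(cpu, "core_id"));
        return 0;
    }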
Vasilis.
Indeed, they are symmetric, with one weaker link each -- very interesting topology :)
My best "guess" is that the problem is due to DVFS (frequency scaling).
For now, I have made the DVFS handling more aggressive. You could try:
./mctop -f2 -c30 -r5000 -d3
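For context, the basic idea behind such DVFS handling is to keep the cores busy until the governor has ramped the frequency up, and only then take measurements. A minimal sketch of that idea, with a made-up spin duration and nothing mctop-specific:

    #include <stdint.h>
    #include <time.h>

    /* Busy-spin for `ms` milliseconds so the DVFS governor raises the
     * core to its full frequency before latency measurements start. */
    static void dvfs_warmup(long ms)
    {
        struct timespec start, now;
        volatile uint64_t sink = 0;
        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
            for (int i = 0; i < 10000; i++)
                sink += i;                      /* keep the core busy */
            clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec - start.tv_sec) * 1000L +
                 (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
    }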
I have some more proper solutions for multi-cores such as this one, but I need to find the time to implement them.
Thanks, once again.
I forgot to mention the -i option of mctop.
./mctop -f2 -c30 -r5000 -d3 -i5
will explicitly try to find a clustering with 5 latency clusters...
Hi Probir,
The "good" news is that mctop
did a very reasonable clustering. The bad news is that there are some outlier values that cannot be clustered together. I wrote two scripts to help us with debugging the problem:
You can invoke:
./scripts/ccbench.sh -x13 -y16 and then ./scripts/ccbench.sh -x14 -y16
./scripts/ccbench.sh -x13 -y50, then ./scripts/ccbench.sh -x13 -y51, and then ./scripts/ccbench.sh -x14 -y50
./scripts/ccbench_map.sh
Essentially, we are measuring some problematic latencies manually, to figure out if it's mctop's problem.
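To make the idea concrete, here is a minimal sketch of the kind of cache-line ping-pong such a measurement performs. The core ids 13 and 16 come from the commands above; the thread setup, rdtsc timing, and iteration count are my own illustrative choices, not ccbench's actual code (x86-only, compile with -pthread):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>

    #define REPS 100000

    static volatile uint64_t line __attribute__((aligned(64)));
    static volatile int turn;        /* 0: writer's turn, 1: reader's */

    static void pin(int core)        /* pin calling thread to a core */
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    static void *writer(void *arg)
    {
        pin(*(int *)arg);
        for (int i = 0; i < REPS; i++) {
            while (turn != 0) ;      /* volatile spin; toy code only,  */
            line = i;                /* real code should use atomics   */
            turn = 1;
        }
        return NULL;
    }

    static void *reader(void *arg)
    {
        pin(*(int *)arg);
        uint64_t total = 0;
        for (int i = 0; i < REPS; i++) {
            while (turn != 1) ;
            uint64_t s = __rdtsc();
            uint64_t v = line;       /* volatile load pulls the dirty
                                      * line across the interconnect */
            total += __rdtsc() - s;
            (void)v;
            turn = 0;
        }
        printf("avg cross-core load: %.1f cycles\n",
               (double)total / REPS);
        return NULL;
    }

    int main(void)
    {
        int x = 13, y = 16;          /* cores as in -x13 -y16 */
        pthread_t a, b;
        pthread_create(&a, NULL, writer, &x);
        pthread_create(&b, NULL, reader, &y);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }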
Output of the scripts:
Well, it is not mctop's problem :-( Look at the latencies with Node 0:
0 <-> 1 : 526 376    1 <-> 0 : 382 536
0 <-> 2 : 611 503    2 <-> 0 : 485 583
0 <-> 3 : 492 335    3 <-> 0 : 378 547
It is faster for other nodes to receive data from Node 0 than for Node 0 to access other nodes.
Other than that:
0 <--> 0 : 105.9  107.3
0 <--> 1 : 526.6  376.2
0 <--> 2 : 611.6  503.9
0 <--> 3 : 492.4  335.2
1 <--> 0 : 382.2  536.0
1 <--> 1 : 120.6  121.6
1 <--> 2 : 843.4  823.6
1 <--> 3 : 781.9  728.2
2 <--> 0 : 485.6  583.9
2 <--> 1 : 824.8  843.0
2 <--> 2 : 104.0  106.8
2 <--> 3 : 786.1  771.8
3 <--> 0 : 378.8  547.9
3 <--> 1 : 766.2  812.3
3 <--> 2 : 777.0  785.3
3 <--> 3 :  97.7
The other nodes are quite reasonably connected.
Bottom line is that the current implementation of mctop does not support this type of asymmetry.
The mctop run is getting aborted. The lstopo and cpuinfo outputs are attached.
MCTOP Settings:
Machine name : server
Output : MCT description file
Repetitions : 2000
Do-memory : Latency+Bandwidth on topology
Mem. size bw : 512 MB
Cluster-offset : 20
Max std dev : 7
Cores : 64
Sockets : 4
Hint : 0 clusters
CPU DVFS : 1 (Freq. up in 160 ms)
Progress : 100.0% completed in 49.3 secs (step took 0.2 secs)
CDF Clusters
0 : size 3 / range 0 - 20 / median: 16
1 : size 16 / range 72 - 108 / median: 94
2 : size 162 / range 456 - 972 / median: 690
##############################################################
CPU is SMT: 1
Lat table
MCTOP output in: ./desc/server.mct
##########################################################################
mctop: src/mctop_topology.c:851: mctop_fix_n_hwcs_per_core_smt: Assertion `gs->type == CORE' failed.
Aborted
OS configuration: Linux server 3.10.0-229.4.2.el7.x86_64 #1 SMP
cpuinfo.txt lstopo.txt