fireice-uk / xmr-stak

Free Monero RandomX Miner and unified CryptoNight miner
GNU General Public License v3.0

Linux - what tools can I use to analyze cache memory map ? #1281

Closed DrStein99 closed 6 years ago

DrStein99 commented 6 years ago

To see what my CPU layout is, I do:

cat /proc/cpuinfo

Which tells me (last core of the big long list):

processor       : 79
vendor_id       : GenuineIntel
cpu family      : 6
model           : 47
model name      : Intel(R) Xeon(R) CPU E7- 4870  @ 2.40GHz
stepping        : 2
microcode       : 0x37
cpu MHz         : 2395.000
cache size      : 30720 KB
physical id     : 3
siblings        : 20
core id         : 25
cpu cores       : 10
apicid          : 243
initial apicid  : 243
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt aes lahf_lm epb tpr_shadow vnmi flexpriority ept vpid dtherm ida arat
bugs            : clflush_monitor
bogomips        : 4787.82
clflush size    : 64
cache_alignment : 64
address sizes   : 44 bits physical, 48 bits virtual
power management:

I can see that core #79 is physically on CPU #3 and shares 30720 KB of cache memory with all the other cores on that CPU. I have spent over 20 hours running tests and I keep finding confusing results.

On the E5-2600 series of CPUs, I can devote XX core to X thread, and when I use more than XXX amount of cache on the processor, the hash rate drops across the whole test. This is NOT the case for the E7 4800 series of processors.

I am actually getting FASTER hash rates by using LESS cache memory. Of the 30 MB (30720 KB) reported for each processor, when I only use 26 MB, the total hash rate is actually HIGHER than when I allocate the whole 30 MB. On the FIRST processor, the hash rate actually peaks using only 24 MB of cache. When I apply the same core affinity configuration from one CPU to the next on the same system, the hash rate is different.

Can anyone here tell me any more tools / commands I can use in Linux to analyze the cores and cache memory map ? Is there a way to read how much cache is in use, and which cores are using how much of it?
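One way to see which logical CPUs share each cache is to read the kernel's sysfs cache directories directly. The sketch below assumes a Linux system that exposes `/sys/devices/system/cpu/cpuN/cache/indexM/` with the `level`, `type`, `size` and `shared_cpu_list` files; the function names are made up for illustration. It reports topology only, not how much of a cache is currently occupied.

```python
import glob
import os


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8,10-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def _read(path):
    with open(path) as fh:
        return fh.read().strip()


def read_cache_map(root="/sys/devices/system/cpu"):
    """Return a sorted list of (level, type, size, shared_cpus) tuples.

    Every CPU lists the caches it can reach, so identical caches are
    de-duplicated with a set before sorting.
    """
    caches = set()
    pattern = os.path.join(root, "cpu[0-9]*", "cache", "index[0-9]*")
    for index_dir in glob.glob(pattern):
        caches.add((
            int(_read(os.path.join(index_dir, "level"))),
            _read(os.path.join(index_dir, "type")),
            _read(os.path.join(index_dir, "size")),
            tuple(parse_cpu_list(_read(os.path.join(index_dir, "shared_cpu_list")))),
        ))
    return sorted(caches)


if __name__ == "__main__":
    for level, kind, size, cpus in read_cache_map():
        print(f"L{level} {kind:<12} {size:>8}  shared by CPUs {list(cpus)}")
```

On a multi-socket box, each L3 line should list the logical CPUs of one package, which is the set that competes for that cache.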

Spudz76 commented 6 years ago

I get strange speed variances with an E5-2620 (15360 KB cache shown in cpuinfo). These are labeled as SmartCache, so whatever goofball things that does behind the scenes may be unknowable (other than through this 'black box' probe method of trying various configs)...

I got ~262H/s running affinity 0/1/2/3/4/5 which should be the max because it's a six core w/HT and I stuck to the physical ones, but it leaves 2048KB of cache free...

So then I tossed a goofy idea in and added one more thread with affinity: false so it can roam around on the remaining cpus (assuming it naturally would avoid scheduling on the cores pegged at 100%, so the HT cores are its effective domain). Now runs ~280H/s with I suppose full cache utilization.

psychocrypt commented 6 years ago

hwloc-ls is the tool you should try

Spudz76 commented 6 years ago

^^nice, will try.

Also have odd/good results on some CPUs using the appropriate total cache / 2048 formula for the thread count but setting them all to roaming affinity, especially on Windows (not so much on Linux; pretty sure I tried both on the same machine with different OSes at least once).

kio3i0j9024vkoenio commented 6 years ago

Linux has strange layouts for processor cores compared to Windows.

I have a HP DL580 G7 server with 4x E7-4830 processors. Windows Pro was excluded because the server has four physical processors so my first attempt in Linux is with HiveOS. I have configured HiveOS for the DL580 rig to use XMR-Stak to mine on the 32 cores and 4x GTX 750 GPUs.

The following details my problems and the solution:


Hive OS (and probably Linux in general) assigns 10 cores to each processor from the E7-4800 and E7-8800 family. So Processor #0 has Cores 0 - 9, Processor #1 has Cores 10 - 19, Processor #2 has Cores 20 - 29, and Processor #3 has Cores 30 - 39. I was very confused as to why mpstat -P ALL showed 40 cores even though the 4x E7-4830 (with HT disabled) only had 32 (4x 8 cores). It turns out that Linux disables two of the cores in each set of 10 for the 8-core processors. The two cores disabled are not at the end but are cores #4 & #5 of each group. If you have 6-core processors, two more cores would be disabled.

So your configuration from the link above and what I was using would try to use non-existent cores (which caused the affinity errors) and would run more threads on some processors than what we thought and less on other processors. That is why I saw hashes of 11.5 instead of 33.5 that I was expecting.
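Before writing affinity numbers by hand, it can help to print which logical CPU numbers the kernel actually reports for each package. This is a minimal sketch that groups the `processor`, `physical id` and `core id` fields of `/proc/cpuinfo` (the same fields shown in the dump at the top of the thread); the function name is hypothetical, and how the numbering comes out depends on the BIOS and kernel.

```python
import os
from collections import defaultdict


def cpus_by_package(cpuinfo_text):
    """Map physical id -> sorted list of (core id, processor) pairs."""
    packages = defaultdict(list)
    # /proc/cpuinfo separates logical CPUs with a blank line.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields and "physical id" in fields:
            packages[int(fields["physical id"])].append(
                (int(fields.get("core id", -1)), int(fields["processor"]))
            )
    return {pkg: sorted(cores) for pkg, cores in packages.items()}


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as fh:
        for pkg, cores in sorted(cpus_by_package(fh.read()).items()):
            print(f"package {pkg}:")
            for core_id, cpu in cores:
                print(f"  core id {core_id:>2} -> logical CPU {cpu}")
```

Any logical CPU number that never appears in the output is not a valid affinity target. Note that `core id` values need not be contiguous: the dump at the top of the thread shows `core id : 25` on a package with `cpu cores : 10`.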

This is the correct CPU threads table for 8-core E7-4800's and E7-8800's (including the E7-8837):

"cpu_threads_conf" : [

    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 1  },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 2  },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 3  },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 6  },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 7  },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 8  },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 9  },

    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 10 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 11 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 12 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 13 },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 16 },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 17 },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 18 },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 19 },

    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 20 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 21 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 22 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 23 },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 26 },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 27 },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 28 },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 29 },

    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 30 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 31 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 32 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 33 },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 36 },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 37 },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 38 },
    { "low_power_mode" : true,  "no_prefetch" : true, "affine_to_cpu" : 39 },

],

Using this, the "affinity errors" went away and my overall CPU hash rate went from 1038 H/s to 1358 H/s for the processors. A 31% increase with no additional power used.
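A table like the one above is tedious to type and easy to get wrong, so it can be generated instead. The sketch below reproduces that layout under the assumptions stated in the post: four packages, a stride of 10 CPU numbers per package, single threads on offsets 0-3, double threads (`low_power_mode` true) on offsets 6-9, offsets 4 and 5 unused, and CPU 0 left free. The function and its parameters are illustrative, not part of xmr-stak.

```python
import json


def build_cpu_threads(packages=4, stride=10, single=(0, 1, 2, 3),
                      double=(6, 7, 8, 9), skip=(0,)):
    """Build cpu_threads_conf entries for the per-package layout above."""
    entries = []
    for pkg in range(packages):
        base = pkg * stride
        for offsets, low_power in ((single, False), (double, True)):
            for off in offsets:
                cpu = base + off
                if cpu in skip:
                    continue  # e.g. keep CPU 0 free for GPU/housekeeping work
                entries.append({
                    "low_power_mode": low_power,
                    "no_prefetch": True,
                    "affine_to_cpu": cpu,
                })
    return entries


if __name__ == "__main__":
    # Print in the same fragment style as the hand-written table.
    body = ",\n".join("    " + json.dumps(e) for e in build_cpu_threads())
    print('"cpu_threads_conf" : [\n' + body + ",\n],")
```

With the defaults this yields 31 entries, starting at CPU 1 and ending at CPU 39. For a machine whose CPU numbers are laid out differently, the offsets would have to come from the machine's own topology rather than from these defaults.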


DrStein99 commented 6 years ago

Does the 750 1gb get same hash as 750ti 2gb?


On my (4x) E7 4850 system, total hash I can squeeze is 1382.2. First 2 cores on each CPU are double, and the rest are single. I will try your map-plan and see what difference it makes.

This is a hunt-and-guess, trial & error method of benchmarking xmr-stak and analyzing results for hours over the course of days. 1 single thread usually returns about 30 h/s, so a 2-thread core affinity would be about 60 h/s. Once I assign 1 more core in the wrong spot, SOME of the other cores drop their hash rate, sometimes down into the 20 h/s range. I change its affinity to one of the other 20 I have to pick from, and eventually I find myself just settling for the best rate so far, since my patience for testing has run out.

I just started looking at this "hwloc-ls" command. I believe it was probably designed for a GUI. I run the server with a bash console only. I will have to read up on what this command is and what this report is giving me. I also suspect the RAM configuration could be interfering: whether I have single-, dual-, or 4-rank modules, how many of the pairs are installed, etc...

Machine (31GB total)
  NUMANode L#0 (P#0 7962MB) + Package L#0 + L3 L#0 (24MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#40)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#4)
      PU L#3 (P#44)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#8)
      PU L#5 (P#48)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#12)
      PU L#7 (P#52)
    L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
      PU L#8 (P#16)
      PU L#9 (P#56)
    L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
      PU L#10 (P#20)
      PU L#11 (P#60)
    L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
      PU L#12 (P#24)
      PU L#13 (P#64)
    L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
      PU L#14 (P#28)
      PU L#15 (P#68)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
      PU L#16 (P#32)
      PU L#17 (P#72)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
      PU L#18 (P#36)
      PU L#19 (P#76)
  NUMANode L#1 (P#1 8061MB) + Package L#1 + L3 L#1 (24MB)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
      PU L#20 (P#1)
      PU L#21 (P#41)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
      PU L#22 (P#5)
      PU L#23 (P#45)
    L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
      PU L#24 (P#9)
      PU L#25 (P#49)
    L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
      PU L#26 (P#13)
      PU L#27 (P#53)
    L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
      PU L#28 (P#17)
      PU L#29 (P#57)
    L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
      PU L#30 (P#21)
      PU L#31 (P#61)
    L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
      PU L#32 (P#25)
      PU L#33 (P#65)
    L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
      PU L#34 (P#29)
      PU L#35 (P#69)
    L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
      PU L#36 (P#33)
      PU L#37 (P#73)
    L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
      PU L#38 (P#37)
      PU L#39 (P#77)
  NUMANode L#2 (P#2 8061MB) + Package L#2 + L3 L#2 (24MB)
    L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
      PU L#40 (P#2)
      PU L#41 (P#42)
    L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
      PU L#42 (P#6)
      PU L#43 (P#46)
    L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
      PU L#44 (P#10)
      PU L#45 (P#50)
    L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
      PU L#46 (P#14)
      PU L#47 (P#54)
    L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
      PU L#48 (P#18)
      PU L#49 (P#58)
    L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
      PU L#50 (P#22)
      PU L#51 (P#62)
    L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
      PU L#52 (P#26)
      PU L#53 (P#66)
    L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
      PU L#54 (P#30)
      PU L#55 (P#70)
    L2 L#28 (256KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
      PU L#56 (P#34)
      PU L#57 (P#74)
    L2 L#29 (256KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
      PU L#58 (P#38)
      PU L#59 (P#78)
  NUMANode L#3 (P#3 8060MB) + Package L#3 + L3 L#3 (24MB)
    L2 L#30 (256KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
      PU L#60 (P#3)
      PU L#61 (P#43)
    L2 L#31 (256KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
      PU L#62 (P#7)
      PU L#63 (P#47)
    L2 L#32 (256KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32
      PU L#64 (P#11)
      PU L#65 (P#51)
    L2 L#33 (256KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33
      PU L#66 (P#15)
      PU L#67 (P#55)
    L2 L#34 (256KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34
      PU L#68 (P#19)
      PU L#69 (P#59)
    L2 L#35 (256KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35
      PU L#70 (P#23)
      PU L#71 (P#63)
    L2 L#36 (256KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36
      PU L#72 (P#27)
      PU L#73 (P#67)
    L2 L#37 (256KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37
      PU L#74 (P#31)
      PU L#75 (P#71)
    L2 L#38 (256KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
      PU L#76 (P#35)
      PU L#77 (P#75)
    L2 L#39 (256KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
      PU L#78 (P#39)
      PU L#79 (P#79)
  HostBridge L#0
    PCIBridge
      PCI 14e4:1639
        Net L#0 "eno1"
      PCI 14e4:1639
        Net L#1 "eno2"
    PCIBridge
      PCIBridge
        PCIBridge
          PCI 1000:0079
            Block(Disk) L#2 "sda"
    PCIBridge
      PCI 102b:0532
    PCI 8086:3a20
      Block(Removable Media Device) L#3 "sr0"

This output is a little cryptic, but it gives me something else to look at, compared to "cat /proc/cpuinfo".
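In the listing above, `L#` is hwloc's own logical index and `P#` is the index the operating system uses, so the `P#` on each `PU` line is the logical CPU number the OS knows that hardware thread by. The two `PU` lines under one `Core` are hyperthread siblings. A small parser makes the grouping easier to read; this is a sketch that assumes the exact text layout shown above, with `SAMPLE` being a short excerpt of it.

```python
import re
from collections import defaultdict

PACKAGE_RE = re.compile(r"Package L#(\d+)")
PU_RE = re.compile(r"PU L#\d+ \(P#(\d+)\)")

# Short excerpt of the hwloc-ls output above, used as sample input.
SAMPLE = """\
Machine (31GB total)
  NUMANode L#0 (P#0 7962MB) + Package L#0 + L3 L#0 (24MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#40)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#4)
      PU L#3 (P#44)
  NUMANode L#1 (P#1 8061MB) + Package L#1 + L3 L#1 (24MB)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
      PU L#20 (P#1)
      PU L#21 (P#41)
"""


def pus_by_package(hwloc_text):
    """Map package index -> list of OS CPU numbers (the P# of each PU)."""
    result = defaultdict(list)
    current = None
    for line in hwloc_text.splitlines():
        pkg = PACKAGE_RE.search(line)
        if pkg:
            current = int(pkg.group(1))
            continue
        pu = PU_RE.search(line)
        if pu and current is not None:
            result[current].append(int(pu.group(1)))
    return dict(result)


if __name__ == "__main__":
    for pkg, cpus in sorted(pus_by_package(SAMPLE).items()):
        print(f"package {pkg}: OS CPUs {cpus}")
```

Read this way, the full listing shows that on this machine package 0 holds OS CPUs 0, 4, 8, ... 36 plus their siblings 40, 44, ... 76. The CPU numbers are interleaved across the four packages rather than grouped in blocks of ten.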

kio3i0j9024vkoenio commented 6 years ago

On my HP DL580 G7 with 4x E7-4830's I first turned off Hyper-threading in the BIOS. That way only REAL cores are seen and used. As stated above, even though the E7-4830 has only eight cores and 4 times 8 is 32, Linux shows 40 cores numbered 0-39, with 10 cores per physical processor and cores #4 & #5 of each set of 10 disabled. That is why you see those cores missing in my setup.

http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E7-4830%20AT80615007089AA%20(BX80615E74830).html

Your E7-4870's have ten cores each so you do have all 40 cores.

http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E7-4870%20AT80615007263AA%20(BX80615E74870).html

With 4x E7-4830's and the above setup I was getting 1358 H/s. I then swapped out all the E7-4830's for E7-8837's and now do 1640 H/s with the same configuration.

http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E7-8837.html

I want to point out this blog, where a fellow miner had lots of struggles with the E7-4870, never got good hashing results, and then switched to E7-8837's.

http://www.cointainer.life/2018/03/10/say-l3-cache-king

The reason that the E7-8837 are so strong even compared to the E7-4870 is that the 8 cores of the E7-8837 are clocked at 2.8 GHz fully loaded whereas the E7-4870 with 8 or 10 cores fully loaded only runs at 2.53 GHz. The math shows that if you get all 10 cores on the E7-4870 working perfectly you would only be 13% faster than the E7-8837's.

8 × 2.8 = 22.4

10 × 2.53 = 25.3

(25.3 / 22.4) - 1 = 12.95%
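The same comparison can be written as a few lines of code so that other core counts and clocks are easy to plug in. Cores multiplied by all-core clock is only a crude proxy for hash rate (it ignores cache and memory effects), and the function name is made up for this sketch.

```python
def aggregate_ghz(cores, ghz_per_core):
    """Cores times all-core clock: a rough proxy, not a hash-rate model."""
    return cores * ghz_per_core


e7_8837 = aggregate_ghz(8, 2.8)    # 8 cores at 2.8 GHz
e7_4870 = aggregate_ghz(10, 2.53)  # 10 cores at 2.53 GHz
advantage = e7_4870 / e7_8837 - 1

print(f"E7-8837: {e7_8837:.1f}  E7-4870: {e7_4870:.1f}  advantage: {advantage:.2%}")
```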

DrStein99 commented 6 years ago

I was actually struggling with the core clocks too. I read, once I disable cores in BIOS - the CPU boost / turbo MHZ would be re-distributed to the cores in use.

Thanks for the blog link; that explains a lot.

kio3i0j9024vkoenio commented 6 years ago

It is not a good idea to let the mining threads auto-arrange themselves. It is best to make explicit assignments via the CPU config file.

The only thing to disable in the BIOS is hyper-threading, so that only real cores are exposed.

I reread your posts and I see that you do not have an E7-4870 but an E7-4850, so you have 10 real cores and a turbo clock of 2.133 GHz with 10 cores running full out.

http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E7-4850%20AT80615007449AA.html

Note in the above link: Maximum turbo frequency | 2400 MHz (2 cores) 2133 MHz (10 cores)

You have the optimal settings with 2 double threads (4MB L3 each) and 8 single threads (2MB L3 each). That adds up to 24MB L3 cache used which is what the E7-4850 has.
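The cache budget in that plan is simple arithmetic, and a small check makes it easy to try other thread mixes before benchmarking them. The sketch assumes the per-thread footprints quoted above (2MB of L3 for a single thread, 4MB for a double thread); the helper names are hypothetical.

```python
SCRATCHPAD_MB = 2  # L3 footprint per hash, as quoted above


def l3_usage_mb(single_threads, double_threads):
    """L3 needed by a mix of single (2MB) and double (4MB) threads."""
    return single_threads * SCRATCHPAD_MB + double_threads * 2 * SCRATCHPAD_MB


def fits(single_threads, double_threads, l3_mb):
    """True if the thread mix stays within one package's L3 size."""
    return l3_usage_mb(single_threads, double_threads) <= l3_mb


# 2 double + 8 single threads on a 24MB L3, as described above.
print(l3_usage_mb(single_threads=8, double_threads=2), fits(8, 2, 24))
```

The 8-core table earlier in the thread (four single plus four double threads per package) comes to the same 24MB per package by this count.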

The only change I do is put the double threads at the end and not the front that way if the system needs to do some housekeeping or if you also mine with GPU's they tend to use lower core numbers. In fact if I do mine with GPUs I leave core 0 free as my config above shows.

Your hash rate of 1382 seems to be about the best you should get, because of the low clock rate of 2.133 GHz on those 10 cores. Sometimes it is better to have a processor with fewer cores if it clocks higher.

My E7-4830's only had 8 cores but they were clocked at 2.267 GHz and produce 1358 H/s using 31 of the 32 cores.

If you want the best hash rate then swap out the E7-4850's for E7-8837's. You should then get around 1650 H/s.

If you do plan to do the swap it is a good idea to flash the system BIOS to the latest version so that it can boot with the E7-8837's.

On eBay I bought 20 E7-8837's for $218 to fill five HP DL-580 G7 systems. That works out to be $43.60 for 4x E7-8837's for each system.

Link to eBay E7-8837 auctions: https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=Xeon+E7-8837&_sop=15

DrStein99 commented 6 years ago

I "made offer" to person selling (4) 8837's, for $30 entire dollars. Probably the cheapest 300h/s I will ever find. They will replace my 4850's. Unfortunately there are only like 4 systems that can accept (4) E7 4800 CPU'S that I must snipe for the lunch money I expect to pay. Sometimes they put up for sale with only (2) cpu's & heat-sinks, and the 2 heat-sinks on Dell cost $50 each! Yuk.


I updated Dell bios to new version, and surprise - now I am losing 100h/s on the 4870 rig. Awesome. Now I know why it was probably sold to me from eBay with an older bios.


DL580 is an enormous monster! I would have to run extra breakers to my data room if I wanted to buy a few of those. And then consider routing my HVAC through them to heat my whole house from that exhaust.

kio3i0j9024vkoenio commented 6 years ago

Q: Does the 750 1gb get same hash as 750ti 2gb?

The memory size makes no difference. 1GB or 2GB for the same card results in no hash rate change.

However there is a difference in hashes between the 750 and the 750 Ti. I have two name brand 750 Ti's and they do about 290 H/s (these are overclocked).

On the 750's you should get about 240 H/s for name brand ones. However, the ones I have bought from China are a mixed bag. With the Asus ones I get 230 H/s; with the other unknown brands I get anywhere from 210 all the way down to 170 H/s. I expect that some of these have slower memory (DDR3 instead of GDDR5).

If you plan to buy, you first need to be sure you really are buying a 750 or 750 Ti. If you see 192bit or a 6-pin power connector in the auction pictures or description, then avoid it at all costs, as those are fake Fermi rebrands and not the Maxwell that the 750/750 Ti was made from.

Fake GTX 750 Ti graphics card scam on eBay https://www.youtube.com/watch?v=joVGTjB70dQ

kio3i0j9024vkoenio commented 6 years ago

Q: DL580 is an enormous monster! I would have to run extra breakers to my data room if I wanted to buy a few of those. And then consider routing my HVAC through them to heat my whole house from that exhaust.


It is big but it also comes with a boatload of PCIe slots: 11 in total. Right now I am running only one DL580 (I plan to run three) on a dedicated 20 amp circuit. The power supplies can take 240 volts, so I plan to run a 240 volt circuit just for my miners. Doing so will also improve my power usage, as it will be balanced across both power legs and the current is lower at the plug. A perk is that the actual power supplies can deliver more watts efficiently. The 1200 watt DL580 power supplies can only do 900 watts at 120 VAC but the full 1200 watts at 240 VAC.

DrStein99 commented 6 years ago

How are your DIMMS setup on the DL580's? My Dell servers are throwing out "memory config not optimal" since I only have 2 DIMMS in each of the 8 slots. According to what I read so far, it changes how the memory is shared between processors, and could explain this cache memory sharing. I do not yet have 32 of the same type DIMMS to populate all my banks and test.

kio3i0j9024vkoenio commented 6 years ago

The HP DL580 G7 is not as finicky as Dell Servers are. Dell will disallow what other servers and Intel allows.

The DL580 G7 uses memory cartridges. Each cartridge can hold up to 8 registered DIMMS. There are two cartridges per processor but only one is required. Each installed cartridge must have a minimum of 2 DIMMS. DIMMS only run at 1067 MHz so PC3-8500R is all I need.

So my configuration is four cartridges (one per processor) containing two 2GB PC3-8500R DIMMS for a total of 16GB system memory.

HP made a beast of a server when it comes to memory. The maximum is 2048 GB of system memory (32GB DIMMS times 8 per cartridge times 8 cartridges).

I save 120 watts of power by only using four cartridges. Each cartridge burns 30 watts just by itself.

Take a look at the heatsinks inside a cartridge:

https://www.ebay.com/itm/HP-644172-B21-DL580G7-DL980G7-E7-Memory-Cartridge-647058-001-650761-001/302106091299

DrStein99 commented 6 years ago

I've got 4gb DIMMs, PC3L 2rx4. The other system, where the 4870's are installed, has 4rx8 quad-rank DIMMs and gives me different bios memory configuration options. I have options for memory interleaving, which (according to the guide) shares memory between processors. This is the problem as I see it, since I do not believe the bios option on the Dell presents itself until I populate the other banks of DIMMs.

kio3i0j9024vkoenio commented 6 years ago

On your server make sure you enable NUMA memory mode. Any other things like interleaving or NODE sharing should be disabled along with memory mirroring or spare memory.

You mentioned earlier that there are only four systems that can use E7-4800 series processors. The HP DL580 G7 is one and I believe you have a Dell R810. What are the other two systems?

DrStein99 commented 6 years ago

I have (2) r810's: (1) with 4870's and (1) with 4850's. Both have only 8 DIMM memory modules in them, and one of them has quad-rank DIMMs, which presents different memory options in the bios. I get the "memory config not optimal" error on both machines. I also get that error on my HP DL360e gen-8 server, since there are only (4) DIMMs of 4gb loaded on that too, but there I do not get the cache-memory sharing mystery; then again, there are only (2) E5 2450 CPU's on that rig.

I have a Dell t7600 workstation, I use at my desk to control everything else, with (2) e5 2670's, 32gb ram - and no issues sharing l3 cache on this one either, but it has 24mb cache - 2 processors and 32 cores total (with the hyperthreading on) so spreading out the affinities on this system was only a little tricky. I get about 980 h/s on this one, that quickly drops when I start using web browser or anything else.

A dell d7500, which I was going to use for GPU's, since it has 3 useable double full length x16 size pci-e slots. I got confused by how big a pci slot I needed for whatever GPU I picked out, so I just got the cheapest system with as many x16 size slots as possible. For the whole $138 I paid with shipping, it's a big beast that isn't rack mountable, but it is easy for me to open the hood and change stuff. Cooling isn't great compared to the rack systems with their hurricane type of airflow. This only has a crappy xeon 5650 that produces a whopping 245h/s. I have a gtx 1050 ti and a gtx 960. I am holding off on buying another GPU for that, since the two gpu's are making so little on every different coin / algo I can test. Those 2 cards I think produce something like 200 h/s on xmr.

A HP dl380 g7 with a crappy xeon 5660, does like 220 or 240 h/s - is not really a miner, I am building a nas/san appliance out of it that mines on idle.

2 more core i5 desktop machines that get about 200 h/s. I proxy-connect the 200 h/s machines to mine xmr, since Knife-Crash simply laughs "socket disconnect" at me for anything under 600 h/s - and won't allow me to connect via a proxy to combine the machines.

kio3i0j9024vkoenio commented 6 years ago

I guess my question was misunderstood.

You mentioned earlier that there are only four systems that can use E7-4800/E7-8800 series processors.

The HP DL580 G7 is one and I believe you have a Dell R810.

What are the other two systems that can use E7-4800/E7-8800 series processors?

DrStein99 commented 6 years ago

I have Dell r810's. There are only 2 other systems I know of which use the same E7 socket series, and with both of them you have to watch out in case the motherboard is set up for the earlier generation (E7550, not the E7-4870, which is Westmere-EX).

IBM makes a x3950 g5 that is much like the HP dl580, is a big fat 5U with the memory cartridges. It's a huge space waster, since it is enormous, and you can't put the double full length GPU cards in the many PCI expansion slots.

Dell also makes a r910 which is a 5u, and it's much like the same IBM x3950 g5, where it is huge, takes the memory cartridges, and can not fit full length PCIe cards.

Then there are BLADE systems, which are far less expensive, except you have to find the main frame housings, which cost a fortune to ship, usually via semi-truck freight, since they have to be crated because they are the size of an entire half-height rack cabinet. Blade units are too small to fit GPU cards in; each is a whole system in a blade cartridge that loads into the frame, with even more stuff all linked, like a billion network jacks and management monitoring cards. The blades themselves (HP and DELL) are less expensive than buying each independent rack server, but you will need to put a sub-panel breaker box in whatever room you intend to draw at least 220v electric from. I have not calculated the max watt/amp current draw on a fully loaded blade frame; it is more than I am willing to get into until I find a lunch-money deal on the barebones frame and power supplies. I can not calculate the cost of the power filters and battery backup you would need to run that in the event your house electric becomes dirty, and then you would have to claim on your insurance for the cost of losing 10+ blades in your home main frame due to an unexpected brown-out. In my case, I have to insure my hardware at new costs, since it is actually impossible for me to replace my hardware for what I sniped at the lowest used costs on eBay - there are simply no more available to buy.

If you're interested in seeing them, check around your area if you live in a city: find a co-location data center and ask them for a tour. You can see floors of racks stocked with all this stuff from floor to ceiling, aisles deep. The one by me in Philadelphia has floors for Verizon and Comcast. It's 10-gigabit speed direct to the internet there, so you can rent a rack and put your own hardware in it, covered by their guaranteed electric uptime, without having to break your own bank over the cost of buying battery backup and generators, or melting your house wires.


I got the 8837's and put them in my Dell r810. It took me all of 15 entire minutes after install to get the 1,600 h/s. Far less effort than playing with these RIDICULOUS 4870's. With those, if I swap the memory or change any bios setting, I have to re-do the xmr-stak affinities all over again, and I can never seem to get the hash rate above 1560.

kio3i0j9024vkoenio commented 6 years ago

Thanks for the details and analysis on each of these systems.

DrStein99 commented 6 years ago

I am still trying to research the issue with the E7-4870. The material on this is tough to gather, since much of it goes into a very lengthy explanation and many of the systems involved go over my head.

Looking closer at the 8837 - it was designed for 8-cpu rigs. I have only found 1 system that supports 8 cpu's, and obviously tried to go buy that one. Unfortunately, it was super expensive and I found only 1 that had sold. I lost my notes, but from what I recall it was an HP. Because of the 8-cpu design, I feel those cache memory paths are organized a different way -vs- the 4870. They do make 8870 e7's, which I have not tested and which are rare - and I do not know if they could perform the same great way the 8837's do.

There are SMI links, and also complicated MEMORY configurations. On my DELL systems, I have DIFFERENT bios options depending on HOW my memory is installed, which RANKS are installed, what type, etc... There are at least 4 different options, that I can only test if I have like 32 of every type of PC3 ram DIMM that is made. I am also getting "LOCKSTEP" error bits on my diagnostics too - which could also contribute to the mystery.

There are also a whole bunch of other CPU's with much greater power, which are too expensive to take a gamble to test for mining purposes only. I am trying to reach out to my data-center friends to see if any of them have servers I can test in their downtime.