Mysticial / y-cruncher

Bug-Tracking and open-sourced parts of y-cruncher.
203 stars 15 forks source link

Node interleaving I/O buffer allocation in new RAID implementation results in memory corruption #21

Closed as-com closed 4 years ago

as-com commented 4 years ago

Steps to reproduce:

  1. Run I/O performance analysis on a dual socket system (?) running Linux (?)
  2. Set memory to desired size (I used small amounts like 1 GB)
  3. Ensure I/O buffer allocator is set to node interleaving
  4. Run it

The resulting run may have the following behaviors:

I have reproduced this issue with y-cruncher v0.7.8 on two different dual socket Westmere systems running different Linux distributions and kernel versions. This issue does not occur when the mmap allocator is used for I/O buffers,

Mysticial commented 4 years ago

Reproduced. Thanks for filing this!

Mysticial commented 4 years ago

I'm noticing that it also only happens when Cilk Plus is used. If you switch to anything else (including TBB), the issue goes away. But I still do not know the root cause.

as-com commented 4 years ago

I'm noticing it happening with Push Pool and multithreading disabled. You may need to tweak around the RAID configuration since it's... that kind of bug.

Mysticial commented 4 years ago

Root cause found. The Linux "NodeInterleaving" allocators were allocating the wrong amount of memory. Thus - buffer overrun and data corruption.

The code is correct in Windows where it was originally written in. But when ported to Linux, some bad copy-pasting led to the wrong size being used for the call to mmap(). This incorrect size doesn't take into account for the increased alignment that's required by the I/O buffers.

This slipped through testing because I don't have a dual-socket machine and I generally don't test Linux as heavily as Windows.

One line change. This will be fixed in the next patch. Thank you very much for reporting this!

EDIT: Only the new Disk Raid 0 requires greater than page alignment. The legacy Raid0+3 does not. Therefore this buffer overrun doesn't happen with the old code.

as-com commented 4 years ago

Closing as this is fixed in the latest release