benchmark-subsetting / cere

CERE: Codelet Extractor and REplayer
https://benchmark-subsetting.github.io/cere/
GNU Lesser General Public License v3.0

OpenMP NUMA first touch replay does not exactly reproduce the original behavior #190

Open mihailpopov opened 6 years ago

mihailpopov commented 6 years ago

On NUMA systems, pages are allocated with the lazy first-touch policy: a page is mapped to the NUMA domain of the thread that first touches it. To replay codelets faithfully on NUMA systems, CERE must map the pages as they were in the original run.

At replay, CERE uses an OpenMP region to touch previously recorded pages with strncpy. While this method is more faithful to the original run than just touching all the pages from a serial region of code, it does not always faithfully reproduce the original mapping.
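To make the replay scheme concrete, here is a minimal sketch of a first-touch warmup driven by recorded (page, thread) pairs. It is not CERE's code: the `ft_record` struct and `replay_first_touch` are hypothetical names, and the touch is a one-byte read followed by a write-back instead of the strncpy call CERE uses. It only influences placement for pages that are not yet mapped when the warmup runs.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical record: one page and the id of the thread that first touched it. */
struct ft_record {
    void *page;
    int   tid;
};

/* Each thread touches only the pages recorded for its own id.
 * Returns the number of pages touched. Without OpenMP the region runs
 * serially as thread 0, so only records with tid == 0 are touched. */
static size_t replay_first_touch(const struct ft_record *recs, size_t n, int nthreads)
{
    assert(recs != NULL || n == 0);
    assert(nthreads > 0);
    size_t touched = 0;

    #pragma omp parallel num_threads(nthreads) reduction(+:touched)
    {
#ifdef _OPENMP
        int me = omp_get_thread_num();
#else
        int me = 0;
#endif
        for (size_t i = 0; i < n; i++) {
            if (recs[i].tid != me)
                continue;
            volatile char *p = (volatile char *)recs[i].page;
            *p = *p;   /* read one byte and write it back: contents are unchanged */
            touched++;
        }
    }
    return touched;
}
```

Records whose thread id is not present in the team (for example when fewer threads are available at replay than at capture) are silently skipped in this sketch, which is one way a replayed mapping can diverge from the original.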

We did the following test to show that the current NUMA mapping is not correct. We focus on the parallel region rhs from SP OMP over 4 NUMA nodes. We consider 2 versions. First, we use a first touch file where all the pages are touched by the same master thread. Second, we unset CERE_FIRST_TOUCH to touch all the pages within a serial region. These two versions should have the same performance. Yet, the first is 25% faster.

A solution to address this issue is to use libnuma. In particular, the function "numa_move_pages" moves a page to a specific NUMA domain. This function can also be used to check the actual allocation of a page.
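As an illustration of the "check the actual allocation" use, the sketch below asks the kernel which NUMA node currently holds a page. To avoid linking against libnuma it calls the Linux move_pages system call directly, with a NULL node list so that nothing is moved; libnuma's numa_move_pages takes the same arguments. The helper names `page_base` and `page_numa_node` are made up for this example, and the call can fail (for instance on kernels without NUMA support or inside restricted containers), in which case a negative errno value is returned.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Round an address down to the start of its page (page_size must be a power of two). */
static void *page_base(const void *addr, long page_size)
{
    assert(page_size > 0 && (page_size & (page_size - 1)) == 0);
    return (void *)((uintptr_t)addr & ~((uintptr_t)page_size - 1));
}

/* Query (without moving) the NUMA node of the page containing addr.
 * Returns the node id (>= 0), or a negative errno value if the query fails
 * or the page is not mapped yet. Linux only. */
static int page_numa_node(const void *addr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *pages[1] = { page_base(addr, page_size) };
    int status[1] = { -1 };

    /* pid 0 = calling process; nodes == NULL = report location only. */
    long rc = syscall(SYS_move_pages, 0, 1UL, pages, NULL, status, 0);
    if (rc < 0)
        return -errno;
    return status[0];
}
```

Calling `page_numa_node` on each recorded page after the warmup and comparing the result with the expected node would show which pages ended up in the wrong domain.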

pablooliveira commented 6 years ago

Thanks for the report! Do you know why the OpenMP region that touches pages is not always exact?

I'm not sure I see why numa_move_pages would be more accurate.

mihailpopov commented 6 years ago

I think that the issue is how/when/over which pages we currently call strncpy to perform the first touch.

Using libnuma just provides more information (where the page is) and can actually move a page even if it was already touched before. So, I agree with you: if a page was not touched before, there is no difference between using strncpy and libnuma. I tested libnuma by mapping the pages on the second codelet iteration replay and got the same execution time as when we had unset CERE_FIRST_TOUCH.

In the current replay version, a thread touches its own pages with the call:

//strncpy(dest,src,len);
strncpy(buff,(char *)(address + read_bytes),PAGESIZE);

Where (char *)(address + read_bytes) is the address of the page that the function call touches.

At first, I thought that this call was wrong and that the function should instead be called as: strncpy((char *)(address + read_bytes),buff,PAGESIZE); since buff already contains the data from memdump. However, I wrote a quick checksum program and tested the two versions: the current replay returned the correct value but the new one did not.
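One property of strncpy is worth keeping in mind when comparing the two argument orders: it is a string function, so it stops copying at the first NUL byte in the source and fills the rest of the destination with NULs. Whether this is what broke the checksum in the swapped version has not been established here; the sketch below only shows the behavior on binary data, using made-up helper names and a 16-byte stand-in for a page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simple byte-wise sum, standing in for the "quick checksum program". */
static unsigned long checksum(const unsigned char *p, size_t n)
{
    assert(p != NULL || n == 0);
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Copy n bytes with strncpy: stops at the first NUL in src and
 * zero-fills the remainder of dst. */
static void copy_with_strncpy(unsigned char *dst, const unsigned char *src, size_t n)
{
    strncpy((char *)dst, (const char *)src, n);
}

/* Copy n bytes with memcpy: every byte is copied, NUL or not. */
static void copy_with_memcpy(unsigned char *dst, const unsigned char *src, size_t n)
{
    memcpy(dst, src, n);
}
```

With a source buffer such as {1, 2, 0, 3, 4, ...}, `copy_with_strncpy` produces {1, 2, 0, 0, 0, ...} while `copy_with_memcpy` reproduces the buffer exactly, so the two destinations have different checksums.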

To summarize, I think that fixing this issue with strncpy alone is the best solution, but it requires understanding why pages were touched differently. On the other hand, libnuma can faithfully assign pages to NUMA domains but introduces both an overhead (due to page migration) and a library dependency. So libnuma is the quick fix.

pablooliveira commented 6 years ago

Thanks for the feedback :-) I agree with your analysis. If you want to contribute a PR, you are more than welcome!

mihailpopov commented 6 years ago

Here is an update on this bug.

Currently, CERE NUMA behavior has two issues:

  1. Some pages are missing in the First Touch (FT) file
  2. The thread reported by CERE FT is wrong for heap allocated pages

Missing pages

First-touch pages are currently dumped right before the parallel region that we capture. So, if a page is touched for the first time inside the parallel region by thread 1, it does not appear in the FT file and CERE does not consider it in its FT warmup process. This is an issue because CERE still touches all the pages at replay: the page is therefore mapped to thread 0 instead of thread 1.

To address this issue, simply perform an FT dump at the end of the capture process (or capture invocation 2, as long as the same pages are touched across different invocations).

Wrong first touch page information

Data are allocated in different ways: malloc, stack, fixed addresses in binary (static)...
For heap allocated pages, the FT thread reported by CERE is wrong. Here are the details:

The capture performs two full memory locks, one at the start of the application and a second right before the parallel region. The first lock helps to identify when a page is accessed for the first time by a thread while the second is used for the dumping process. However, heap allocated pages are not yet allocated at the start of the application: the first lock misses them. Therefore, there can be undetected accesses to these pages before the parallel region.
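The sketch below illustrates the general idea of detecting a first access by locking memory. It assumes that "lock" means removing access permissions so that the first access faults; CERE does this through its tracer, whereas this stand-in works in-process with mprotect and a SIGSEGV handler, and all names in it are hypothetical. It is Linux-specific and counts faults instead of recording thread ids.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static uintptr_t lock_start, lock_end;        /* locked range [start, end) */
static long lock_page_size;
static volatile sig_atomic_t fault_count;     /* first accesses seen so far */
static volatile uintptr_t last_fault_page;    /* page of the latest first access */

/* On the first access to a locked page: record it, then unlock that page
 * so the faulting instruction can be restarted and succeed. */
static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    if (addr < lock_start || addr >= lock_end) {
        signal(sig, SIG_DFL);   /* genuine crash: let the retry kill the process */
        return;
    }
    uintptr_t page = addr & ~((uintptr_t)lock_page_size - 1);
    last_fault_page = page;
    fault_count++;
    /* mprotect is not formally async-signal-safe; it is used here as a plain syscall. */
    if (mprotect((void *)page, (size_t)lock_page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

/* Lock a page-aligned range so that the next access to each page faults once. */
static int lock_range(void *start, size_t len)
{
    lock_page_size = sysconf(_SC_PAGESIZE);
    assert(len > 0);
    assert(((uintptr_t)start & ((uintptr_t)lock_page_size - 1)) == 0);

    struct sigaction sa = {0};
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_fault;
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;

    lock_start = (uintptr_t)start;
    lock_end = lock_start + len;
    return mprotect(start, len, PROT_NONE);
}
```

Memory mapped after `lock_range` returns is not covered by the lock, which mirrors the situation described above: heap pages that do not exist yet at the first lock can be accessed without being noticed.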

To address this issue, CERE must override all the memory allocating functions so that heap allocated pages are locked as soon as they are allocated. In particular, mtrace (in tracee.c) must be activated. Not all pages allocated by malloc should be locked: use /proc/pid/maps to identify which address ranges must be avoided. A single read of /proc/pid/maps at the beginning of the application can detect these ranges. Then, tracer_lock_range must not be called with pages within these ranges.
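A rough sketch of the /proc/pid/maps step follows. It parses the address range and permissions from each line and provides an overlap test that could be applied before calling tracer_lock_range. The struct and function names are hypothetical, and the sketch does not decide which mappings must be avoided, since that choice is not spelled out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One mapping from /proc/<pid>/maps: [start, end) plus its "rwxp" flags. */
struct addr_range {
    uintptr_t start;
    uintptr_t end;
    char perms[5];
};

/* Parse a line such as "00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat". */
static bool parse_maps_line(const char *line, struct addr_range *out)
{
    assert(line != NULL && out != NULL);
    unsigned long start, end;
    char perms[5];
    if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
        return false;
    out->start = (uintptr_t)start;
    out->end = (uintptr_t)end;
    memcpy(out->perms, perms, sizeof perms);
    return true;
}

/* True if [start, end) intersects the mapping r. */
static bool ranges_overlap(uintptr_t start, uintptr_t end, const struct addr_range *r)
{
    return start < r->end && r->start < end;
}

/* Read up to max mappings of the current process; returns how many were stored.
 * Lines longer than the buffer are split by fgets; their tails normally fail to parse. */
static size_t read_self_maps(struct addr_range *out, size_t max)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return 0;
    char line[512];
    size_t n = 0;
    while (n < max && fgets(line, sizeof line, f) != NULL)
        if (parse_maps_line(line, &out[n]))
            n++;
    fclose(f);
    return n;
}
```

A caller would take the snapshot once at startup with `read_self_maps`, keep the ranges it wants to protect from locking, and skip any malloc'd block for which `ranges_overlap` returns true.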