charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0
207 stars 50 forks source link

Entry tag [inline] is unable to optimize away most of the overhead #921

Closed ericjbohm closed 7 months ago

ericjbohm commented 8 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/921


I've been studying a fine grained code which attempts to use [inline] to reduce the overhead of computing across elements when they are known to be co-resident on the PE.

However, [inline] performs over an order of magnitude worse than if(proxy[i]->ckLocal()){ user function } else { proxy[i].entry()}

In the case of a marshaled call we go through the full process of constructing the message before we even test whether the target can be inlined. Then execution goes through CkSend with CK_SEND_INLINE set so that we ultimately land in CkSendMsgInline. Which will unpack the data we just packed, and call the function through its entry wrapper, with tracing disabled. In terms of runtime system overhead, the only real benefit to this scheme is that we don't take a trip through the scheduler. We're still incurring message pack and unpack overhead for data which doesn't actually have to go anywhere.

Is there a good reason why we need to abide by that restriction? Why can't we generate code which will test locality and make a function call to the user code, and only hit the rest of this overhead in the else case?

nikhil-jain commented 5 years ago

Original date: 2015-12-14 18:28:32


Another place we saw this being an issue is ROSS. Replacing an inline call by a pointer saving + C++ style call gave us upto 15% performance boost. (PHOLD benchmark is really sensitive for low remote case).

ericjbohm commented 5 years ago

Original date: 2015-12-15 19:46:17


On further study [inline] is stuck with the copy implementation under current semantics. Pass by value means that a copy must be made. It could be made slightly more efficient by not going through the full message construction process, but you are going to suffer copy overhead for [inline] unless we change what it means.

PhilMiller commented 5 years ago

Original date: 2015-12-15 19:54:28


I don't think that's strictly true. If the receiver takes a const T& then we can pass the object directly through. I think a wrapper to guarantee value semantics on the caller side (pass by const reference if possible, pass by value of a copy otherwise) shouldn't be too hard. Even in the non-const case, that saves us a copy in the [inline] std::vector case, since we would avoid copying data into a message and then immediately back out - it would go from the original vector passed by the caller, to the argument vector passed to the callee.

PhilMiller commented 5 years ago

Original date: 2015-12-15 20:05:40


Here's a rough draft of the proposed wrapper (untested code):

template <typename T>
struct const_or_copy
{
  const T& t;
  const_or_copy(const T& arg) : t(arg) { }
  operator T() { return t; }
  operator const T&() { return t; }
};

If we stick that around each argument that gets passed through, I think we'd be set for any parameter that the user's function ultimately declares as const.

ericjbohm commented 5 years ago

Original date: 2015-12-15 20:23:54


Good point on use of const as a way to support pass by reference when the user function takes const input. I think that allows the optimization to go forward without creating debugging confusion.

PhilMiller commented 5 years ago

Original date: 2015-12-18 21:41:14


So, looking at the RTS code on the delivery path, there is at least some good reason that we go to the trouble - tracing and LB accounting. If we don't want to screw up other functionality, we'll need to refactor things in the messaging infrastructure so that generated code can make that happen properly.

It looks like some of that very same instrumentation is only activated when not doing inline delivery - message creation tracing and the accounting in the LB database are only called if (type == CkDeliver_queue). That may just be to avoid incorrectly recording message send twice, though.

On further investigation, part of the necessary refactoring seems to have been done, for ChaNGa using [local] entry methods, based on comments around CkMigratable::timingBeforeCall and CkMigratable::timingAfterCall and the code in xi::Entry::genArrayDefs().

PhilMiller commented 5 years ago

Original date: 2016-01-03 03:27:09


On further analysis, I think the right thing to do for now may actually be to have the proxy's method take all arguments by const& and to always pass those along unmolested. This automatically handles the following concerns without spurious copies or instance constructions:

The cases this won't handle are as follows:

The move optimization can be deferred to later, since it's not necessary for the common copy elisions we're looking at right now.

The failure on non-const references may be bad, because it could force a copy on other cases. The solution to those is that the method should take its parameter by value, then the compiler-/charmxi-generated code will move() from the temporary or copy as needed.

We can test for move/rvalue support with

#if defined(__cpp_rvalue_references) && __cpp_rvalue_references >= 200610

And if it's not there, then just leave off the std::move wrapping arguments in the temporary case.

This means that we may need to distinguish [local] entry methods from others, since those should allow non-const references to pass through.

PhilMiller commented 5 years ago

Original date: 2016-01-04 02:36:19


So, there's some apparent differences in how various code wrapping array element entry method invocation works between [local] (implemented by charmxi) and [inline] (implemented across various bits of RTS) entry methods. Here's the order in which they each push various stuff on the stack, spread over many different functions currently:

[local] generated code             [inline] RTS code for array          BOC (inline/not)  deliver Chare msg   [inline] chare
                                   LB suspend current running object    LB suspend
                                   Grid Q
                                   LB start timing
Trace execution / Appwork          Trace execution / Appwork            Trace             Trace               (never trace?!)
LB suspend, start timing
Debug                              Debug                                Debug             Debug               Debug

Debug should clearly be last, since that's the cut-over point for running in user-provided code. I think the LB suspension should be first, since everything else that happens in this interval is legitimately overhead. I think tracing should be before LB starts timing, since the unfortunate event of a log flush should be attributed to the processor's background load, and not the object in question. So, my preferred order is as follows (where applicable, for each object type):

What I think I'd like to do about some/all of this is move all of that stuff into a virtual method on the object classes themselves (array element, Chare, Group), that both the generated code and the RTS can call. Thus, we simplify other code, consolidate these widely-spread concerns, and reduce duplication.

In looking at the current implementation, I also noticed that [local] entry methods don't set the event dependency field in tracing, so there's no possibility of traceback through those events. I opened #937 about that.

PhilMiller commented 5 years ago

Original date: 2016-01-04 19:15:50


Partial fix, just for arrays and not groups, pushed here: https://charm.cs.illinois.edu/gerrit/#/c/976/ https://github.com/UIUC-PPL/charm/commit/9fb32c50b57aae8cc82b959a49cd3ce12bd8e17b

ericjbohm commented 5 years ago

Original date: 2016-01-05 00:20:18


Testing of this change shows that it reduces the overhead of [inline] vs if(a.ckLocal)... by an order of magnitude. Former case was that [inline] took 20x time to completion vs manual locality test for a fine grained benchmark. Now it is only 2x. Profiling implicates SDAG constructs used in the [inline] based application level implementation in the remaining overhead. Will update with more detailed results later after more study and experimentation.

PhilMiller commented 5 years ago

Original date: 2016-01-18 00:16:39


Eric M or Nikhil: could one of you test PHOLD against the change linked above to see how performance looks now with [inline] compared to the manual pointer test mentioned above?

PhilMiller commented 5 years ago

Original date: 2016-06-20 20:07:17


As of commit e82cc92d6ba6a2f6a0dbfb155e55fd2e8dbfb822 (including change 976 linked in comment 9):

In the non-SDAG case, passing through by const & cleanly avoids copying for PE-local [inline] calls.

If the entry method definition takes its argument by value, we get a copy, as expected. If it takes its argument by non-const reference, we get a compilation error.

So, on to the SDAG case, I guess.

PhilMiller commented 5 years ago

Original date: 2016-06-20 22:58:31


This should take care of allowing argument passing to [inline] methods by const& to elide copies even through SDAG when the caller and callee are on the same PE: https://charm.cs.illinois.edu/gerrit/1282

PhilMiller commented 5 years ago

Original date: 2017-01-02 17:10:49


More involved wrapper definition that's been lingering on the whiteboard at Charmworks for too long, while this work has been on ice:

template <typename T>
struct arg_wrapper {
  T *t;
  char space[sizeof(T)] alignas(T); // Cray's compiler doesn't support alignas, as of version 8.5
  bool isOwned() { return t == &space; }

  arg_wrapper(const T& v)
  : t(&v)
  { }

  arg_wrapper(T&& v)
  : t(space)
  { new (space) T(move(v)); }

  ~arg_wrapper()
  {
    if (isOwned())
      t->~T();
  }

  operator const T& { return *t; }
  operator T() { return *t; }
  operator T&&()
  {
    if (!isOwned()) {
      new (space) T(*t);
      t = (T*)space; // Added per note in comment #19
    }

    return move(*t);
  }
};

In theory, this should be able to pass everything the right way through.

PhilMiller commented 5 years ago

Original date: 2017-03-09 17:49:57


Comment 18's definition of the rvalue conversion operator was missing t = space; in the non-owned case

nilsdeppe commented 5 years ago

Original date: 2017-09-18 17:59:01


If I've understood the above discussion correctly then this is the right place (otherwise I can create a new ticket). Looking through the generated hpp file I find:

<code class="cpp">
template < typename System >                                                                                                                     
template < typename ReceiveTag, typename ReceiveData_t >                                                                                         
void CkIndex_ElementChare < System > ::_call_receive_data_marshall5(void* impl_msg, void* impl_obj_void)                                         
{                                                                                                                                                
  ElementChare < System > * impl_obj = static_cast<ElementChare < System >  *>(impl_obj_void);                                                   
  CkMarshallMsg *impl_msg_typed=(CkMarshallMsg *)impl_msg;                                                                                       
  char *impl_buf=impl_msg_typed->msgBuf;                                                                                                         
  /*Unmarshall pup'd fields: const Instance_t &instance, const ReceiveData_t &t*/                                                                
  PUP::fromMem implP(impl_buf);                                                                                                                  
  Instance_t instance; implP|instance;                                                                                                           
  ReceiveData_t t; implP|t;                                                                                                                      
  impl_buf+=CK_ALIGN(implP.size(),16);                                                                                                           
  /*Unmarshall arrays:*/                                                                                                                         
  impl_obj->template receive_data< ReceiveTag, ReceiveData_t >(instance, t);                                                                     
} 
</code>

where the last call should be calling std::move() (which is trivial to write outside of the std namespace if necessary). To "emulate" this we currently move the instance and t we receive in ElementChare since we know that it should be an rvalue reference in C++11. However, [inline] entry methods break this (theoretically, it sounds like not in practice) since then I may be passed a legitimate lvalue reference, which I shouldn't move, but instead should either copy or just use once. This can be fixed by using forwarding references in the receive_data function of ElementChare, since a const-lvalue reference cannot be moved. However, it would be much less worrisome if this was handle correctly by calling move in the generated header file, so that using forwarding references with std::forward works correctly. By correctly I mean both in the sense of the standard and in the sense that we should not be passing lvalue references when rvalues are the correct answer.

stwhite91 commented 5 years ago

Original date: 2018-01-17 18:08:35


We may want to re-assign this to Evan? I think he might have some knowledge of this work, and I'm not sure Eric has time for it right now or the next few weeks

ericjbohm commented 5 years ago

Original date: 2018-04-04 17:14:01


Reassigning to Evan in the hopes that he will have the right mix of time and expertise.

evan-charmworks commented 5 years ago

Original date: 2018-04-05 21:30:36


I'm having trouble understanding what problem(s) still need fixing for this issue to be considered complete. It appears that the discussion has shifted since the issue description was originally written, possibly without the original matter being 100% resolved, and I'm unsure if (or how much) the most recent discussion relates to or overlaps with Bug #1699.

stwhite91 commented 5 years ago

Original date: 2018-04-05 21:45:31


Yeah, I am unsure what is left to do here that isn't covered by Bug #1699. It might be good to have another round of profiling/timing done for the following cases on 1 PE, to clarify the state of things:

  1. [inline] entry method
  2. [local] entry method
  3. Normal entry method
  4. Manually checking ckLocal() and then calling the C++ function directly
evan-charmworks commented 5 years ago

Original date: 2018-04-06 00:09:26


Is there any existing code to profile/time these?

stwhite91 commented 5 years ago

Original date: 2018-04-06 00:11:59


tests/charm++/pingpong has some of them, though I don't think it passes its arguments by const&

evan-charmworks commented 5 years ago

Original date: 2018-05-01 17:26:24


Sam White wrote:

Yeah, I am unsure what is left to do here that isn't covered by Bug #1699. It might be good to have another round of profiling/timing done for the following cases on 1 PE, to clarify the state of things:

  1. [inline] entry method
  2. [local] entry method
  3. Normal entry method
  4. Manually checking ckLocal() and then calling the C++ function directly

Should these test cases block v6.9 release?

stwhite91 commented 5 years ago

Original date: 2018-05-01 17:30:56


Not really, we can do the testing on the beta version of the release as part of our usual release testing.