Closed ericjbohm closed 7 months ago
Original date: 2015-12-14 18:28:32
Another place we saw this being an issue is ROSS. Replacing an inline call by a pointer saving + C++ style call gave us upto 15% performance boost. (PHOLD benchmark is really sensitive for low remote case).
Original date: 2015-12-15 19:46:17
On further study [inline] is stuck with the copy implementation under current semantics. Pass by value means that a copy must be made. It could be made slightly more efficient by not going through the full message construction process, but you are going to suffer copy overhead for [inline] unless we change what it means.
Original date: 2015-12-15 19:54:28
I don't think that's strictly true. If the receiver takes a const T&
then we can pass the object directly through. I think a wrapper to guarantee value semantics on the caller side (pass by const reference if possible, pass by value of a copy otherwise) shouldn't be too hard. Even in the non-const case, that saves us a copy in the [inline] std::vector
case, since we would avoid copying data into a message and then immediately back out - it would go from the original vector passed by the caller, to the argument vector passed to the callee.
Original date: 2015-12-15 20:05:40
Here's a rough draft of the proposed wrapper (untested code):
template <typename T>
struct const_or_copy
{
const T& t;
const_or_copy(const T& arg) : t(arg) { }
operator T() { return t; }
operator const T&() { return t; }
};
If we stick that around each argument that gets passed through, I think we'd be set for any parameter that the user's function ultimately declares as const.
Original date: 2015-12-15 20:23:54
Good point on use of const as a way to support pass by reference when the user function takes const input. I think that allows the optimization to go forward without creating debugging confusion.
Original date: 2015-12-18 21:41:14
So, looking at the RTS code on the delivery path, there is at least some good reason that we go to the trouble - tracing and LB accounting. If we don't want to screw up other functionality, we'll need to refactor things in the messaging infrastructure so that generated code can make that happen properly.
It looks like some of that very same instrumentation is only activated when not doing inline delivery - message creation tracing and the accounting in the LB database are only called if (type == CkDeliver_queue)
. That may just be to avoid incorrectly recording message send twice, though.
On further investigation, part of the necessary refactoring seems to have been done, for ChaNGa using [local]
entry methods, based on comments around CkMigratable::timingBeforeCall
and CkMigratable::timingAfterCall
and the code in xi::Entry::genArrayDefs()
.
Original date: 2016-01-03 03:27:09
On further analysis, I think the right thing to do for now may actually be to have the proxy's method take all arguments by const&
and to always pass those along unmolested. This automatically handles the following concerns without spurious copies or instance constructions:
A
to parameters of type B
(not handled by the above code, without further modification)The cases this won't handle are as follows:
const
reference, leading to compilation failure. This might occur in practice, because existing code generation creates temporary instances that the method may want to manipulate in place, and allow to be discarded after it returns.move()
'd-from variables) to const lvalue referencesThe move optimization can be deferred to later, since it's not necessary for the common copy elisions we're looking at right now.
The failure on non-const references may be bad, because it could force a copy on other cases. The solution to those is that the method should take its parameter by value, then the compiler-/charmxi-generated code will move()
from the temporary or copy as needed.
We can test for move/rvalue support with
#if defined(__cpp_rvalue_references) && __cpp_rvalue_references >= 200610
And if it's not there, then just leave off the std::move
wrapping arguments in the temporary case.
This means that we may need to distinguish [local]
entry methods from others, since those should allow non-const
references to pass through.
Original date: 2016-01-04 02:36:19
So, there's some apparent differences in how various code wrapping array element entry method invocation works between [local]
(implemented by charmxi) and [inline]
(implemented across various bits of RTS) entry methods. Here's the order in which they each push various stuff on the stack, spread over many different functions currently:
[local] generated code [inline] RTS code for array BOC (inline/not) deliver Chare msg [inline] chare
LB suspend current running object LB suspend
Grid Q
LB start timing
Trace execution / Appwork Trace execution / Appwork Trace Trace (never trace?!)
LB suspend, start timing
Debug Debug Debug Debug Debug
Debug should clearly be last, since that's the cut-over point for running in user-provided code. I think the LB suspension should be first, since everything else that happens in this interval is legitimately overhead. I think tracing should be before LB starts timing, since the unfortunate event of a log flush should be attributed to the processor's background load, and not the object in question. So, my preferred order is as follows (where applicable, for each object type):
What I think I'd like to do about some/all of this is move all of that stuff into a virtual method on the object classes themselves (array element, Chare, Group), that both the generated code and the RTS can call. Thus, we simplify other code, consolidate these widely-spread concerns, and reduce duplication.
In looking at the current implementation, I also noticed that [local]
entry methods don't set the event dependency field in tracing, so there's no possibility of traceback through those events. I opened #937 about that.
Original date: 2016-01-04 19:15:50
Partial fix, just for arrays and not groups, pushed here: https://charm.cs.illinois.edu/gerrit/#/c/976/ https://github.com/UIUC-PPL/charm/commit/9fb32c50b57aae8cc82b959a49cd3ce12bd8e17b
Original date: 2016-01-05 00:20:18
Testing of this change shows that it reduces the overhead of [inline] vs if(a.ckLocal)... by an order of magnitude. Former case was that [inline] took 20x time to completion vs manual locality test for a fine grained benchmark. Now it is only 2x. Profiling implicates SDAG constructs used in the [inline] based application level implementation in the remaining overhead. Will update with more detailed results later after more study and experimentation.
Original date: 2016-01-18 00:16:39
Eric M or Nikhil: could one of you test PHOLD against the change linked above to see how performance looks now with [inline]
compared to the manual pointer test mentioned above?
Original date: 2016-06-20 20:07:17
As of commit e82cc92d6ba6a2f6a0dbfb155e55fd2e8dbfb822 (including change 976 linked in comment 9):
In the non-SDAG case, passing through by const &
cleanly avoids copying for PE-local [inline]
calls.
If the entry method definition takes its argument by value, we get a copy, as expected. If it takes its argument by non-const reference, we get a compilation error.
So, on to the SDAG case, I guess.
Original date: 2016-06-20 22:58:31
This should take care of allowing argument passing to [inline] methods by const& to elide copies even through SDAG when the caller and callee are on the same PE: https://charm.cs.illinois.edu/gerrit/1282
Original date: 2017-01-02 17:10:49
More involved wrapper definition that's been lingering on the whiteboard at Charmworks for too long, while this work has been on ice:
template <typename T>
struct arg_wrapper {
T *t;
char space[sizeof(T)] alignas(T); // Cray's compiler doesn't support alignas, as of version 8.5
bool isOwned() { return t == &space; }
arg_wrapper(const T& v)
: t(&v)
{ }
arg_wrapper(T&& v)
: t(space)
{ new (space) T(move(v)); }
~arg_wrapper()
{
if (isOwned())
t->~T();
}
operator const T& { return *t; }
operator T() { return *t; }
operator T&&()
{
if (!isOwned()) {
new (space) T(*t);
t = (T*)space; // Added per note in comment #19
}
return move(*t);
}
};
In theory, this should be able to pass everything the right way through.
Original date: 2017-03-09 17:49:57
Comment 18's definition of the rvalue conversion operator was missing t = space;
in the non-owned case
Original date: 2017-09-18 17:59:01
If I've understood the above discussion correctly then this is the right place (otherwise I can create a new ticket). Looking through the generated hpp file I find:
<code class="cpp">
template < typename System >
template < typename ReceiveTag, typename ReceiveData_t >
void CkIndex_ElementChare < System > ::_call_receive_data_marshall5(void* impl_msg, void* impl_obj_void)
{
ElementChare < System > * impl_obj = static_cast<ElementChare < System > *>(impl_obj_void);
CkMarshallMsg *impl_msg_typed=(CkMarshallMsg *)impl_msg;
char *impl_buf=impl_msg_typed->msgBuf;
/*Unmarshall pup'd fields: const Instance_t &instance, const ReceiveData_t &t*/
PUP::fromMem implP(impl_buf);
Instance_t instance; implP|instance;
ReceiveData_t t; implP|t;
impl_buf+=CK_ALIGN(implP.size(),16);
/*Unmarshall arrays:*/
impl_obj->template receive_data< ReceiveTag, ReceiveData_t >(instance, t);
}
</code>
where the last call should be calling std::move()
(which is trivial to write outside of the std namespace if necessary). To "emulate" this we currently move
the instance
and t
we receive in ElementChare
since we know that it should be an rvalue reference in C++11. However, [inline]
entry methods break this (theoretically, it sounds like not in practice) since then I may be passed a legitimate lvalue reference, which I shouldn't move, but instead should either copy or just use once. This can be fixed by using forwarding references in the receive_data
function of ElementChare
, since a const-lvalue reference cannot be moved. However, it would be much less worrisome if this was handle correctly by calling move
in the generated header file, so that using forwarding references with std::forward
works correctly. By correctly I mean both in the sense of the standard and in the sense that we should not be passing lvalue references when rvalues are the correct answer.
Original date: 2018-01-17 18:08:35
We may want to re-assign this to Evan? I think he might have some knowledge of this work, and I'm not sure Eric has time for it right now or the next few weeks
Original date: 2018-04-04 17:14:01
Reassigning to Evan in the hopes that he will have the right mix of time and expertise.
Original date: 2018-04-05 21:30:36
I'm having trouble understanding what problem(s) still need fixing for this issue to be considered complete. It appears that the discussion has shifted since the issue description was originally written, possibly without the original matter being 100% resolved, and I'm unsure if (or how much) the most recent discussion relates to or overlaps with Bug #1699.
Original date: 2018-04-05 21:45:31
Yeah, I am unsure what is left to do here that isn't covered by Bug #1699. It might be good to have another round of profiling/timing done for the following cases on 1 PE, to clarify the state of things:
Original date: 2018-04-06 00:09:26
Is there any existing code to profile/time these?
Original date: 2018-04-06 00:11:59
tests/charm++/pingpong has some of them, though I don't think it passes its arguments by const&
Original date: 2018-05-01 17:26:24
Sam White wrote:
Yeah, I am unsure what is left to do here that isn't covered by Bug #1699. It might be good to have another round of profiling/timing done for the following cases on 1 PE, to clarify the state of things:
- [inline] entry method
- [local] entry method
- Normal entry method
- Manually checking ckLocal() and then calling the C++ function directly
Should these test cases block v6.9 release?
Original date: 2018-05-01 17:30:56
Not really, we can do the testing on the beta version of the release as part of our usual release testing.
Original issue: https://charm.cs.illinois.edu/redmine/issues/921
I've been studying a fine grained code which attempts to use [inline] to reduce the overhead of computing across elements when they are known to be co-resident on the PE.
However, [inline] performs over an order of magnitude worse than if(proxy[i]->ckLocal()){ user function } else { proxy[i].entry()}
In the case of a marshaled call we go through the full process of constructing the message before we even test whether the target can be inlined. Then execution goes through CkSend with CK_SEND_INLINE set so that we ultimately land in CkSendMsgInline. Which will unpack the data we just packed, and call the function through its entry wrapper, with tracing disabled. In terms of runtime system overhead, the only real benefit to this scheme is that we don't take a trip through the scheduler. We're still incurring message pack and unpack overhead for data which doesn't actually have to go anywhere.
Is there a good reason why we need to abide by that restriction? Why can't we generate code which will test locality and make a function call to the user code, and only hit the rest of this overhead in the else case?