charmplusplus / projections

Performance Analysis Tool for Charm++
Apache License 2.0

Timeline backtrace failure on OpenAtom traces #14

Closed ericjbohm closed 10 years ago

ericjbohm commented 10 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/457


Trace to sender, as invoked from the right-click pulldown menu, fails for nearly all entry methods in OpenAtom traces. The error message is "Message was sent from outside the current time range". However, that is clearly a misdiagnosis. When you have loaded an entire time step and examine methods from phases late in the time step, there is no way for the sender to have come from outside the step.

See ~bohm/work/w256_4k.t4 for example traces. It has a saved interval which matches a complete timestep. You can load a few timelines (0-3) and try to backtrace any method. The only backtrace invocations that succeeded for me were for messages where sender PE = receiver PE.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-03-31 22:11:27


Is this using uGNI SMP? I can see some trace-backs working and some not. Most likely this is due to a problem with tracing on the comm thread. Without comm thread tracing, it will likely work.

ericjbohm commented 5 years ago

Original date: 2014-03-31 22:44:52


Yanhua Sun wrote:

Is this using uGNI SMP? I can see some trace-backs working and some not. Most likely this is due to a problem with tracing on the comm thread. Without comm thread tracing, it will likely work.

Yes, those are runs on Blue Waters uGNI SMP with comm thread tracing enabled. I'll try a version without comm thread tracing.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-03-31 23:17:12


One thing I noticed is that some entry methods are declared to be 'local'. This confuses Projections, since there are nested execution entries.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-04-01 03:57:51


I figured it out and submitted a fix in Gerrit. If you do not want to wait for that long process, you can simply make the change yourself. In cklocation.C, line 1774, change the call to:
_TRACE_BEGIN_EXECUTE_DETAILED(env->getEvent(), ForChareMsg, epIdx, env->getSrcPe(), env->getTotalsize(), idx.getProjectionID(env->getArrayMgrIdx()));

The main change is getSrcPe().

Can you try this with OpenAtom and see whether it fixes the problem? I tried simple programs and it worked.
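
To illustrate why the recorded source PE matters, here is a minimal sketch, assuming a viewer that finds the sending event by looking up the (source PE, event ID) pair stored with the execution. This is not Projections' or Charm++'s actual code; CreationRecord, TraceIndex, addRecord and findSender are hypothetical names. Under that assumption, a wrong source PE makes the lookup miss unless the sender and receiver happen to be the same PE, which would match the symptom reported above.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical, simplified send record; not a Projections or Charm++ type.
struct CreationRecord {
    int pe;         // PE whose log contains the send
    int eventId;    // event ID assigned on that PE
    double timeUs;  // time of the send, in microseconds
};

// Hypothetical index over all loaded send records, keyed by (PE, event ID).
using TraceIndex = std::map<std::pair<int, int>, CreationRecord>;

// Add one send record; each (PE, event ID) key is expected to be unique.
void addRecord(TraceIndex& index, const CreationRecord& rec) {
    bool inserted = index.emplace(std::make_pair(rec.pe, rec.eventId), rec).second;
    assert(inserted);
    (void)inserted;  // silence unused-variable warnings when NDEBUG is defined
}

// Look up the send for an execution, using the source PE and event ID that
// were recorded with the execution. Returns nullptr when no record matches.
const CreationRecord* findSender(const TraceIndex& index, int recordedSrcPe, int eventId) {
    auto it = index.find(std::make_pair(recordedSrcPe, eventId));
    return it == index.end() ? nullptr : &it->second;
}
```

With a send logged on PE 2 as event 7, findSender(index, 2, 7) finds it, while a lookup with any other PE returns nullptr even though the send lies inside the loaded range.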

ericjbohm commented 5 years ago

Original date: 2014-04-01 19:01:34


I tested it and it does seem to work for the entry methods which are not [local]. So I think that the original bug is fixed.

It is a bit awkward in practice. Every worker thread traces back to the comm thread, and then you have to manually zoom in on the comm thread to find the corresponding sender event and initiate a backtrace from there to get to the external sender. Not a bug exactly, but it violates the user's desire to get to the actual sender rather than the intermediary event on the comm thread. There may be some use cases where the user just wants the immediate predecessor on the comm thread, but I think most backtrace usage falls in the class of wanting it to automatically provide both the comm thread hop and the external sender in one click.

On a related note, we use [local] to capture the situation when an event triggers a chain of categorically different computations. The whole point of [local] is to make these visible to Projections, so if they cause some confusion to Projections, then [local] is a failed feature. We were operating under the assumption that [local] events would be considered children of the enclosing parent event, and therefore a backtrace from a local should always go to the parent. Some of that will be cleared up by a conversion to SDAG, but I think we want [local] to have well-defined behavior regardless.
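
The parent/child assumption for [local] events can be sketched as an interval-containment search on one PE's timeline. This is only an illustration of the intended behavior, not how Projections resolves nesting; Execution and enclosingParent are hypothetical names, and it assumes nested executions are recorded with begin/end times that fall inside their parent's.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical execution interval on a single PE's timeline.
struct Execution {
    double beginUs;
    double endUs;
    bool isLocal;  // true for an entry method declared [local]
};

// Return the index of the innermost execution whose interval encloses
// executions[child], or -1 if no other execution encloses it.
int enclosingParent(const std::vector<Execution>& executions, std::size_t child) {
    assert(child < executions.size());
    const Execution& c = executions[child];
    int parent = -1;
    for (std::size_t i = 0; i < executions.size(); ++i) {
        if (i == child) continue;
        const Execution& e = executions[i];
        bool encloses = e.beginUs <= c.beginUs && c.endUs <= e.endUs;
        if (!encloses) continue;
        // Prefer the tightest enclosing interval, i.e. the latest begin time.
        if (parent < 0 || e.beginUs > executions[parent].beginUs) {
            parent = static_cast<int>(i);
        }
    }
    return parent;
}
```

Under this rule, a backtrace from a [local] execution would first jump to the index returned by enclosingParent and continue from that parent's sender.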

I can submit these as separate issues to redmine if you like.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-04-01 20:35:29


Tracing one message back will go to its comm thread. However, if you select trace back in the menu, you can trace back multiple messages.

With comm thread tracing, the original sender information is lost, replaced by the intermediate sender. If we want both, we need to modify the log format, which will be more work.

Also, if you want the real external sender, you can disable comm thread tracing.

ericjbohm commented 5 years ago

Original date: 2014-04-01 20:42:43


When using comm thread tracing, you usually want to see both the intermediate comm thread event and the external node sender. Why not have a Projections trace-back feature that would trace back through the comm thread to the originator?

PhilMiller commented 5 years ago

Original date: 2014-04-01 20:43:08


Since it sounds like the comm thread event will trace back to the external send, it would seem natural to have traced-back events that come from a comm thread to automatically trace back a step further (or two steps, if the remote comm thread sent on a worker thread's behalf) to their external source when clicked.
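
A minimal sketch of that multi-hop behavior, assuming each event records the ID of the event that sent its message and whether it ran on a comm thread. Event and traceToExternalSender are hypothetical names, not part of Projections, and the maxHops limit is an added assumption to keep the walk bounded.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical event record; field names are illustrative only.
struct Event {
    int id;
    int senderId;       // ID of the event that sent the message, or -1 if unknown
    bool onCommThread;  // true if this event ran on a comm thread
};

// Follow sender links from startId and return the IDs of the senders visited.
// The walk continues past comm-thread events and stops at the first sender
// that is not on a comm thread, at a missing sender, or after maxHops links.
std::vector<int> traceToExternalSender(const std::map<int, Event>& events,
                                       int startId, int maxHops) {
    assert(maxHops >= 0);
    std::vector<int> chain;
    auto it = events.find(startId);
    while (it != events.end() && static_cast<int>(chain.size()) < maxHops) {
        auto next = events.find(it->second.senderId);
        if (next == events.end()) break;         // sender not in the loaded data
        chain.push_back(next->first);
        if (!next->second.onCommThread) break;   // reached a worker-thread sender
        it = next;
    }
    return chain;
}
```

For a message that passed through a remote and a local comm thread, the returned chain holds both comm-thread events followed by the originating worker event, so a viewer could highlight every hop from a single click.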

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-04-01 20:46:31


The function is there. By default it now tracks back 10+ messages. We could add an option to let users decide how many further messages to track back. I will assign this as an exercise to the junior students who work on Projections. It should be straightforward.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-04-02 16:46:19


I will close this thread and open a separate issue for tracing back inline (local) entry methods.