INCF / MUSIC

MUSIC, the MUltiSimulation Coordinator
GNU General Public License v3.0

Segmentation fault when launching with POSTPONE:<color> #25

Open uahic opened 8 years ago

uahic commented 8 years ago

Use case: I am trying to port the PyNN MUSIC branch to PyNN 0.8.1 and NEST 2.10. The PyNN part was no problem, but a segmentation fault occurs within MUSIC. This happens independently of PyNN-MUSIC.

The error can be reproduced by launching MUSIC with _MUSIC_CONFIG=POSTPONE:0, either with 'python ' or 'mpiexec -np 1 ' (launching MUSIC as a single process).
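
A minimal Python sketch of that launch environment, for illustration only. The helper name `postponed_env` is hypothetical, and the variable name is copied as written above; MUSIC's source should be checked for the exact spelling. The command to launch is deliberately not part of the sketch.

```python
import os


def postponed_env(color, base_env=None, var_name="_MUSIC_CONFIG"):
    """Return a copy of an environment with the POSTPONE marker set.

    var_name is taken as spelled in this report (an assumption, not
    verified against the MUSIC source); color is the value after the colon.
    """
    # Copy so that neither os.environ nor the caller's dict is modified.
    env = dict(os.environ if base_env is None else base_env)
    env[var_name] = f"POSTPONE:{color}"
    return env


if __name__ == "__main__":
    # Start from an empty environment to show only the added entry.
    print(postponed_env(0, base_env={}))
```

The returned dictionary can be passed as `env=` to `subprocess.run`, together with whichever command reproduces the crash.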

@mdjurfeldt From what I can observe, the old Python config API sets POSTPONE: but the configuration parser actually expects an ApplicationMap section within the ENV. Maybe I have missed a step, but in the current state it looks like either music-config/config.py must provide a full application map in the ENV (i.e. I need to change the way PyNN-Music/multisim.py assembles this) or the MUSIC C++ code must be adapted.

Another question: what does the runtime actually do when postpone is true? It calls maybePostponedSetup, but I don't see where the updated ENV values are actually parsed. Maybe I am missing something?

```cpp
void Setup::maybePostponedSetup ()
{
  if (postponeSetup)
    {
      delete config;
      config_ = new Configuration ();
      fullInit ();
    }
}

void Setup::fullInit ()
{
  errorChecks ();
  if (!config ("timebase", &timebase))
    timebase = MUSIC_DEFAULTTIMEBASE; // default timebase
  string binary;
  config->lookup ("binary", &binary);
  string args;
  config->lookup ("args", &args);
  argv = parseArgs (binary, args, &argc);
  temporalNegotiator = new TemporalNegotiator (this);
}
```

uahic commented 8 years ago

I adjusted the Configuration class so that POSTPONE is now handled properly.

However, I would like to know how to start MPI so that MUSIC does not treat all of the MPI nodes as members of the same simulation group, and so that two NEST simulation groups can communicate with each other.

mdjurfeldt commented 8 years ago

Sorry for not returning to the POSTPONE issue for such a long time. I'm not sure what happened to this code, which was correct before. If you think that your fix is the right one, you are of course free to submit a pull request. In either case, I need to look at this and will do so ASAP.

Could you please clarify what you mean by your comment about simulation groups? Each MUSIC-aware application gets its own intracommunicator, associated with its own MPI process group, to be used for its internal communication. What happens above this depends on the communication algorithm selected. For pairwise communication, intercommunicators are created for use by MUSIC ports (these are not available through the MUSIC API). For collective communication, a communicator covering all MPI processes is instead used internally.

Given this, you can have two instances of NEST running as separate MUSIC-aware applications, with their own MPI process groups, and these can communicate with each other through MUSIC ports. This sounds similar to what you request, but probably isn't. What is it that you request?
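
To illustrate the process groups described above: the pure-Python sketch below mimics how `MPI_Comm_split` assigns each world rank to a group by color and gives it a new rank inside that group. It uses neither MPI nor MUSIC, and the layout (two ranks for one application, three for another) is made up. Whether MUSIC builds its groups exactly this way is an assumption.

```python
from collections import defaultdict


def split_by_color(colors):
    """Mimic the grouping done by MPI_Comm_split with equal keys.

    colors[world_rank] is the color chosen by that rank. The result maps
    each color to its member world ranks in ascending order, so a member's
    index in the list plays the role of its rank in the new communicator.
    """
    groups = defaultdict(list)
    for world_rank, color in enumerate(colors):
        groups[color].append(world_rank)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical layout: color 0 = first application, color 1 = second.
    colors = [0, 0, 1, 1, 1]
    for color, members in sorted(split_by_color(colors).items()):
        for local_rank, world_rank in enumerate(members):
            print(f"world rank {world_rank}: application {color}, "
                  f"local rank {local_rank}")
```

In this picture, each application only ever sees its own member list, while communication between the two lists corresponds to what the MUSIC ports handle.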

uahic commented 8 years ago

This answer is really what I was looking for. Here is my current understanding (feel free to correct me where I am wrong):

Using MUSIC via music binary (mpiexec .... music

If I launch two different NEST simulations that are coupled via MUSIC (this is what I, imprecisely speaking, meant by groups), it all works fine.

When launching mpiexec python , MUSIC does not complain anymore after the little bugfix, but NEST does. In more detail: it says that the random number generators are not in synchrony. All of the MPI ranks seem to be in the same group with respect to NEST. I have not looked deeper into NEST's MPI management, but I think it is simply using COMM_WORLD. Maybe intracommunicators are used (actually they must be) when MUSIC is enabled, but it does not seem to work the way I am doing it right now. I probably need to dig through some NEST code to get a better understanding, unless you already know the answer?

Back to the pull request: does POSTPONE work for you? If yes, which MUSIC and PyNN versions are you using? I fixed it for the most recent MUSIC version and merged the music subfolder from the old PyNN into the 0.8.1 version of the repo (so I might need to ship that as well if you want to test it).

mdjurfeldt commented 8 years ago

Your understanding is correct.

Can you provide a simple test case demonstrating your python problem such that I can reproduce it on my machine? I will then debug it.

Getting back to you regarding POSTPONE.