OFFIS-DAI / cosima

cosima integrated the communication simulator OMNeT++ into agent communication
https://cosima.offis.de/
GNU Lesser General Public License v2.1
2 stars 1 forks source link

Simulation stops before the expected time (research proj) #30

Open ihcsof opened 1 week ago

ihcsof commented 1 week ago

Hello,

I'm currently working on a research project that involves Cosima, the coupling between OMNeT++ and Mosaik. I’ve successfully integrated my simulator into the Mosaik scenario along with Cosima, and most of the time, everything works smoothly.

However, occasionally, depending on the graph and/or network being used, Mosaik stops prematurely before reaching the expected end, therefore interrupting the convergence of the graph. Although I've set the "until END" value quite high, the simulation sometimes halts, indicating it has received the "until" event and closes gracefully. This occurs inconsistently, even with the same exact configuration, where it can both fail and succeed.

After numerous attempts to resolve the problem, I’ve decided to seek help. I appreciate anyone taking the time to read this and provide any insights. If you are sure that this problem is not cosima-related feel free to tell me, I'm here to discover and learn :)

Thank you in advance, this is an example trace:

`2024-09-13 23:54:49:510600 | DEBUG | OMNeT++: client40 received traffic message at time 49.9860838 2024-09-13 23:54:49:510691 | DEBUG | OMNeT++: MosaikSchedulerModule: received until event. 2024-09-13 23:54:49:511005 | DEBUG | OMNeT++: MosaikScheduler: end simulation

<!> End of simulation

2024-09-13 23:54:49:657826 | DEBUG | OMNeT++: MosaikSchedulerModule: received max advance event at time 99999.999 2024-09-13 23:54:49:658073 | DEBUG | OMNeT++: MosaikScheduler: insert event 'send message to mosaik' for time 99999.999000 2024-09-13 23:54:49:658184 | DEBUG | OMNeT++: MosaikScheduler: send message with max advance info back to mosaik with Message id MaxAdvanceMessage_0 2024-09-13 23:54:49:658280 | DEBUG | OMNeT++: MosaikSchedulerModule: received event in order to send info back to mosaik at time 99999.999 2024-09-13 23:54:49.674264 | DEBUG | mosaik: case received max advance message with simtime 99999999 and max advance 99999999 2024-09-13 23:54:49.674358 | DEBUG | mosaik: CommunicationSimulator returns 99999999 as next_step at time 28981 2%|███▍ | 2353111/99999999 [22:16<15:24:23, 1760.55steps/s] 2024-09-13 23:54:49.677339 | DEBUG | mosaik: close connection called`

MalinRadtke214 commented 1 week ago

Hi,

glad you were able to successfully use the cosima project until the error occurred!

Regarding your error: Have you tried both fetching the latest version of cosima and using the recommended version of mosaik?

Beyond that, it is unfortunately somewhat difficult to recognize the reason. A hint would perhaps be that exceptions in mosaik are sometimes not output correctly in combination with OMNeT++. Perhaps it is worth looking for possible such error cases and integrating try-catch blocks.

ihcsof commented 1 week ago

Hi, unfortunately for now I'm "locked" to the versions of your provided Dockerfile, even if I checked for new versions and it seems that there're no changes since when I pulled. Thanking for your response I'm going to ask you a little detailed thing, that maybe is better understandable: As seen in the trace above, at a certain point the MosaikSchedulerModule receives an until event in handleMessage, but at that moment the omnet simTime() call already returns 999999 (the maxAdvance value, in fact I set until=that value in the mosaik world). The problem is that I can't trace this simTime() jump from whataver value (can be anything, is apparently "non deterministic") to the until value. So even if I manage to manually block the until event, obviously the simTime() will now return 999999 for every other call, eventually leading to the end of the simulation. So it would be amazing even just to know how to trace up this simulation time sudden change, because my only guess for now is that the omnet kernel changes it to 999999 in its library, and that's why I can't understand where it happens (I swear I tried printing simTime everywhere and I can't find a point in which it skips to that giant value before that handleMessage call). Is there some subtle interaction that I'm losing? Is there a way to prevent this from happen and just continue the simulation? Thank you really a lot!

Ps: the only thing that seems to be consistently present in all the runs, just before the failure, is the presence of a slightly higher number of "socket data arrived" messages, but everything I printed there seems to be okay. I also noted that the number of present messages is increasing (for omnet docs seems to be a possible segfault, but I never touched .cc codes so there shouldn't be a missed delete on my side, idk if in my version of cosima there is a now solved bug of this type)

MalinRadtke214 commented 4 days ago

Hi,

thanks again for the detailed explanations. I have three points:

1) In the logs above, it seems like there is somehow still an old version of cosima. The “MosaikScheduler” is now called “CosimaScheduler”. Please try to update the git project.

2) Also in the log above, it seems that at time 49.98... is currently a TrafficMessage that arrives at client40. Is this intentional? In any case, it would be worth trying to select the traffic configuration in scenario_config (https://github.com/OFFIS-DAI/cosima/blob/master/scenario_config.py) as follows: TRAFFIC_CONFIGURATION = []

3) Is it possible that this is actually the desired behavior and that there are actually no more events in OMNeT++ and/or mosaik at this point?

Please let me know how it works with these changes.

ihcsof commented 4 days ago

Thanks again for the response, in the meantime I advanced a bit in the exploration. Let me touch the three points: 1) Ok, I'll try with an updated version, I didn't notice that and from the date it seemed all updated 2) this is not sensible, the traffic was just a case because I was using it, but the bug is present also with an empty traffic config 3) I checked and there are a lot of events present (even too much, and in fact I thought it could be an overload due to mem leaks, but I excluded this option because it can stops in every moment (1s, 20s, 100s, 150s, whataver) so how can the simulator handle 150s once and not handle 1s?). I now append my new discoveries:

In this omnet function, at HEY1 the time is still OK, then in HEY3 it suddenly becomes 999999:

void cSimulation::executeEvent(cEvent *event)
{
#ifndef NDEBUG
    checkActive();
#endif
    setContextType(CTX_EVENT);

    // increment event count
    currentEventNumber++;

    std::cout << "HEY1: sim time: " << simTime() << std::endl;

    // advance simulation time
    currentSimtime = event->getArrivalTime();

    std::cout << "HEY3: sim time: " << simTime() << std::endl;

In the getArrivalTime() function in cevent.h it becomes coherently 999999 in that time, so later on I'll continue inspecting cevent functions to understand what is happening, but maybe this gives you a better idea. Thanks!

ihcsof commented 4 days ago

Unfortunately updating to the very last cosima version didn't helped

MalinRadtke214 commented 3 days ago

Thank you for the explanations.

I'm not sure if this is clear at this point, so here it is again as an explanation: OMNeT++ uses an event-based mechanism for scheduling. This means that the simulation time does not progress at fixed intervals (e.g. 1ms, 2ms, 3ms, ...), but according to the times of the events in the simulation. This means that if there are two events event1 (at time 1ms) and event2 (at time 20ms), then the simulation times would look like this: 1ms, 20ms In this case, this means that an event for the end of the simulation (the so-called UNTIL event) is already inserted at the start of the simulation in OMNeT++. This has the time 99999999. If no more events are scheduled in the future event set in OMNeT++, then the simulation time “jumps” to this event, i.e. to the time 99999999. This is exactly the behavior that can be observed here. The error therefore lies in the fact that there are no more events in the FES that occur before the end of the simulation.

Is this comprehensible and does it perhaps help with debugging?

ihcsof commented 3 days ago

Yes, it surely helps! I excluded this because I saw a lot of events in "present" and still events in "FES", but maybe all the events in FES are equally scehduled at 999999 and therefore Omnet skips. If I'm right "present" is not a parameter to take into account for my problem, but only FES, am I right? I'll try to understand why no more events are present and try to find a workaround for that (maybe I have to schedule events more frequently or keep the simulation alive with other methods). Thank you for your precious help

MalinRadtke214 commented 2 days ago

Great, that sounds good.

I'm not quite sure what the aim of the simulation is supposed to be, but theoretically this could simply be the normal, intended (?) behavior here.

Do I understand correctly that the issue is closed for now?

If so, I would close it and would also be available by email (malin.radtke@offis.de) if you have any further questions.

ihcsof commented 2 days ago

The aim of the simulation is to reach the convergence (which I'm sure that it will), so this bug (that is still present) prevents the correct behavior. If it's better for you we can continue via email and close it, but no it's not solved yet. I will investigate further with my research team in the next few days. If you know how to properly debug the events that were present at the unexpected end of the simulation it would be a good information! By the way, thank you for all the responses.

MalinRadtke214 commented 2 days ago

Perhaps you could get in touch by email and we could arrange an online meeting to talk? It's a bit difficult to understand what the problem is in the written form.

ihcsof commented 2 days ago

We kindly appreciate your help! I'm going to send you an email