Random placement locks up when there is 200+ devices (on Jennings)

jrbeaumont commented 3 years ago

When DPD runs on Jennings (double instruction space) with random placement, it seems to lock up before anything happens, as if messages are not getting through. This does not happen with bucket placement.

The branch I'm using is the latest pull on FEATURE-0167-buffering_softswitch. I ran make clean and make all again before running this.

The input file in question can be found on Jennings at /home/jrbeaumont/dpd-baremetal/generated-xmls/dpd_oil_water_6_6_6.xml. (Attached as well for ease.) Any DPD volume larger than this suffers from the same problem.

Attached is the dump from place /dump = *, and all micrologs generated from loading the XML, to running it, to stopping it and dumping the placement.

rand-placement-issue.zip

heliosfa commented 3 years ago

Changing to bucket placement causes a seg fault. Valgrind points to Supervisor. Will debug further.

Message received
==655== Thread 4:
==655== Invalid read of size 4
==655==    at 0x5DC85DB: __fprintf_chk (fprintf_chk.c:30)
==655==    by 0x46ED8990: Supervisor::OnImplicit(poets_packet*, std::vector<poets_address_packet, std::allocator<poets_address_packet> >&) (in /home/gmb/.orchestrator/app_binaries/dpd_simulator__dpd_oil_water_6_6_6/libSupervisor.so)
==655==    by 0x46ED7894: SupervisorCall (in /home/gmb/.orchestrator/app_binaries/dpd_simulator__dpd_oil_water_6_6_6/libSupervisor.so)
==655==    by 0x13BEBD: SuperDB::call_supervisor(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<poets_packet, std::allocator<poets_packet> >&, std::vector<poets_address_packet, std::allocator<poets_address_packet> >&) (in /home/gmb/Orchestrator/bin/mothership)
==655==    by 0x1355B9: Mothership::handle_msg_bend_supr(PMsg_p*) (in /home/gmb/Orchestrator/bin/mothership)
==655==    by 0x14B2E9: ThreadComms::mpi_application_resolver(void*) (in /home/gmb/Orchestrator/bin/mothership)
==655==    by 0x52D96DA: start_thread (pthread_create.c:463)
==655==    by 0x5DB7A3E: clone (clone.S:95)
==655==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==655==
==655==
==655== Process terminating with default action of signal 11 (SIGSEGV)
==655==  Access not within mapped region at address 0x0
==655==    at 0x5DC85DB: __fprintf_chk (fprintf_chk.c:30)
==655==    by 0x46ED8990: Supervisor::OnImplicit(poets_packet*, std::vector<poets_address_packet, std::allocator<poets_address_packet> >&) (in /home/gmb/.orchestrator/app_binaries/dpd_simulator__dpd_oil_water_6_6_6/libSupervisor.so)
==655==    by 0x46ED7894: SupervisorCall (in /home/gmb/.orchestrator/app_binaries/dpd_simulator__dpd_oil_water_6_6_6/libSupervisor.so)
==655==    by 0x13BEBD: SuperDB::call_supervisor(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<poets_packet, std::allocator<poets_packet> >&, std::vector<poets_address_packet, std::allocator<poets_address_packet> >&) (in /home/gmb/Orchestrator/bin/mothership)
==655==    by 0x1355B9: Mothership::handle_msg_bend_supr(PMsg_p*) (in /home/gmb/Orchestrator/bin/mothership)
==655==    by 0x14B2E9: ThreadComms::mpi_application_resolver(void*) (in /home/gmb/Orchestrator/bin/mothership)
==655==    by 0x52D96DA: start_thread (pthread_create.c:463)
==655==    by 0x5DB7A3E: clone (clone.S:95)
==655==  If you believe this happened as a result of a stack
==655==  overflow in your program's main thread (unlikely but
==655==  possible), you can try to increase the size of the
==655==  main thread stack using the --main-stacksize= flag.
==655==  The main thread stack size used in this run was 8388608.
==655==
==655== HEAP SUMMARY:
==655==     in use at exit: 1,005,149,523 bytes in 7,532 blocks
==655==   total heap usage: 375,957 allocs, 368,425 frees, 1,018,259,788 bytes allocated
==655==
==655== LEAK SUMMARY:
==655==    definitely lost: 117 bytes in 1 blocks
==655==    indirectly lost: 0 bytes in 0 blocks
==655==      possibly lost: 1,824 bytes in 6 blocks
==655==    still reachable: 1,005,147,582 bytes in 7,525 blocks
==655==         suppressed: 0 bytes in 0 blocks
==655== Rerun with --leak-check=full to see details of leaked memory
==655==
==655== For counts of detected and suppressed errors, rerun with: -v
==655== Use --track-origins=yes to see where uninitialised values come from
==655== ERROR SUMMARY: 280029 errors from 12 contexts (suppressed: 0 from 0)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 655 RUNNING AT jennings
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

mvousden commented 3 years ago

This also segfaulted when we tried it. GMB is attaching a Valgrind dump. This error has a similar pathology to another issue I'm investigating at the moment (with GMB's reactive application), so we are investigating!

mvousden commented 3 years ago

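This segfault is triggered because there's an fopen in user code whose return value is never checked! The code assumes a directory already exists on the user's filesystem without first creating it, so when the directory is missing fopen returns NULL and the later fprintf in Supervisor::OnImplicit reads from address 0x0, as the Valgrind trace shows.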
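
For reference, a minimal sketch of the kind of guard that would avoid this class of crash, assuming a POSIX system. The helper names (`ensure_directory`, `open_log`, `write_line`) are hypothetical and are not taken from the DPD application or the Orchestrator; the idea is only to create the directory if it is missing and to check what fopen returns before writing.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>
#include <sys/stat.h>   // POSIX mkdir() and stat()
#include <sys/types.h>

// Hypothetical helper: make sure one directory level exists.
// Returns true if `dir` is a directory when the call returns.
// Note: this does not create missing parent directories.
bool ensure_directory(const std::string& dir)
{
    if (mkdir(dir.c_str(), 0755) == 0) return true;   // we created it
    if (errno != EEXIST) return false;                // e.g. ENOENT, EACCES
    struct stat st;
    return stat(dir.c_str(), &st) == 0 && S_ISDIR(st.st_mode);
}

// Hypothetical helper: open dir/name for appending.
// Returns nullptr (after reporting the reason) instead of a handle
// the caller might blindly pass to fprintf.
FILE* open_log(const std::string& dir, const std::string& name)
{
    if (!ensure_directory(dir)) {
        std::fprintf(stderr, "Cannot use directory '%s': %s\n",
                     dir.c_str(), std::strerror(errno));
        return nullptr;
    }
    const std::string path = dir + "/" + name;
    FILE* fp = std::fopen(path.c_str(), "a");
    if (fp == nullptr) {
        std::fprintf(stderr, "Cannot open '%s': %s\n",
                     path.c_str(), std::strerror(errno));
    }
    return fp;
}

// Hypothetical helper: write one line only if the handle is valid.
bool write_line(FILE* fp, const std::string& line)
{
    if (fp == nullptr) return false;   // skip the write rather than crash
    return std::fprintf(fp, "%s\n", line.c_str()) >= 0;
}
```

In a supervisor handler, the call to fprintf would then go through something like `write_line`, so a missing directory produces an error message rather than a SIGSEGV in the Mothership process.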

POETSII / Orchestrator

Random placement locks up when there is 200+ devices (on Jennings) #173