Open shahimshaar opened 3 years ago
Interesting. Are you, by any chance, using Docker or some other container system?
- The status 137 error in that specific dragonfly test has been seen before when someone was using containers (#198). Since it didn't seem to affect regular usage of CODES, it was put on the back burner at the time due to some tight deadlines on my end, followed by the rest of 2020's events!
- Some cursory googling about the mpirun-as-root warning suggests that this also happens with Docker containers and Open MPI. As also noted in #198, adding a user appuser to the container would avoid running mpirun as root.
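For readers wondering what status 137 means: shells conventionally report a process killed by signal N as exit status 128 + N, so 137 corresponds to signal 9 (SIGKILL), which matches the "Killed" line in the log. The sketch below decodes a status that way. The helper name `describe_status` is hypothetical, and it assumes any status above 128 came from a signal. In containers, SIGKILL is often the result of hitting a memory limit, but nothing in this thread confirms that is the cause here.

```shell
#!/bin/sh
# Hypothetical helper (not part of CODES or ROSS): describe a shell exit status.
# Assumes a status above 128 means "terminated by signal (status - 128)".
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix
        echo "killed by signal $sig (SIG$(kill -l "$sig"))"
    else
        echo "exited with code $status"
    fi
}

describe_status 137   # the dragonfly test
describe_status 1     # the prio-sched test
```

The first call reports signal 9 (SIGKILL) and the second reports an ordinary exit code of 1, so the two failures are different in kind: one process was killed from outside, the other exited on its own after the mpirun warning.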
I'll spend some time this weekend to see about making a "building CODES with Docker" workflow. In the meantime, I'd suggest ignoring these failed tests. Let me know if other errors pop up during your usage of CODES.
I’m trying to have ROSS CI testing run CODES tests and just found a hang on modelnet-test-dragonfly-synthetic.sh. It’s most likely related, since the Travis CI tests are running in a container. Is there a way that I can skip certain tests with the make check command?
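One possible approach, assuming the CODES build uses the standard Automake test harness: Automake-generated makefiles generally allow the TESTS variable to be overridden on the make command line, so only the listed tests run. The sketch below builds such a list by removing unwanted entries. The helper `filter_tests` is hypothetical, and the example list contains only the two test names mentioned in this thread, not the full CODES test list.

```shell
#!/bin/sh
# Hypothetical helper: print every test in $1 (space-separated) that does
# not appear in the skip list $2 (space-separated).
filter_tests() {
    all_tests=$1
    skip_tests=$2
    kept=""
    for t in $all_tests; do
        case " $skip_tests " in
            *" $t "*) ;;                      # in the skip list: drop it
            *) kept="${kept:+$kept }$t" ;;    # otherwise keep it
        esac
    done
    printf '%s\n' "$kept"
}

# Illustrative list only; take the real names from the test-suite output.
ALL="tests/modelnet-prio-sched-test.sh tests/modelnet-test-dragonfly-synthetic.sh"
SKIP="tests/modelnet-test-dragonfly-synthetic.sh"
KEEP=$(filter_tests "$ALL" "$SKIP")

# Print the command rather than run it, so the sketch works outside a build tree.
echo "make check TESTS=\"$KEEP\""
# prints: make check TESTS="tests/modelnet-prio-sched-test.sh"
```

Running the printed command from the build directory should restrict make check to the remaining tests, provided the names match the entries in the Makefile's TESTS variable exactly.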
I followed the installation instructions and it all worked with no hiccups, but when I ran 'make check' two tests failed: modelnet-prio-sched-test.sh and modelnet-test-dragonfly-synthetic.sh. Upon running 'cat ./test-suite.log' I got the following result:
================================= codes 1.2: ./test-suite.log
TOTAL: 22
PASS: 20
SKIP: 0
XFAIL: 0
FAIL: 2
XPASS: 0
ERROR: 0
.. contents:: :depth: 2
FAIL: tests/modelnet-test-dragonfly-synthetic.sh
credit_size not specified, using default: 8
no credit_delay specified - all credit delays set to 1.42
Within-node eager limit (node_eager_limit) not specified, setting to 16000
../codes/tests/modelnet-test-dragonfly-synthetic.sh: line 3: 17976 Killed src/network-workloads/model-net-synthetic --sync=1 --num_messages=1 -- $srcdir/src/network-workloads/conf/modelnet-synthetic-dragonfly.conf
FAIL tests/modelnet-test-dragonfly-synthetic.sh (exit status: 137)
FAIL: tests/modelnet-prio-sched-test.sh
Bandwidth of compute node channels not specified, setting to 20.000000
Within-node eager limit (node_eager_limit) not specified, setting to 16000
/home/shahm/codes-dev/build-codes/tests/.libs/modelnet-prio-sched-test --sync=1 -- tests/conf/modelnet-prio-sched-test.conf
Thu Aug 6 17:54:43 2020
ROSS Version: v7.2.0
tw_net_start: Found world size to be 1
NIC num injection port not specified, setting to 1
NIC seq delay not specified, setting to 10.000000
NIC num copy queues not specified, setting to 1
within node transfer per byte delay is 0.050000
ROSS Core Configuration:
    Total PEs 1
    Total KPs [Nodes (1) x KPs (16)] 16
    Total LPs 4
    Simulation End Time 31536000000000000.00
    LP-to-PE Mapping model defined
ROSS Event Memory Allocation:
    Model events 1025
    Network events 16
    Total events 1040
START SEQUENTIAL SIMULATION
Set num_servers per router 1, servers per injection queue per router 1, servers per node copy queue per node 1, num nics 1
END SIMULATION
TW Library Statistics:
    Total Events Processed 511
    Events Aborted (part of RBs) 0
    Events Rolled Back 0
    Event Ties Detected in PE Queues 0
    Efficiency 100.00 %
    Total Remote (shared mem) Events Processed 0
    Percent Remote Events 0.00 %
    Total Remote (network) Events Processed 0
    Percent Remote Events 0.00 %
TW Memory Statistics:
    Events Allocated 1041
    Memory Allocated 618
    Memory Wasted 454
TW Data Structure sizes in bytes (sizeof):
    PE struct 624
    KP struct 144
    LP struct 136
    LP Model struct 96
    LP RNGs 80
    Total LP 312
    Event struct 152
    Event struct with Model 552
TW Clock Cycle Statistics (MAX values in secs at 1.0000 GHz):
    Initialization 0.7451
    Priority Queue (enq/deq) 0.0000
    AVL Tree (insert/delete) 0.0000
    LZ4 (de)compression 0.0000
    Buddy system 0.0000
    Event Processing 0.0000
    Event Cancel 0.0000
    Event Abort 0.0000
TW GVT Statistics: MPI AllReduce
    GVT Interval 16
    GVT Real Time Interval (cycles) 0
    GVT Real Time Interval (sec) 0.00000000
    Batch Size 16
mpirun has detected an attempt to run as root. Running at root is strongly discouraged as any mistake (e.g., in defining TMPDIR) or bug can result in catastrophic damage to the OS file system, leaving your system in an unusable state.
You can override this protection by adding the --allow-run-as-root option to your cmd line. However, we reiterate our strong advice against doing so - please do so at your own risk.
FAIL tests/modelnet-prio-sched-test.sh (exit status: 1)
Any help would be much appreciated!