codes-org / codes

The Co-Design of Exascale Storage Architectures (CODES) simulation framework builds upon the ROSS parallel discrete event simulation engine to provide high-performance simulation utilities and models for building scalable distributed systems simulations

Failed modelnet-prio-sched-test.sh and modelnet-test-dragonfly-synthetic.sh on Ubuntu #206

Open shahimshaar opened 3 years ago

shahimshaar commented 3 years ago

I followed the installation instructions and everything worked with no hiccups, but 'make check' failed two tests: modelnet-prio-sched-test.sh and modelnet-test-dragonfly-synthetic.sh. Running 'cat ./test-suite.log' gave the following result:

================================= codes 1.2: ./test-suite.log

TOTAL: 22
PASS: 20
SKIP: 0
XFAIL: 0
FAIL: 2
XPASS: 0
ERROR: 0

.. contents:: :depth: 2

FAIL: tests/modelnet-test-dragonfly-synthetic.sh

credit_size not specified, using default: 8
no credit_delay specified - all credit delays set to 1.42
Within-node eager limit (node_eager_limit) not specified, setting to 16000
../codes/tests/modelnet-test-dragonfly-synthetic.sh: line 3: 17976 Killed src/network-workloads/model-net-synthetic --sync=1 --num_messages=1 -- $srcdir/src/network-workloads/conf/modelnet-synthetic-dragonfly.conf
FAIL tests/modelnet-test-dragonfly-synthetic.sh (exit status: 137)

FAIL: tests/modelnet-prio-sched-test.sh

Bandwidth of compute node channels not specified, setting to 20.000000
Within-node eager limit (node_eager_limit) not specified, setting to 16000
/home/shahm/codes-dev/build-codes/tests/.libs/modelnet-prio-sched-test --sync=1 -- tests/conf/modelnet-prio-sched-test.conf

Thu Aug 6 17:54:43 2020

ROSS Version: v7.2.0

tw_net_start: Found world size to be 1
NIC num injection port not specified, setting to 1
NIC seq delay not specified, setting to 10.000000
NIC num copy queues not specified, setting to 1
within node transfer per byte delay is 0.050000

ROSS Core Configuration:
Total PEs 1
Total KPs [Nodes (1) x KPs (16)] 16
Total LPs 4
Simulation End Time 31536000000000000.00
LP-to-PE Mapping model defined

ROSS Event Memory Allocation:
Model events 1025
Network events 16
Total events 1040

START SEQUENTIAL SIMULATION

Set num_servers per router 1, servers per injection queue per router 1, servers per node copy queue per node 1, num nics 1
END SIMULATION

: Running Time = 0.0002 seconds

TW Library Statistics:
Total Events Processed 511
Events Aborted (part of RBs) 0
Events Rolled Back 0
Event Ties Detected in PE Queues 0
Efficiency 100.00 %
Total Remote (shared mem) Events Processed 0
Percent Remote Events 0.00 %
Total Remote (network) Events Processed 0
Percent Remote Events 0.00 %

Total Roll Backs                                             0
Primary Roll Backs                                           0
Secondary Roll Backs                                         0
Fossil Collect Attempts                                      0
Total GVT Computations                                       0

Net Events Processed                                       511
Event Rate (events/sec)                              2823204.4
Total Events Scheduled Past End Time                         0

TW Memory Statistics:
Events Allocated 1041
Memory Allocated 618
Memory Wasted 454

TW Data Structure sizes in bytes (sizeof):
PE struct 624
KP struct 144
LP struct 136
LP Model struct 96
LP RNGs 80
Total LP 312
Event struct 152
Event struct with Model 552

TW Clock Cycle Statistics (MAX values in secs at 1.0000 GHz):
Initialization 0.7451
Priority Queue (enq/deq) 0.0000
AVL Tree (insert/delete) 0.0000
LZ4 (de)compression 0.0000
Buddy system 0.0000
Event Processing 0.0000
Event Cancel 0.0000
Event Abort 0.0000

GVT                                                     0.0000
Fossil Collect                                          0.0000
Primary Rollbacks                                       0.0000
Network Read                                            0.0000
Other Network                                           0.0000
Instrumentation (computation)                           0.0000
Instrumentation (write)                                 0.0000
Total Time (Note: Using Running Time above for Speedup)      0.0005

TW GVT Statistics: MPI AllReduce
GVT Interval 16
GVT Real Time Interval (cycles) 0
GVT Real Time Interval (sec) 0.00000000
Batch Size 16

Forced GVT                                                   0
Total GVT Computations                                       0
Total All Reduce Calls                                       0
Average Reduction / GVT                                   -nan

mpirun has detected an attempt to run as root. Running at root is strongly discouraged as any mistake (e.g., in defining TMPDIR) or bug can result in catastrophic damage to the OS file system, leaving your system in an unusable state.

You can override this protection by adding the --allow-run-as-root option to your cmd line. However, we reiterate our strong advice against doing so - please do so at your own risk.

FAIL tests/modelnet-prio-sched-test.sh (exit status: 1)

Any help would be much appreciated!

nmcglo commented 3 years ago

Interesting. Are you, by any chance, using Docker or some other container system?

- The odd exit status 137 in that specific dragonfly test has been seen before when someone was using containers (#198). Since it didn't seem to affect regular usage of CODES, it was put on the back burner at the time due to some tight deadlines on my end, followed by the rest of 2020's events!

- Some cursory googling about the mpirun-as-root warning suggests that this also happens with Docker containers and Open MPI. As also noted in #198, adding a user appuser to the container avoids running mpirun as root.
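For reference, exit status 137 follows the usual shell convention of 128 plus a signal number, so it corresponds to signal 9 (SIGKILL), which matches the "Killed" line in the log. A minimal sketch of that arithmetic follows; `decode_status` is a hypothetical helper, not part of CODES or ROSS.

```shell
#!/bin/sh
# Hypothetical helper: interpret a shell exit status.
# By convention, a status above 128 means the process was terminated
# by signal number (status - 128).
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal's name
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with status $status"
    fi
}

decode_status 137   # the dragonfly synthetic test
decode_status 1     # the prio-sched test
```

In containers a SIGKILL often comes from a memory limit being hit, but nothing in this thread confirms that as the cause here.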

I'll spend some time this weekend looking into a "building CODES with Docker" workflow. In the meantime, I'd suggest ignoring these failed tests. Let me know if other errors pop up during your usage of CODES.
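Until such a workflow exists, the non-root workaround from #198 might look roughly like the sketch below. The account name appuser comes from this thread; the helper `check_command` and the `chown` step are assumptions for illustration, and the setup commands are left commented out because they need root inside the container.

```shell
#!/bin/sh
# Sketch: choose how to launch `make check` depending on the current user.
# check_command is a hypothetical helper; it only prints the command line
# and does not execute anything.
check_command() {
    if [ "$(id -u)" -eq 0 ]; then
        # Root inside the container: hand the run to the unprivileged account
        echo "su appuser -c 'make check'"
    else
        echo "make check"
    fi
}

# One-time setup, run as root inside the container (assumed steps):
#   useradd -m appuser          # create the account with a home directory
#   chown -R appuser "$PWD"     # let it write test logs in the build tree

check_command
```

The mpirun message in the log also mentions --allow-run-as-root as an override, though it advises against using it.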

gonsie commented 3 years ago

I’m trying to have the ROSS CI run the CODES tests and just found a hang on modelnet-test-dragonfly-synthetic.sh. It’s most likely related, since the Travis CI tests run in a container. Is there a way to skip certain tests with the make check command?
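One possibility, sketched here as an assumption rather than an existing CODES feature: the Automake test harness reports a test that exits with status 77 as SKIP, so a test script can bail out early when it is named in a skip list. `SKIP_TESTS` and `should_skip` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical guard for an Automake-driven test script.
# SKIP_TESTS is an assumed, space-separated list of script names to skip.
should_skip() {
    name=$1
    for t in ${SKIP_TESTS:-}; do
        if [ "$t" = "$name" ]; then
            return 0
        fi
    done
    return 1
}

# At the top of a test script one would write:
#   should_skip "$(basename "$0")" && exit 77   # Automake reports 77 as SKIP

# Demonstration with an explicit name instead of $0:
SKIP_TESTS="modelnet-test-dragonfly-synthetic.sh"
if should_skip "modelnet-test-dragonfly-synthetic.sh"; then
    echo "would exit 77 (SKIP)"
else
    echo "would run the test"
fi
```

With that guard in place, `SKIP_TESTS="modelnet-test-dragonfly-synthetic.sh" make check` would report the test as skipped. Automake also allows running a chosen subset directly, for example `make check TESTS="tests/modelnet-prio-sched-test.sh"`.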