codes-org / codes

The Co-Design of Exascale Storage Architectures (CODES) simulation framework builds upon the ROSS parallel discrete event simulation engine to provide high-performance simulation utilities and models for building scalable distributed systems simulations.

CODES store mapping failing (Imported #142) #142

Closed nmcglo closed 5 years ago

nmcglo commented 8 years ago

Original Issue Author: Misbah Mubarak
Original Issue ID: 142
Original Issue URL: https://xgitlab.cels.anl.gov/codes/codes/issues/142


When running in serial mode, the CODES checkpoint test fails to find the associated model-net LP and aborts with a fatal error about sending to an invalid MPI rank. IIRC, the test ran fine until we made the modelnet_dragonfly_router changes to the mapping API. To reproduce the error:

./tests/test-checkpoint --sync=1 --codes-config=../tests/conf/test-checkpoint-dfly.conf

This issue is blocking me right now, so I am labeling it as high priority.

nmcglo commented 8 years ago

Misbah Mubarak:

Status changed to closed

nmcglo commented 8 years ago

Jonathan Jenkins:

mentioned in issue #143

nmcglo commented 8 years ago

Jonathan Jenkins:

The terminal LP's router id is calculated as follows:

s->router_id=(int)s->terminal_id / (s->params->num_routers/2);

terminal_id is 1044 and num_routers (the entry of the same name in the configuration file) is 4, so router_id = 1044 / (4/2) = 522. Is there some underlying assumption about the terminal/router/group makeup that's being violated with this setup? I'm also getting the following simulation output:

Total nodes 72 routers 36 groups 9 radix 8

This doesn't match the router count in the config file. Does that have something to do with it?
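
For reference, the arithmetic above can be reproduced in isolation. This is only a sketch: `terminal_router_id` is a hypothetical helper, not a CODES function, and the name `terminals_per_router` is my reading of what `num_routers/2` stands for in the quoted expression.

```c
#include <assert.h>

/* Hypothetical helper (not part of CODES) mirroring the quoted expression:
 * integer division of the terminal id by half of the configured num_routers. */
static int terminal_router_id(unsigned int terminal_id, int num_routers)
{
    /* Assumption: num_routers / 2 is the number of terminals per router. */
    int terminals_per_router = num_routers / 2;  /* 4 / 2 = 2 in this report */

    assert(terminals_per_router > 0);            /* guard against division by zero */
    return (int)terminal_id / terminals_per_router;
}
```

With the values from this report, `terminal_router_id(1044, 4)` evaluates to 522, the router id seen in the failing run.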

nmcglo commented 8 years ago

Jonathan Jenkins:

Valgrind is clean on my end, so no memory corruption...

nmcglo commented 8 years ago

Jonathan Jenkins:

OK, a couple of things so far:

This comes from packet_send, line 1263. s->router_id, which is used as the repetition, is 522, but the configuration only has 264 routers. Is the initial calculation of s->router_id suspect?
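
A guard along these lines would catch the bad index before it is used as a repetition. It is a minimal sketch with a hypothetical name (`repetition_in_range`), not code from the dragonfly model, and it assumes repetitions are indexed from 0.

```c
#include <stdbool.h>

/* Hypothetical guard (not part of CODES): a repetition index is usable
 * only if it falls in [0, num_repetitions). */
static bool repetition_in_range(int rep_id, int num_repetitions)
{
    return rep_id >= 0 && rep_id < num_repetitions;
}
```

For the numbers above, `repetition_in_range(522, 264)` is false, which is consistent with the lookup resolving to an invalid destination.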

nmcglo commented 8 years ago

Misbah Mubarak:

The num_routers entry in the config file was inconsistent with the repetitions; that's why we were getting this issue. The dragonfly model now has safety checks that report when num_routers is inconsistent with the repetitions, so we shouldn't overlook this in the future.
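
A consistency check of this kind might look roughly like the sketch below. It is not the actual check in the dragonfly model. The layout formula is an assumption inferred only from the output quoted earlier in this thread (num_routers = 4 printed 36 routers in 9 groups), and both function names are hypothetical.

```c
#include <stdbool.h>

/* Assumed layout, inferred from the printed output for num_routers = 4:
 * each group holds num_routers routers, and the number of groups is
 * num_routers * (num_routers / 2) + 1. */
static int expected_total_routers(int num_routers)
{
    int num_groups = num_routers * (num_routers / 2) + 1;
    return num_routers * num_groups;
}

/* Hypothetical check (not part of CODES): does the router repetition count
 * in the config agree with what num_routers implies? */
static bool num_routers_matches_repetitions(int num_routers, int router_repetitions)
{
    return expected_total_routers(num_routers) == router_repetitions;
}
```

Under this assumed layout, num_routers = 4 implies 36 routers, so it does not match a config with 264 router repetitions, while num_routers = 8 would imply 264. The thread does not state which value the config was changed to.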