Run Regent on a distributed system

crl123 commented 4 years ago

Good afternoon,

I am running Regent on my cluster of 9 nodes with the following parameters:

mpirun -np 9 -ppn 1 ./TaskBench/task-bench/regent/main.shard14 -steps 10 -type fft -kernel compute_bound -iter 1000000

It gives me the following error:

main.shard14: core.cc:588: void TaskGraph::execute_point(long int, long int, char*, size_t, const char*, const size_t, size_t, char*, size_t) const: Assertion `input[i].second == dep' failed.

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)

And sometimes the following error:

main.shard14: core.cc:565: void TaskGraph::execute_point(long int, long int, char*, size_t, const char*, const size_t, size_t, char*, size_t) const: Assertion `offset <= point && point < offset+width' failed.

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)

I have the same problem when I use the tree type, but not when I use the stencil_1d type.

I compile Regent as follows:

DEFAULT_FEATURES=0 USE_REGENT=1 ./get_deps.sh
export CXX=mpicxx
export CC=mpicc
./build_all.sh

Thank you in advance for your help,

elliottslaughter commented 4 years ago

Hi @crl123,

This means that Task Bench is computing the wrong result. I'm a little confused, because I thought the Regent implementation was fully debugged.

I'm not expecting this to make a difference, but can you confirm what Task Bench branch/tag you're on?

I'll try to confirm on my end as well.

crl123 commented 4 years ago

I'm on the 'origin/master' branch. I updated the repository on my local machine this Sunday.

elliottslaughter commented 4 years ago

Ok, I'm a bit swamped with things going on this week, but I'll try to find time to verify the Regent implementation on my own machine.

elliottslaughter commented 2 years ago

Sorry for taking so long to get back to this.

Looking back at your configuration here, I don't see any settings for the network. Typically you'd use something like:

export USE_GASNET=1
export CONDUIT=aries

Otherwise, what you're doing is running N copies of the single-node program, which is probably why this is misbehaving.
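For reference, here is a rough sketch of how the network settings above might combine with the build commands from the original report. It rests on several assumptions: that the variables need to be exported before ./get_deps.sh runs so the dependency setup can see them, that the script is run from the top of the task-bench checkout, and that aries is only a placeholder conduit (it targets Cray systems; the right value depends on your cluster's interconnect).

```shell
# Sketch only: rebuild with a network layer so the ranks cooperate
# instead of each running its own copy of the single-node program.
export USE_GASNET=1
# Assumption: aries is a placeholder; override CONDUIT for your interconnect.
export CONDUIT="${CONDUIT:-aries}"

# Build steps as given in the original report.
if [ -x ./get_deps.sh ] && [ -x ./build_all.sh ]; then
    DEFAULT_FEATURES=0 USE_REGENT=1 ./get_deps.sh
    export CXX=mpicxx
    export CC=mpicc
    ./build_all.sh
else
    # Not in a task-bench checkout: do nothing beyond setting the variables.
    echo "get_deps.sh / build_all.sh not found; run this from the task-bench directory" >&2
fi
```

After a rebuild like this, the mpirun command from the original report should presumably launch one cooperating process per node rather than nine independent runs.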

ysfess22 commented 1 year ago

Hi @elliottslaughter, I have a further question about multi-node benchmarks. Using GASNet the way you explained on a two-node cluster (udp conduit) creates double the number of tasks in the graph: half of the tasks are run by node 1 and the other half by node 2. Is that the expected behaviour? Or is there a way to have the tasks split between the nodes? E.g., given a 10x10 stencil graph, the 100 tasks would be split between the two nodes.

elliottslaughter commented 1 year ago

@ysfess22 Please submit this as a new issue unless it's specifically related to the original posting.

The answer will depend on how you have configured your system, and I will require more information, which will clog this thread if it's not specifically related.

StanfordLegion / task-bench

Run Regent on a distributed system #69