FLAMEGPU / FLAMEGPU2

FLAME GPU 2 is a GPU accelerated agent based modelling framework for CUDA C++ and Python
https://flamegpu.com
MIT License

DAG Control flow errors when abstracting function definitions to separate compilation units / methods #1210

Open ptheywood opened 3 months ago

ptheywood commented 3 months ago

Encountered a bug when working on a non-trivial model: when using the DAG API (dependsOn etc.), errors occur if the definition of agent function behaviours, and their inclusion in the control flow DAG, is moved into methods in a separate file which take a reference to the ModelDescription object. The same abstraction using layers behaves fine.

E.g. something along the lines of (untested)

main.cu

// ... 

#include "flamegpu/flamegpu.h"

FLAMEGPU_AGENT_FUNCTION(foo, flamegpu::MessageNone, flamegpu::MessageNone) {
    // ...
    return flamegpu::ALIVE;
}

FLAMEGPU_AGENT_FUNCTION(bar, flamegpu::MessageNone, flamegpu::MessageNone) {
    // ...
    return flamegpu::ALIVE;
}

int main(int argc, char* argv[]) {
    // Define the model, agent and 2 agent funcs
    flamegpu::ModelDescription model("model");
    flamegpu::AgentDescription agent = model.newAgent("agent");
    flamegpu::AgentFunctionDescription foo_desc = agent.newFunction("foo", foo);
    flamegpu::AgentFunctionDescription bar_desc = agent.newFunction("bar", bar);

    // Foo runs first
    model.addExecutionRoot(foo_desc);
    // bar depends on foo
    bar_desc.dependsOn(foo_desc);

    // Build the execution graph 
    model.generateLayers();

    // Construct the model.
    flamegpu::CUDASimulation simulation(model);

    // ...

    return 0;
}

Splitting out the agent function(s) into methods in a .cu file, with an associated header:

other.cuh

#include "flamegpu/flamegpu.h"
namespace other {
void define(flamegpu::ModelDescription& model);
}  // namespace other

other.cu

#include "other.cuh"

FLAMEGPU_AGENT_FUNCTION(foo, flamegpu::MessageNone, flamegpu::MessageNone) {
    // ...
    return flamegpu::ALIVE;
}

FLAMEGPU_AGENT_FUNCTION(bar, flamegpu::MessageNone, flamegpu::MessageNone) {
    // ...
    return flamegpu::ALIVE;
}
namespace other {
void define(flamegpu::ModelDescription& model){
    flamegpu::AgentDescription agent = model.Agent("agent");
    flamegpu::AgentFunctionDescription foo_desc = agent.newFunction("foo", foo);
    flamegpu::AgentFunctionDescription bar_desc = agent.newFunction("bar", bar);

    // Add to the DAG. Ideally this would be a separate method that gets mutable refs to the functions, but the error occurred even without that.

    // Foo runs first
    model.addExecutionRoot(foo_desc);
    // bar depends on foo
    bar_desc.dependsOn(foo_desc);
}
}  // namespace other

main.cu


#include "flamegpu/flamegpu.h"
#include "other.cuh"

int main(int argc, char* argv[]) {
    // Define the model and agent; the 2 agent funcs are added in other::define
    flamegpu::ModelDescription model("model");
    flamegpu::AgentDescription agent = model.newAgent("agent");

    other::define(model);

    // Build the execution graph 
    model.generateLayers();

    // Construct the model.
    flamegpu::CUDASimulation simulation(model);

    // ...

    return 0;
}

In the separate, larger model where this occurred, the split case produced runtime errors under Linux (CUDA 12.5, GCC 11), while the first case was fine.

The runtime error was:

terminate called after throwing an instance of 'std::bad_array_new_length'
  what():  std::bad_array_new_length

Via gdb, the backtrace pointed at DependencyNode::getDependents, called by DependencyGraph::validateSubTree(DependencyNode* node, std::vector<DependencyNode*>& functionStack)
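
For reference, std::bad_array_new_length is the exception C++ throws when an array new-expression is given an invalid runtime length, such as a negative value. The standalone sketch below is not FLAME GPU code, and the helper name throws_bad_array_new_length is made up for illustration. It only shows the kind of request that raises this exception, and does not establish why it was raised here.

```cpp
#include <cassert>
#include <new>

// Volatile sink so the allocation escapes and cannot be optimised away.
static int* volatile g_sink = nullptr;

// Hypothetical helper: returns true if allocating an int array of runtime
// length n throws std::bad_array_new_length.
bool throws_bad_array_new_length(int n) {
    try {
        int* p = new int[n];  // a negative runtime length is invalid
        g_sink = p;
        delete[] p;
        g_sink = nullptr;
        return false;
    } catch (const std::bad_array_new_length&) {
        return true;
    }
}
```

Calling throws_bad_array_new_length(-1) returns true, while a small positive length such as 4 allocates normally and returns false. An exception like this surfacing from a std::vector accessor may suggest that a nonsensical size was read, although nothing in this thread confirms that.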

DependencyNode::dependents is a std::vector<DependencyNode*> dependents; but it does not appear to get explicitly initialised anywhere, which may be the problem (or it might not, as a debug build reproduced the error).
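
One possible explanation, which this thread does not confirm, is an object-lifetime issue. The backtrace involves DependencyNode* pointers, and in the split case foo_desc and bar_desc are locals of other::define that are destroyed before model.generateLayers() runs. The toy sketch below uses hypothetical types (ToyNode, ToyGraph) rather than FLAME GPU's classes. It shows how a graph holding raw pointers to such locals would be left with dangling roots once the helper returns. Whether FLAME GPU's graph actually behaves this way is an assumption.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy stand-ins with hypothetical names; NOT FLAME GPU's classes.
struct ToyNode {
    // Ids of all ToyNode objects that are currently alive.
    static std::set<int>& liveIds() {
        static std::set<int> ids;
        return ids;
    }
    static int nextId() {
        static int counter = 0;
        return counter++;
    }

    int id;
    std::vector<ToyNode*> dependents;  // raw, non-owning pointers

    ToyNode() : id(nextId()) { liveIds().insert(id); }
    ToyNode(const ToyNode&) = delete;
    ToyNode& operator=(const ToyNode&) = delete;
    ~ToyNode() { liveIds().erase(id); }

    void dependsOn(ToyNode& parent) { parent.dependents.push_back(this); }
};

struct ToyGraph {
    struct Root {
        ToyNode* node;  // raw pointer (assumed to mirror the real graph)
        int id;         // recorded so liveness can be checked without dereferencing
    };
    std::vector<Root> roots;

    void addRoot(ToyNode& n) { roots.push_back(Root{&n, n.id}); }

    // True only if every root still refers to a live ToyNode.
    bool rootsAlive() const {
        for (const Root& r : roots) {
            if (ToyNode::liveIds().count(r.id) == 0) return false;
        }
        return true;
    }
};

// Split case: the nodes are locals of a helper and die when it returns.
void defineInHelper(ToyGraph& graph) {
    ToyNode foo, bar;
    graph.addRoot(foo);
    bar.dependsOn(foo);
}

bool splitCaseRootsAlive() {
    ToyGraph graph;
    defineInHelper(graph);
    return graph.rootsAlive();
}

// Single-function case: the nodes outlive the point where the graph is used.
bool inlineCaseRootsAlive() {
    ToyGraph graph;
    ToyNode foo, bar;
    graph.addRoot(foo);
    bar.dependsOn(foo);
    return graph.rootsAlive();
}
```

In this toy, splitCaseRootsAlive() returns false and inlineCaseRootsAlive() returns true, which matches the pattern reported above: the single-file version works and the split version fails.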

Robadob commented 3 months ago

> but it does not appear to get explicitly initialised anywhere, which may be the problem (or it might not, as a debug build reproduced the error).

The vector is default-constructed by the implicit default constructor, which is implicitly called by the subclass's constructor. I don't think that is the problem.
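
The point about implicit construction can be checked in isolation. In this minimal sketch, BaseNode and DerivedNode are hypothetical stand-ins, not FLAME GPU classes. A std::vector member with no explicit initialiser is still default-constructed to an empty vector when a subclass constructor runs.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a base class whose vector member is never
// explicitly initialised.
struct BaseNode {
    std::vector<BaseNode*> dependents;  // no initialiser, no user-written constructor
};

// Hypothetical subclass; its constructor does not mention the base class.
struct DerivedNode : BaseNode {
    int value;
    explicit DerivedNode(int v) : value(v) {}  // BaseNode() is called implicitly
};

// Returns true if a freshly constructed DerivedNode has an empty dependents vector.
bool dependentsStartEmpty() {
    DerivedNode n(1);
    return n.dependents.empty() && n.dependents.size() == 0;
}
```

dependentsStartEmpty() returns true, so the missing explicit initialisation alone would not leave the vector in an invalid state.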