pramodk opened 3 years ago
Copying @georgemitenkov's comments from https://github.com/BlueBrain/nmodl/pull/533#issuecomment-791897403 :
---- start ----
@pramodk Regarding testing, I had one idea:

1. Generate the `llvm::Module` using the pipeline.
2. Add a new file `test_llvm_kernels.cpp` or something like that. In that file, we create the Instance struct artificially, and write cpp wrappers to print contents before/after the kernel execution.
3. Link the wrapper `llvm::Module` with our `llvm::Module` (for my GSoC I was using a similar strategy actually, so I have an idea of how this is done with the LLVM API).
4. Simply feed this into `llvm_nmodl_runner` and see what the outputs are :)

This is not an actual IR check but suits integration test purposes.
For example, something like this:

```cpp
#include <stdio.h>

// ================= LLVM kernel generated from the pipeline ======================== //
struct Bar {
    int* __restrict__ indices;
    double* __restrict__ voltage;
    int num_nodes;
};

void kernel(Bar* b) {
    double v = -1.0;
    b->voltage[b->indices[0]] = v * b->voltage[b->indices[0]];
    b->voltage[b->indices[1]] = v * b->voltage[b->indices[1]];
}

// ================= Helpers that would come from wrapper class ==================== //
void print_struct(Bar* b) {
    printf("num nodes: %d\n", b->num_nodes);
    printf("indices: ");
    for (int i = 0; i < b->num_nodes; ++i) {
        printf("%d", b->indices[i]);
        if (i < b->num_nodes - 1) printf(", "); else printf("\n");
    }
    printf("voltage: ");
    for (int i = 0; i < b->num_nodes; ++i) {
        printf("%.2f", b->voltage[i]);
        if (i < b->num_nodes - 1) printf(", "); else printf("\n");
    }
}

int main() {
    Bar b;
    b.num_nodes = 2;
    int indices[] = {0, 1};
    double voltage[] = {5.0, 10.0};
    b.indices = indices;
    b.voltage = voltage;
    printf(" == Before == \n");
    print_struct(&b);
    kernel(&b);
    printf(" == After == \n");
    print_struct(&b);
    return 0;
}
```
I am currently using this to verify the vectorised code.
---- end ----
> Add a new file `test_llvm_kernels.cpp` or something like that. In that file, we create Instance struct artificially, and write cpp wrappers to print contents before/after the kernel execution.
I was thinking in a similar direction! Before writing details about what I was thinking, let me clarify a few questions regarding your proposal:

- `LLVM kernel generated from the pipeline` looks good. That is generated in LLVM IR by the codegen visitor pass.
- `Helpers that would come from wrapper class`, which will be `test_llvm_kernels.cpp`:
  - how do we create `Bar` in `main()`, given that we don't know the full type / definition of `Bar`?
  - the `Bar` type is different for different MOD files (e.g. the number of member variables and their types will be different).

Considering the above questions, I was thinking of the following:

- `LLVM kernel generated from the pipeline`: so compute kernels are ready to call.
- We create the `INSTANCE_STRUCT` with the "correct" number of variables, but we don't know the `INSTANCE_STRUCT` type at compile time (assuming different mod files). As shown above, one just has to take care of the alignment/padding aspect, i.e. we have to pin pointers or non-pointer variables at particular offsets.
- e.g. for a mod file `foo`, we have to create an `INSTANCE_STRUCT_FOO` type which has X number of `double*`, Y number of `int*`, and Z number of `double` types; this gives `sizeof(INSTANCE_STRUCT_FOO)`.
- Based on `num_nodes`, we allocate separate vectors for each member in `INSTANCE_STRUCT_FOO` and set up pointers at the appropriate offsets in the memory block allocated for `INSTANCE_STRUCT_FOO`.
- We pass `INSTANCE_STRUCT_FOO` to `llvm_nmodl_runner` and it can call the LLVM-generated kernels.
- For small `num_nodes`, we can set up the vectors manually so that deterministic results can be compared by hand.
- For large `num_nodes`, we can measure performance and correctness for different backends. We can duplicate the `INSTANCE_STRUCT_FOO` memory block and pass it to non-simd, simd, or GPU kernels and compare them with each other.
- `INSTANCE_STRUCT_FOO` is initialized with values as you described in `test_llvm_kernels.cpp`.
Does this make sense?
The reason I am thinking of the above approach is that 1) we don't know the type of `INSTANCE_STRUCT_FOO` at compile time, and 2) this approach could be used for non-LLVM backends as well.
Implementing the above wouldn't be complicated: allocating some memory block and setting up pointers at particular offsets considering alignment. But if you think it would be even easier with the LLVM API, then feel free to propose!
cc: @iomaganaris
Edit: maybe I can provide pseudo code later today and that might help to explain my text.
Here is very abstract code for above logic:
```cpp
SCENARIO("compute kernel test", "[llvm][runner]") {
    GIVEN("mod file ") {
        std::string nmodl_text = R"(
            NEURON {
                SUFFIX hh
                USEION na READ ena WRITE ina
                USEION k READ ek WRITE ik
                NONSPECIFIC_CURRENT il
                RANGE gnabar, gkbar, gl, el, gna, gk
                RANGE minf, hinf, ninf, mtau, htau, ntau
                THREADSAFE : assigned GLOBALs will be per thread
            }
            ...
            DERIVATIVE states {
                m' = (minf-m)/mtau
                h' = (hinf-h)/htau
                n' = (ninf-n)/ntau
            }
        )";
        NmodlDriver driver;
        const auto& ast = driver.parse_string(nmodl_text);
        ...
        codegen::CodegenLLVMHelperVisitor v(.....);
        v.visit_program(*ast);
        ...
        // we now retrieve information about how many double*, int*, double and int are in the structure
        auto& some_instance_struct_info = v.get_some_useful_instance_struct_info();
        ...
        // here we allocate instance struct objects with the same seed, hence data1, data2 and data3 are the same
        // `allocate_and_initialize_instruct_struct` will allocate the base struct and will set up pointers to the actual data
        // note the data is just `void*` which can be type cast to the actual type inside the JIT runner
        void* data1 = allocate_and_initialize_instruct_struct(some_instance_struct_info, SEED1, NUM_NODE_COUNT);
        void* data2 = allocate_and_initialize_instruct_struct(some_instance_struct_info, SEED1, NUM_NODE_COUNT);
        void* data3 = allocate_and_initialize_instruct_struct(some_instance_struct_info, SEED1, NUM_NODE_COUNT);
        ...
        // based on the backends, we can now run kernels with different backends / vector widths
        Runner your_runner1(m, data1, vector_width=1);
        Runner your_runner2(m, data2, vector_width=4);
        Runner your_runner3(m, data3, gpu=true);
        ...
        // compare the results or print them if required
        compare_data_with_some_condition(data1, data2);
        compare_data_with_some_condition(data1, data3);
        ...
        // cleanup
        deallocate_instruct_struct(data1);
        deallocate_instruct_struct(data2);
        deallocate_instruct_struct(data3);
    }
}
```
Right, I see! We indeed do not know what the InstanceStruct would be until we parse the AST. This seems a logical approach to me, and I think it is good because of the uniformity over all backends. I will look more into this as well.
Also:
- When using `Runner your_runner1(m, data1, vector_width=1);`, do you mean that the backend is the backend of the LLVM pipeline?
- What would be the logic of `allocate_and_initialize_instruct_struct`? As I understand, we want `num_pointers * 8 + sizeof(whatever is left)` allocated. But how would the actual test data be provided?
After some more thinking, I had another idea that helps in my opinion (not really specific to anything:) ) I think that I also better understand this approach, so we can have a sync later on the weekend/Monday to discuss more.
We define a C++ class for all inputs (field types here are illustrative; the original sketch left them unspecified):

```cpp
struct InstanceInfo {
    void* basePtr;       // start of the raw instance memory block
    size_t* offsetsPtr;  // byte offset of each member
    size_t* sizesPtr;    // size of each member
    size_t num_elems;    // number of members
};
```
Then `InstanceInfo` can be transformed to an LLVM struct! We can have something like:

```cpp
// We will call these functions in our LLVM wrapper file
extern "C" void _interface_init_struct(InstanceInfo* info) {
    // C code to set up the fields
}

extern "C" void _interface_print_struct(InstanceInfo* info) {
    // C code to print the struct
}
```
The only thing left is to define a conversion to our struct. We can define:

```cpp
// here info contains the data; the type is taken from the AST or the LLVM-generated kernel code
llvm::Value* infoToStruct(InstanceInfo* info, llvm::Type* instanceType) {
    // code that produces instructions to transform InstanceInfo into the struct we need:
    // basically iterate over `num_elems` and get the members with the size/offset calculation
}
```
Overall, we generate the LLVM module with the following steps:

1. Declare the helper functions (`extern C`, etc.)
2. Call `infoToStruct()`
3. `call void @kernel(%our_struct_type *s)`
4. `call _interface_print_struct`
> When using `Runner your_runner1(m, data1, vector_width=1);` do you mean that the backend is the backend of the LLVM pipeline?

Yes, I was thinking of LLVM backends with AVX2, AVX512 or ARM NEON. (The same data structure could be used for testing non-LLVM based backends, but that would need additional work for the runners.)
> What would be the logic of `allocate_and_initialize_instruct_struct`? As I understand, we want `num_pointers * 8 + sizeof(whatever is left)` allocated.
That's correct. Just a note: one needs to be a bit careful about the size of the struct due to padding / alignment. We have a simple struct with double, int, double and int members, so it's not that complicated, but something to keep in mind.
> But how would the actual test data be provided?

```cpp
// pseudo code: set up member data at hand-computed offsets
char* instance = allocate(/* whatever size required for pointers + extra data + padding/alignment */);
// 1st member data
*(double**)(instance + 0) = allocate(sizeof(double) * node_count);
// 2nd member data (8 considering pointer size)
*(double**)(instance + 8) = allocate(sizeof(double) * node_count);
// ... similar offset calculation for the rest of the data members and their data allocation
// double and int variables are directly stored as values
*(double*)(instance + X) = 0.025;  // dt
```
> After some more thinking, I had another idea that helps in my opinion (not really specific to anything:) ) I think that I also better understand this approach, so we can have a sync later on the weekend/Monday to discuss more.
> The only thing that is left, is to define a conversion to our struct. We can define: // code that produces instructions to transform Info to the struct we need: basically // iterate over `num_elems` and get the members with the size/offset calculation.
Yeah, I think a discussion would be helpful. I was thinking about padding/alignment aspects precisely to avoid this transformation, i.e. if you create a memory block with the right pointers, then you can directly typecast the pointer to `instanceType`. Maybe a discussion would clarify things!
> `char *instance = allocate(whatever size required for pointers + extra data + padding/alignment );`
> `*(instance + 0) = allocate (sizeof(double) * node_count);`
> `*(instance + 8) = allocate (sizeof(double) * node_count);`
> `... similar offset calculation for rest of the data members and their data allocation`
> `*(instance + X) = 0.025 // dt`
I see, thank you for the example!
Assume a sample mod file like this:
The generated struct for holding all the data looks like this:
And the generated compute function looks like:
This compute kernel is generated in-memory and translated to LLVM IR. Our goal is to:
What needs to happen?

- Allocate the `INSTANCE_STRUCT` instance
- Call `nrn_state_hh` with the `INSTANCE_STRUCT` parameter

As kernels and `INSTANCE_STRUCT` are generated dynamically, how to do such testing?