pramodk opened 3 years ago
Copying @georgemitenkov's comments from https://github.com/BlueBrain/nmodl/pull/533#issuecomment-791897403 :
---- start ----
@pramodk Regarding testing, I had one idea:

1. Generate the `llvm::Module` using the pipeline.
2. Add a new file `test_llvm_kernels.cpp` or something like that. In that file, we create the Instance struct artificially, and write cpp wrappers to print contents before/after the kernel execution.
3. Link the wrapper `llvm::Module` with our `llvm::Module` (for my GSoC I was using a similar strategy actually, so I have an idea of how this is done with the LLVM API).
4. Simply feed this into `llvm_nmodl_runner` and see what the outputs are :)

This is not an actual IR check but suits integration test purposes.
For example, something like this:

```cpp
#include <stdio.h>

// ================= LLVM kernel generated from the pipeline ======================== //
struct Bar {
    int* __restrict__ indices;
    double* __restrict__ voltage;
    int num_nodes;
};

void kernel(Bar* b) {
    double v = -1.0;
    b->voltage[b->indices[0]] = v * b->voltage[b->indices[0]];
    b->voltage[b->indices[1]] = v * b->voltage[b->indices[1]];
}

// ================= Helpers that would come from wrapper class ==================== //
void print_struct(Bar* b) {
    printf("num nodes: %d\n", b->num_nodes);
    printf("indices: ");
    for (int i = 0; i < b->num_nodes; ++i) {
        printf("%d", b->indices[i]);
        if (i < b->num_nodes - 1) printf(", "); else printf("\n");
    }
    printf("voltage: ");
    for (int i = 0; i < b->num_nodes; ++i) {
        printf("%.2f", b->voltage[i]);
        if (i < b->num_nodes - 1) printf(", "); else printf("\n");
    }
}

int main() {
    Bar b;
    b.num_nodes = 2;
    int indices[] = {0, 1};
    double voltage[] = {5.0, 10.0};
    b.indices = indices;
    b.voltage = voltage;
    printf(" == Before == \n");
    print_struct(&b);
    kernel(&b);
    printf(" == After == \n");
    print_struct(&b);
    return 0;
}
```
I am currently using this to verify the vectorised code.
---- end ----
> Add a new file `test_llvm_kernels.cpp` or something like that. In that file, we create Instance struct artificially, and write cpp wrappers to print contents before/after the kernel execution.
I was thinking in a similar direction! Before writing details about what I was thinking, let me clarify a few questions regarding your proposal:

- `LLVM kernel generated from the pipeline` looks good. That is generated in LLVM IR by the codegen visitor pass.
- `Helpers that would come from wrapper class`, which will be `test_llvm_kernels.cpp`:
  - how do we create `Bar` in `main()`, given that we don't know the full type / definition of `Bar`?
  - the `Bar` type is different for different MOD files (e.g. the number of member variables and their types will be different).

Considering the above questions, I was thinking of the following:

- `LLVM kernel generated from the pipeline`: so compute kernels are ready to call.
- We create the `INSTANCE_STRUCT` with the "correct" number of variables, but we don't know the `INSTANCE_STRUCT` type at compile time (assuming different mod files). As shown above, one just has to take care of the alignment/padding aspect, i.e. we have to pin pointers or non-pointer variables at particular offsets.
- e.g. for a mod file `foo`, we have to create an `INSTANCE_STRUCT_FOO` type which has X number of `double*`, Y number of `int*`, and Z number of `double` types; this gives `sizeof(INSTANCE_STRUCT_FOO)`.
- Based on `num_nodes`, we allocate separate vectors for each member in `INSTANCE_STRUCT_FOO` and set up pointers at the appropriate offsets in the memory block allocated for `INSTANCE_STRUCT_FOO`.
- We pass `INSTANCE_STRUCT_FOO` to `llvm_nmodl_runner` and it can call the LLVM-generated kernels.
- For small `num_nodes`, we can set up the vectors manually so that deterministic results can be compared by hand.
- For large `num_nodes`, we can measure performance and correctness for different backends. We can duplicate the `INSTANCE_STRUCT_FOO` memory block and pass it to non-simd, simd, or GPU kernels and compare them with each other.
- `INSTANCE_STRUCT_FOO` is initialized with values as you described in `test_llvm_kernels.cpp`.
Does this make sense?
The reason I am thinking of the above approach is that 1) we don't know the type of `INSTANCE_STRUCT_FOO` at compile time, and 2) this approach could be used for non-LLVM backends as well.
Implementing the above wouldn't be complicated: allocating some memory block and setting up pointers at particular offsets considering alignment. But if you think it would be even easier with the LLVM API, then feel free to propose!
cc: @iomaganaris
Edit: maybe I can provide pseudo code later today and that might help to explain my text.
Here is very abstract code for above logic:
```cpp
SCENARIO("compute kernel test", "[llvm][runner]") {
    GIVEN("mod file ") {
        std::string nmodl_text = R"(
            NEURON {
                SUFFIX hh
                USEION na READ ena WRITE ina
                USEION k READ ek WRITE ik
                NONSPECIFIC_CURRENT il
                RANGE gnabar, gkbar, gl, el, gna, gk
                RANGE minf, hinf, ninf, mtau, htau, ntau
                THREADSAFE : assigned GLOBALs will be per thread
            }
            ...
            DERIVATIVE states {
                m' = (minf-m)/mtau
                h' = (hinf-h)/htau
                n' = (ninf-n)/ntau
            }
        )";
        NmodlDriver driver;
        const auto& ast = driver.parse_string(nmodl_text);
        ...
        codegen::CodegenLLVMHelperVisitor v(.....);
        v.visit_program(*ast);
        ...
        // we now retrieve information about how many double*, int*, double and int are in the structure
        auto& some_instance_struct_info = v.get_some_useful_instance_struct_info();
        ...
        // here we allocate instance struct objects with the same seed, hence data1, data2 and data3 are the same
        // `allocate_and_initialize_instruct_struct` will allocate the base struct and will set up pointers to the actual data
        // note the data is just `void*` which can be type cast to the actual type inside the JIT runner
        void* data1 = allocate_and_initialize_instruct_struct(some_instance_struct_info, SEED1, NUM_NODE_COUNT);
        void* data2 = allocate_and_initialize_instruct_struct(some_instance_struct_info, SEED1, NUM_NODE_COUNT);
        void* data3 = allocate_and_initialize_instruct_struct(some_instance_struct_info, SEED1, NUM_NODE_COUNT);
        ...
        // based on the backends, we can now run kernels with different backends / vector widths
        Runner your_runner1(m, data1, vector_width=1);
        Runner your_runner2(m, data2, vector_width=4);
        Runner your_runner3(m, data3, gpu=true);
        ...
        // compare the results or print them if required
        compare_data_with_some_condition(data1, data2);
        compare_data_with_some_condition(data1, data3);
        ...
        // cleanup
        deallocate_instruct_struct(data1);
        deallocate_instruct_struct(data2);
        deallocate_instruct_struct(data3);
    }
}
```
Right, I see! We indeed do not know what the InstanceStruct would be until we parse the AST. This seems a logical approach to me, and I think it is good because of the uniformity over all backends. I will look more into this as well.
Also:
- When using `Runner your_runner1(m, data1, vector_width=1);`, do you mean that the backend is the backend of the LLVM pipeline?
- What would be the logic of `allocate_and_initialize_instruct_struct`? As I understand, we want `num_pointers * 8 + sizeof(whatever is left)` allocated. But how would the actual test data be provided?
After some more thinking, I had another idea that helps in my opinion (not really specific to anything:) ) I think that I also better understand this approach, so we can have a sync later on the weekend/Monday to discuss more.
We define a C++ class for all inputs (field types here are illustrative; the original sketch left them unspecified):

```cpp
struct InstanceInfo {
    void* basePtr;       // start of the raw instance memory block
    size_t* offsetsPtr;  // byte offset of each member
    size_t* sizesPtr;    // size of each member
    size_t num_elems;    // number of members
};
```
Then `InstanceInfo` can be transformed to an LLVM struct! We can have something like:

```cpp
// We will call these functions in our LLVM wrapper file
extern "C" void _interface_init_struct(InstanceInfo* info) {
    // C code to set up the fields
}

extern "C" void _interface_print_struct(InstanceInfo* info) {
    // C code to print the struct
}
```
The only thing left is to define a conversion to our struct. We can define:

```cpp
// here info contains the data; the type is taken from the AST or the LLVM-generated kernel code
llvm::Value* infoToStruct(InstanceInfo* info, llvm::Type* instanceType) {
    // code that produces instructions to transform InstanceInfo into the struct we need:
    // basically iterate over `num_elems` and get the members with the size/offset calculation
}
```
Overall, we generate the LLVM module with the following steps:

1. Declare the helper functions (`extern C`, etc.)
2. Call `infoToStruct()`
3. `call void @kernel(%our_struct_type *s)`
4. `call _interface_print_struct`
> When using `Runner your_runner1(m, data1, vector_width=1);` do you mean that the backend is the backend of the LLVM pipeline?

Yes, I was thinking of LLVM backends with AVX2, AVX512 or ARM NEON. (The same data structure could be used for testing non-LLVM based backends, but that would need additional work for the runners.)
> What would be the logic of `allocate_and_initialize_instruct_struct`? As I understand, we want `num_pointers * 8 + sizeof(whatever is left)` allocated.
That's correct. Just a note: one needs to be a bit careful about the size of the struct due to padding / alignment. We have a simple struct with double, int, double and int members, so it's not that complicated, but something to keep in mind.
> But how would the actual test data be provided?

```cpp
// pseudo code: set up member data at hand-computed offsets
char* instance = allocate(/* whatever size required for pointers + extra data + padding/alignment */);
// 1st member data
*(double**)(instance + 0) = allocate(sizeof(double) * node_count);
// 2nd member data (8 considering pointer size)
*(double**)(instance + 8) = allocate(sizeof(double) * node_count);
// ... similar offset calculation for the rest of the data members and their data allocation
// double and int variables are directly stored as values
*(double*)(instance + X) = 0.025;  // dt
```
> After some more thinking, I had another idea that helps in my opinion (not really specific to anything:) ) I think that I also better understand this approach, so we can have a sync later on the weekend/Monday to discuss more.
> The only thing that is left, is to define a conversion to our struct. We can define: // code that produces instructions to transform Info to the struct we need: basically // iterate over `num_elems` and get the members with the size/offset calculation.
Yeah, I think a discussion would be helpful. I was thinking about padding/alignment aspects precisely to avoid this transformation, i.e. if you create a memory block with the right pointers, then you can directly typecast the pointer to `instanceType`. Maybe a discussion would clarify things!
> `char *instance = allocate(whatever size required for pointers + extra data + padding/alignment );`
> `*(instance + 0) = allocate (sizeof(double) * node_count);`
> `*(instance + 8) = allocate (sizeof(double) * node_count);`
> `... similar offset calculation for rest of the data members and their data allocation`
> `*(instance + X) = 0.025 // dt`
I see, thank you for the example!
Assume a sample mod file like this:
The generated struct for holding all the data looks like this:
And the generated compute function looks like:
This compute kernel is generated in-memory and translated to LLVM IR. Our goal is to:
What needs to happen?

- Allocate the `INSTANCE_STRUCT` instance
- Call `nrn_state_hh` with the `INSTANCE_STRUCT` parameter

As kernels and `INSTANCE_STRUCT` are generated dynamically, how to do such testing?