HigherOrderCO / HVM

A massively parallel, optimal functional runtime in Rust
https://higherorderco.com
Apache License 2.0

dylib open/call/close io functions, initial ffi api #394

Closed (enricozb closed this 4 months ago)

enricozb commented 5 months ago

Overview

Adds DL_OPEN, DL_CALL, and DL_CLOSE IO functions.

C Runtime

Example usage: a user first defines the C functions they want to invoke through HVM at runtime:

#include "src/hvm.h"
#include <stdio.h>

Port add_two_nums(Net* net, Book* book, Port argm) {
  Tup tup = readback_tup(net, book, argm, 2);
  u32 num1 = get_u24(get_val(tup.elem_buf[0]));
  u32 num2 = get_u24(get_val(tup.elem_buf[1]));

  printf("adding numbers, %u and %u\n", num1, num2);

  return new_port(NUM, new_u24(num1 + num2));
}

Port print_num(Net* net, Book* book, Port argm) {
  u32 num = get_u24(get_val(argm));

  printf("printing number %u\n", num);

  return new_port(ERA, 0);
}

Functions must have the signature Port (my_func)(Net*, Book*, Port).
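
To make the required shape concrete, here is a minimal sketch of a function-pointer type for that signature and a dispatcher that calls through it. The Port, Net, and Book definitions below are stand-ins for illustration only (the real ones live in src/hvm.h), and FfiFn, add_one, and call_ffi are hypothetical names that are not part of the HVM API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Stand-in types for illustration only; the real definitions are in src/hvm.h.
typedef uint32_t Port;
typedef struct Net Net;    // opaque in this sketch
typedef struct Book Book;  // opaque in this sketch

// Pointer type matching the signature Port (my_func)(Net*, Book*, Port).
typedef Port (*FfiFn)(Net*, Book*, Port);

// Hypothetical example function: treats the stand-in Port as a plain
// integer and returns it plus one.
Port add_one(Net* net, Book* book, Port argm) {
  (void)net;   // unused in this sketch
  (void)book;  // unused in this sketch
  return argm + 1;
}

// Hypothetical dispatcher: can call any function of the required shape.
Port call_ffi(FfiFn fn, Net* net, Book* book, Port argm) {
  assert(fn != NULL);
  return fn(net, book, argm);
}
```

Because every FFI function shares one signature, a caller only needs a single pointer type to invoke any symbol it looks up by name, for example call_ffi(add_one, NULL, NULL, 41).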

This file must be compiled as a shared library (.so), for example with gcc -shared my-funcs.c -o my-funcs.so. The library can then be loaded and its symbols accessed. In Bend this looks like:

def main():
  x = 123
  y = 456

  with IO_T:
    dl <- call_io("DL_OPEN", ("./my-funcs.so", 0))
    res <- call_io("DL_CALL", (dl, "add_two_nums", x, y))
    * <- call_io("DL_CALL", (dl, "print_num", res))
    * <- call_io("DL_CLOSE", dl)

    return 42
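
The open, call, close sequence above follows the same pattern as the POSIX dlopen, dlsym, and dlclose functions. The following is a minimal sketch of that pattern in plain C, assuming a POSIX system. It is not this PR's implementation: the helpers dl_call_int and open_call_close and the int (*)(int) function type are hypothetical and chosen only to keep the example free of HVM types.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

// Hypothetical function type used only in this sketch.
typedef int (*IntFn)(int);

// Look up `symbol` in an already opened library and call it with `arg`.
// Returns 0 and stores the result in *out, or -1 if the symbol is missing.
int dl_call_int(void* lib, const char* symbol, int arg, int* out) {
  assert(out != NULL);
  void* sym = dlsym(lib, symbol);
  if (sym == NULL) return -1;
  IntFn fn;
  *(void**)(&fn) = sym;  // convert the object pointer to a function pointer
  *out = fn(arg);
  return 0;
}

// Open a library, call one symbol, and close the library again.
// A NULL path asks dlopen for a handle to the running program itself.
int open_call_close(const char* path, const char* symbol, int arg, int* out) {
  void* lib = dlopen(path, RTLD_NOW);
  if (lib == NULL) return -1;
  int status = dl_call_int(lib, symbol, arg, out);
  dlclose(lib);
  return status;
}
```

Passing NULL as the path makes the sketch usable without building a separate .so: open_call_close(NULL, "abs", -5, &out) resolves abs from the C library that is already loaded. On older glibc versions, linking may additionally require -ldl.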

C Compiled Mode

When compiling a generated HVM C file, you must pass the -rdynamic flag so that the shared library can access symbols from the main binary. For example:

cargo run -r -- gen-c testing-ffi.hvm > testing-ffi.c
gcc -rdynamic -lm testing-ffi.c -o testing-ffi

CUDA Runtime

The FFI is a little different for CUDA. The C file above would look like this instead:

#include "src/hvm.cuh"
#include <stdio.h>

Port add_two_nums(GNet* gnet, Port argm) {
  Tup tup = gnet_readback_tup(gnet, argm, 2);
  u32 num1 = get_u24(get_val(tup.elem_buf[0]));
  u32 num2 = get_u24(get_val(tup.elem_buf[1]));

  printf("adding numbers, %u and %u\n", num1, num2);

  return new_port(NUM, new_u24(num1 + num2));
}

Port print_num(GNet* gnet, Port argm) {
  u32 num = get_u24(get_val(argm));

  printf("printing number %u\n", num);

  return new_port(ERA, 0);
}

Functions must have the signature Port (my_func)(GNet*, Port).

CUDA Compiled Mode

When compiling a generated HVM CUDA file, you must pass the -rdynamic flag to the host compiler so that the shared library can access symbols from the main binary. For example:

cargo run -r -- gen-cu testing-ffi.hvm > testing-ffi.cu
nvcc --compiler-options=-rdynamic testing-ffi.cu -o testing-ffi

HVM FFI API

Not everything is exposed to users at the moment. For what is exposed, see hvm.h (C runtime) or hvm.cuh (CUDA runtime).

HigherOrderBot commented 5 months ago

Perf run for 6fc6cd9:

compiled
========

file            runtime         main            (local)       
==============================================================
sort_bitonic    c                        3.47s           5.40s
                cuda                     0.23s           0.24s
--------------------------------------------------------------
sum_rec         c                        1.46s           1.44s
                cuda                     0.15s           0.13s
--------------------------------------------------------------
sum_tree        c                        0.13s           0.12s
                cuda                     0.10s           0.10s
--------------------------------------------------------------
tuples          c                        3.99s           3.32s
                cuda                   timeout         timeout
--------------------------------------------------------------

interpreted
===========

file            runtime         main            (local)       
==============================================================
sort_bitonic    c                        3.54s           3.54s
                cuda                     0.25s           0.24s
                rust                   timeout         timeout
--------------------------------------------------------------
sum_rec         c                        2.50s           3.54s
                cuda                     0.15s           0.14s
                rust                    13.96s          13.51s
--------------------------------------------------------------
sum_tree        c                        0.19s           0.43s
                cuda                     0.09s           0.09s
                rust                     0.88s           0.88s
--------------------------------------------------------------
tuples          c                        5.41s           3.63s
                cuda                   timeout         timeout
                rust                     3.79s           3.79s
--------------------------------------------------------------

HigherOrderBot commented 5 months ago

Perf run for 05f1cc7:

compiled
========

file            runtime         main            (local)       
==============================================================
sort_bitonic    c                        3.70s           4.21s
                cuda                     0.24s           0.24s
--------------------------------------------------------------
sum_rec         c                        1.38s           1.44s
                cuda                     0.15s           0.15s
--------------------------------------------------------------
sum_tree        c                        0.12s           0.12s
                cuda                     0.09s           0.09s
--------------------------------------------------------------
tuples          c                        2.88s           4.01s
                cuda                   timeout         timeout
--------------------------------------------------------------

interpreted
===========

file            runtime         main            (local)       
==============================================================
sort_bitonic    c                        4.08s           5.33s
                cuda                     0.24s           0.23s
                rust                   timeout         timeout
--------------------------------------------------------------
sum_rec         c                        1.73s           1.72s
                cuda                     0.14s           0.13s
                rust                    13.51s          13.62s
--------------------------------------------------------------
sum_tree        c                        0.31s           0.20s
                cuda                     0.09s           0.09s
                rust                     0.88s           0.88s
--------------------------------------------------------------
tuples          c                        3.53s           2.09s
                cuda                   timeout         timeout
                rust                     3.79s           3.81s
--------------------------------------------------------------

HigherOrderBot commented 5 months ago

Perf run for 56a1dcb:

compiled
========

file            runtime         main            (local)       
==============================================================
sort_bitonic    c                        3.24s           3.70s
                cuda                     0.24s           0.24s
--------------------------------------------------------------
sum_rec         c                        1.42s           1.38s
                cuda                     0.14s           0.14s
--------------------------------------------------------------
sum_tree        c                        0.11s           0.12s
                cuda                     0.10s           0.10s
--------------------------------------------------------------
tuples          c                        2.95s           2.90s
                cuda                   timeout         timeout
--------------------------------------------------------------

interpreted
===========

file            runtime         main            (local)       
==============================================================
sort_bitonic    c                        5.74s           3.47s
                cuda                     0.24s           0.24s
                rust                   timeout         timeout
--------------------------------------------------------------
sum_rec         c                        1.68s           1.76s
                cuda                     0.14s           0.13s
                rust                    13.34s          13.57s
--------------------------------------------------------------
sum_tree        c                        0.36s           0.34s
                cuda                     0.09s           0.09s
                rust                     0.87s           0.88s
--------------------------------------------------------------
tuples          c                        4.99s           5.37s
                cuda                   timeout         timeout
                rust                     3.79s           3.80s
--------------------------------------------------------------

HigherOrderBot commented 4 months ago

Perf run for 0fc5635:

compiled
========

file            runtime         main            (local)       
==============================================================
sort_bitonic    c                        5.53s           4.28s
                cuda                     0.24s           0.23s
--------------------------------------------------------------
sum_rec         c                        1.42s           1.42s
                cuda                     0.14s           0.14s
--------------------------------------------------------------
sum_tree        c                        0.12s           0.13s
                cuda                     0.11s           0.10s
--------------------------------------------------------------
tuples          c                        3.72s           4.16s
                cuda                   timeout         timeout
--------------------------------------------------------------

interpreted
===========

file            runtime         main            (local)       
==============================================================
sort_bitonic    c                        6.48s           4.42s
                cuda                     0.24s           0.24s
                rust                   timeout         timeout
--------------------------------------------------------------
sum_rec         c                        1.83s           2.03s
                cuda                     0.14s           0.13s
                rust                    13.69s          14.10s
--------------------------------------------------------------
sum_tree        c                        0.25s           0.17s
                cuda                     0.08s           0.08s
                rust                     0.83s           0.84s
--------------------------------------------------------------
tuples          c                        2.52s           2.51s
                cuda                   timeout         timeout
                rust                     3.76s           3.82s
--------------------------------------------------------------