CUDA Memcpy Byte Transfer Support

Hello!

I have a CUDA example in which I would like the host to communicate byte-wise with the device. Currently I can't do this directly. A 1-byte hb_mc_device_memcpy fails with:

ERROR:   kernel: hb_mc_manycore_read_mem: Input 'sz' = 1: only multiples of 4 are supported
ERROR:   hb_mc_manycore_eva_read_internal: Failed to copy data from host to NPA
ERROR:   'hb_mc_manycore_eva_read(device->mc, &default_map, &pod->mesh->origin, &daddr, haddr, bytes)' failed: Not implemented
ERROR:   'hb_mc_device_memcpy(device, dst, src, sizeof(T), HB_MC_MEMCPY_TO_HOST)' failed: Not implemented

I understand that byte-wise transfers will be about 4x slower overall, but in my use case extra processing is required anyway to pack and unpack the bytes from 32-bit words, so sending word-aligned data just adds complexity to my program.
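
For reference, the host-side workaround I am trying to avoid looks roughly like the sketch below. It is a minimal illustration, not bsg_replicant code: `fake_device_mem`, `word_read`, `word_write`, `byte_read`, and `byte_write` are all hypothetical names, and the word-only helpers just mimic the "only multiples of 4" restriction from the log above against a local buffer. A real version would presumably route the aligned transfers through hb_mc_device_memcpy instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for device memory (assumption: a real program would use the runtime). */
static uint8_t fake_device_mem[64];

/* Hypothetical word-only read: rejects anything not 4-byte aligned and sized. */
static int word_read(size_t daddr, void *dst, size_t sz)
{
    if (daddr % 4 != 0 || sz % 4 != 0 || daddr + sz > sizeof fake_device_mem)
        return -1;
    memcpy(dst, &fake_device_mem[daddr], sz);
    return 0;
}

/* Hypothetical word-only write with the same restriction. */
static int word_write(size_t daddr, const void *src, size_t sz)
{
    if (daddr % 4 != 0 || sz % 4 != 0 || daddr + sz > sizeof fake_device_mem)
        return -1;
    memcpy(&fake_device_mem[daddr], src, sz);
    return 0;
}

/* Read n bytes from any offset by fetching the enclosing aligned window. */
int byte_read(size_t daddr, void *dst, size_t n)
{
    uint8_t window[32];
    size_t start = daddr & ~(size_t)3;           /* round down to a word */
    size_t end = (daddr + n + 3) & ~(size_t)3;   /* round up to a word */

    if (n == 0)
        return 0;
    if (end - start > sizeof window)
        return -1;
    if (word_read(start, window, end - start) != 0)
        return -1;
    memcpy(dst, window + (daddr - start), n);
    return 0;
}

/* Write n bytes to any offset with a read-modify-write of the aligned window. */
int byte_write(size_t daddr, const void *src, size_t n)
{
    uint8_t window[32];
    size_t start = daddr & ~(size_t)3;
    size_t end = (daddr + n + 3) & ~(size_t)3;

    if (n == 0)
        return 0;
    if (end - start > sizeof window)
        return -1;
    /* Fetch the current words so neighbouring bytes are preserved. */
    if (word_read(start, window, end - start) != 0)
        return -1;
    memcpy(window + (daddr - start), src, n);
    return word_write(start, window, end - start);
}
```

With this, something like `byte_write(5, buf, 3)` touches only bytes 5 to 7 of the word at offset 4. The catch is that every byte write costs a word read plus a word write, and the read-modify-write is not safe if the device updates the neighbouring bytes in between, which is part of why native byte support in the memcpy path would be preferable.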

Thanks!

bespoke-silicon-group / bsg_replicant

CUDA Memcpy Byte Transfer Support #817