I have a CUDA example in which I would like the host to communicate byte-wise with the device. Currently I can't do this directly:
ERROR: kernel: hb_mc_manycore_read_mem: Input 'sz' = 1: only multiples of 4 are supported
ERROR: hb_mc_manycore_eva_read_internal: Failed to copy data from host to NPA
ERROR: 'hb_mc_manycore_eva_read(device->mc, &default_map, &pod->mesh->origin, &daddr, haddr, bytes)' failed: Not implemented
ERROR: 'hb_mc_device_memcpy(device, dst, src, sizeof(T), HB_MC_MEMCPY_TO_HOST)' failed: Not implemented
I understand that doing things byte-wise will be about 4x slower overall, but in my use case extra processing would be required anyway to pack and unpack the bytes from 32-bit words, so sending bytes word-aligned just adds complexity to my program.
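For context, the pack/unpack workaround mentioned above can be sketched roughly as follows. This is only a minimal sketch in plain C, not code from the library that produced the errors: the helper names `padded_size`, `pack_bytes`, and `unpack_bytes` are hypothetical, and the little-endian placement of bytes within each 32-bit word is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round a byte count up to the next multiple of 4,
   because the error log says only multiples of 4 are supported. */
static size_t padded_size(size_t nbytes) {
    return (nbytes + 3u) & ~(size_t)3u;
}

/* Hypothetical helper: pack nbytes bytes into 32-bit words.
   Assumption: byte i goes into bits 8*(i%4) of word i/4 (little-endian).
   dst must hold padded_size(nbytes)/4 words; unused bytes are zeroed. */
static void pack_bytes(const uint8_t *src, size_t nbytes, uint32_t *dst) {
    size_t nwords = padded_size(nbytes) / 4;
    for (size_t w = 0; w < nwords; w++) {
        dst[w] = 0;
    }
    for (size_t i = 0; i < nbytes; i++) {
        dst[i / 4] |= (uint32_t)src[i] << (8 * (i % 4));
    }
}

/* Hypothetical helper: inverse of pack_bytes, recovering nbytes bytes. */
static void unpack_bytes(const uint32_t *src, size_t nbytes, uint8_t *dst) {
    for (size_t i = 0; i < nbytes; i++) {
        dst[i] = (uint8_t)(src[i / 4] >> (8 * (i % 4)));
    }
}
```

With helpers like these, a 1-byte transfer becomes a 4-byte transfer of one packed word, which satisfies the multiple-of-4 restriction but requires the extra packing step on both the host and the device side.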
Thanks!