Currently, 64-bit operations are not fully supported on Windows and macOS. Additionally, 8- and 16-bit operations require extensions and may not be fully supported either.
For images, it is very beneficial to upload data to the device as u8 and then convert it to floating point there, since this quadruples effective bandwidth. What I did before was pack four u8's into a u32, and then use bitwise operations on the device to extract them into 4 u32's. I'm not sure about the performance, but at least it's the most portable approach.
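The packing and bitwise extraction described above can be sketched in host-side Rust; the function names here are hypothetical, and the unpack half mirrors the shifts and masks a shader would perform before converting each lane to floating point:

```rust
// Pack four u8 values into a single u32 (little-endian byte order),
// so one 32-bit load carries four image bytes.
fn pack_u8x4(b: [u8; 4]) -> u32 {
    (b[0] as u32)
        | (b[1] as u32) << 8
        | (b[2] as u32) << 16
        | (b[3] as u32) << 24
}

// Unpack a u32 into four u32 lanes with shifts and masks, as the
// device-side code would do before casting each lane to f32.
fn unpack_u32x4(word: u32) -> [u32; 4] {
    [
        word & 0xFF,
        (word >> 8) & 0xFF,
        (word >> 16) & 0xFF,
        (word >> 24) & 0xFF,
    ]
}

fn main() {
    let packed = pack_u8x4([1, 2, 3, 255]);
    assert_eq!(unpack_u32x4(packed), [1, 2, 3, 255]);
}
```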
I would like to support bf16 eventually, even if 16-bit values are simply converted to f32 for operations. This may be faster due to 2x the bandwidth and half the memory.
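Converting bf16 to f32 for operations is cheap because bf16 is just a truncated f32: the 16 bits are the sign, exponent, and top mantissa bits. A minimal sketch (the function names are hypothetical, and the f32-to-bf16 direction truncates rather than rounding to nearest even, which real code would want):

```rust
// Widen a bf16 bit pattern to f32 by placing its 16 bits in the
// high half of an f32 bit pattern; the low mantissa bits become zero.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

// Truncating f32 -> bf16 conversion: keep only the high 16 bits.
// (Round-toward-zero; production code should round to nearest even.)
fn f32_to_bf16(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

fn main() {
    // 1.0 survives the round trip exactly, since its low mantissa
    // bits are already zero.
    let one = f32_to_bf16(1.0);
    assert_eq!(bf16_to_f32(one), 1.0);
}
```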