Currently CUDA doesn't support directly casting fp16 to (u)int8, so we first need to lift fp16 to fp32 using `__half2float`, then cast that to (u)int8. I don't know how often fp16-to-int8 casts are needed, but I think there are some cases that require them.
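The two-step lowering could look something like this device helper (a sketch only; `half_to_int8` is an illustrative name, not code from this PR):

```cuda
#include <cuda_fp16.h>
#include <cstdint>

// Hypothetical helper: since there is no direct half -> int8 conversion,
// lift the fp16 value to fp32 first, then cast down to int8.
__device__ __forceinline__ int8_t half_to_int8(__half h) {
    float f = __half2float(h);       // fp16 -> fp32
    return static_cast<int8_t>(f);   // fp32 -> int8 (truncates toward zero)
}
```

The unsigned case would be the same shape, just casting to `uint8_t` at the end.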
If it's OK, I'd like to add more test cases to complete this PR.