Without a backend-specific native_batch_norm operator, testing fails while training works

smilealvin92 commented 6 months ago

Hi, I want to report an error. If we comment out native_batch_norm and native_batch_norm_backward in norm_ops.cpp, the training phase runs fine while the test phase fails in the forward pass of the BN operator. The immediate error is "Buffer is not valid for unallocated device", and it is raised by the `to` call in the line "c10::IValue(returns[idx].toTensor().to(*tgt_device));" in pytorch/aten/src/ATen/native/CPUFallback.cpp, which triggers a copy to the device. Maybe I should submit this issue to the PyTorch repo, since this is an error in the fallback path. I hope you can try to reproduce it.
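For readers unfamiliar with the fallback path, here is a rough sketch of the flow described above, written as a toy model in plain Python rather than with PyTorch itself. `ToyTensor`, `cpu_fallback`, `toy_normalize`, and the device string "ocl:0" are all hypothetical names chosen for illustration; the only parts taken from the report are the final copy back to the target device and the fact that the operator returns several tensors.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyTensor:
    """Hypothetical stand-in for a tensor: values plus a device tag."""
    data: List[float]
    device: str

    def to(self, device: str) -> "ToyTensor":
        # Model a device copy as a new object holding the same values.
        return ToyTensor(list(self.data), device)


def cpu_fallback(cpu_kernel: Callable[..., List[ToyTensor]],
                 args: List[ToyTensor],
                 tgt_device: str) -> List[ToyTensor]:
    # 1. Copy every input to the CPU.
    cpu_args = [a.to("cpu") for a in args]
    # 2. Run the CPU implementation of the operator.
    returns = cpu_kernel(*cpu_args)
    # 3. Copy each returned tensor back to the target device.
    #    This mirrors the `.to(*tgt_device)` call quoted in the report.
    return [r.to(tgt_device) for r in returns]


def toy_normalize(x: ToyTensor) -> List[ToyTensor]:
    # Hypothetical CPU kernel that returns three tensors, loosely shaped
    # like native_batch_norm's (output, save_mean, save_invstd).
    mean = sum(x.data) / len(x.data)
    var = sum((v - mean) ** 2 for v in x.data) / len(x.data)
    invstd = (var + 1e-5) ** -0.5
    out = [(v - mean) * invstd for v in x.data]
    return [ToyTensor(out, "cpu"),
            ToyTensor([mean], "cpu"),
            ToyTensor([invstd], "cpu")]


x = ToyTensor([1.0, 2.0, 3.0], "ocl:0")
results = cpu_fallback(toy_normalize, [x], "ocl:0")
print([r.device for r in results])
```

In this toy model every returned tensor goes through step 3, so a failure there would surface even when the CPU kernel itself ran correctly. The sketch does not attempt to model why the copy fails in test mode only.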

artyom-beilis commented 6 months ago

Ok let me check this.

Can you give a very small sample that reproduces it?

smilealvin92 commented 6 months ago

Running mnist.py is enough to reproduce this error. Just comment out native_batch_norm and its backward operator first.

smilealvin92 commented 6 months ago

> Ok let me check this.
>
> Can you give a very small sample that reproduces it?

Thanks for your attention.

artyom-beilis / pytorch_dlprim

Without a backend-specific native_batch_norm operator, testing fails while training works #52