Changed the default steps from 8 to 4, which limits the number of onnxruntime sessions created with different batch sizes (these sessions are used to get decent performance from DirectML). This has a performance impact for smaller nets, but seems to solve out-of-memory issues with some GPUs.
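A rough sketch of the trade-off, not the backend's actual code: it assumes one session per fixed batch size, with sizes evenly spaced up to a hypothetical maximum batch of 256, and that a request runs on the smallest session that fits it. The function names and the spacing scheme are illustrative assumptions.

```python
import math


def session_batch_sizes(max_batch: int, steps: int) -> list[int]:
    """Evenly spaced fixed batch sizes, one per session (assumed scheme)."""
    return [math.ceil(max_batch * i / steps) for i in range(1, steps + 1)]


def pick_session(batch: int, sizes: list[int]) -> int:
    """Return the smallest fixed batch size that can hold the request."""
    for size in sizes:
        if batch <= size:
            return size
    raise ValueError("batch exceeds the largest session")


# Hypothetical maximum batch size, used for illustration only.
MAX_BATCH = 256

sizes_8 = session_batch_sizes(MAX_BATCH, 8)  # 8 sessions held in memory
sizes_4 = session_batch_sizes(MAX_BATCH, 4)  # 4 sessions held in memory

# A request of 20 positions is padded up to the chosen session's size.
padded_8 = pick_session(20, sizes_8)
padded_4 = pick_session(20, sizes_4)
print(len(sizes_8), padded_8)
print(len(sizes_4), padded_4)
```

Under these assumptions, halving steps halves the number of sessions kept in memory, while small batches are padded to a larger size, which is where the cost for smaller nets would come from.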