facebookresearch / TensorComprehensions

A domain specific language to express machine learning workloads.
https://facebookresearch.github.io/TensorComprehensions/
Apache License 2.0

autotuning convenience improvement request #381

Open seongwook-ham opened 6 years ago

seongwook-ham commented 6 years ago

In some cases, and I don't know why this happens, autotuning never finishes. The autotuner freezes at "100/100" and the job never completes. When that happens I try Ctrl+C, but it does not work, so I have to kill the process and re-run the autotuning script, which is very time-consuming. Is it possible to impose a time limit for each generation and, when the time is up, move on to the next generation? Or is there a more efficient solution?

prigoyal commented 6 years ago

Thanks @seongwook-ham for the report. Can you provide us a repro? It also depends on what operation you are running; sometimes compilation takes a lot of time, which makes things look like they are hanging. We can look further into it if you can give an example operation. Thanks :)

ftynse commented 6 years ago

> The autotuner freezes at "100/100" and the job never completes.

That counter shows the number of compilation/execution jobs started rather than completed. One or more jobs are probably taking a long time. While compilation can be killed, there is no way to kill execution: CUDA kernels are not preemptible (nor can we kill the thread that controls a kernel without killing everything else).

> When that happens I try Ctrl+C, but it does not work.

This is due to the interaction between the Python shell and TC. The current implementation waits for the entire tuning generation to terminate before exiting when Ctrl+C is pressed. We know this is suboptimal, but a cleaner solution, aborting the generation immediately, requires non-trivial parallelism machinery that we have not yet implemented.
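
For illustration, here is a minimal sketch, not TC's actual implementation, of the deferred interrupt handling described above: a SIGINT received during a generation is only acted upon once that generation completes (`run_generation` is a hypothetical callable standing in for one tuning generation).

```python
import signal

# Sketch of deferred Ctrl+C handling (not TC's actual code): remember that
# SIGINT arrived while a tuning generation runs, and act on it afterwards.
interrupted = False

def _on_sigint(signum, frame):
    global interrupted
    interrupted = True  # record the request instead of raising immediately

def tune(generations, run_generation):
    """run_generation is a hypothetical callable running one generation."""
    previous = signal.signal(signal.SIGINT, _on_sigint)
    try:
        for gen in range(generations):
            run_generation(gen)   # cannot be aborted mid-generation
            if interrupted:       # honor Ctrl+C only between generations
                print("interrupted after generation", gen)
                break
    finally:
        signal.signal(signal.SIGINT, previous)
```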

> Is it possible to impose a time limit for each generation and, when the time is up, move on to the next generation?

It is fundamentally impossible for CUDA kernels, except in some cases where you run the most recent CUDA with the most recent drivers on certain cards.

> Can you provide us a repro?

Please consider providing repro code and logs with the --debug_tuner option (or its Python equivalent). Autotuner runs are non-deterministic, and we are unlikely to reproduce exactly the same behavior you saw. We can, however, look at why certain options take a long time to compile or execute.
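
For what it's worth, a repro along these lines could look like the sketch below. It assumes the TC 0.1-era Python API (`tc.define` / `.autotune`); `--debug_tuner` is a flag of the underlying C++ tuner, and how it is plumbed through from Python may differ in your build.

```python
import torch
import tensor_comprehensions as tc

# Self-contained repro sketch, assuming the TC 0.1-era Python API.
lang = """
def matmul(float(M,N) A, float(N,K) B) -> (output) {
    output(i, j) +=! A(i, kk) * B(kk, j)
}
"""
matmul = tc.define(lang, name="matmul")
A = torch.randn(128, 256).cuda()
B = torch.randn(256, 64).cuda()
# Run the autotuner; enable the equivalent of --debug_tuner via whatever
# debug-flag mechanism your TC build exposes, and attach the resulting logs.
best_options = matmul.autotune(A, B, cache=True)
```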

thetheodor commented 6 years ago

> Is it possible to impose a time limit for each generation and, when the time is up, move on to the next generation?

> It is fundamentally impossible for CUDA kernels, except in some cases where you run the most recent CUDA with the most recent drivers on certain cards.

Maybe the following would work: run each generation in a separate process and, on a timeout, kill that process. There are IPC facilities in CUDA that allow mapping GPU memory from one process into another (to avoid copying or regenerating inputs/outputs). However, I have no idea what state the GPU is left in after killing one of the host processes using it.
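
A rough sketch of that idea in Python, using `multiprocessing`; `run_generation` is a hypothetical worker that compiles and benchmarks one generation's candidates in its own process (and hence its own CUDA context):

```python
import multiprocessing as mp

def run_generation(gen):
    """Hypothetical worker: compile and benchmark the candidates of one
    tuning generation inside this process's own CUDA context."""
    ...

def tune_with_timeout(generations, timeout_s=300):
    for gen in range(generations):
        p = mp.Process(target=run_generation, args=(gen,))
        p.start()
        p.join(timeout_s)      # wait at most timeout_s seconds
        if p.is_alive():       # the generation overran its budget
            p.terminate()      # kill the worker; as noted above, the GPU
            p.join()           # may be left in an unknown state
            print(f"generation {gen} timed out; skipping to the next one")
```

Whether the GPU recovers cleanly after the worker is killed is the open question; the CUDA IPC-based sharing of inputs/outputs is not shown here.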

concretevitamin commented 6 years ago

+1 to what the OP said. I can confirm all of those symptoms have hit me.

ftynse commented 6 years ago

"Ctrl+C" was fixed by #476 , timeout is in progress