facebookresearch / TensorComprehensions

A domain specific language to express machine learning workloads.
https://facebookresearch.github.io/TensorComprehensions/
Apache License 2.0

Running generated CUDA kernel outside of PyTorch #466

Open concretevitamin opened 6 years ago

concretevitamin commented 6 years ago

Hi,

I'm interested in running a TC-generated CUDA kernel outside of PyTorch. Currently, I'm using the TC mapping options to specify the grid and block dimensions (dim3). E.g., with

    .mapToThreads(320)
    .mapToBlocks(32, 320)

from TC, I launch the auto-generated kernel (the __global__ function in /tmp/<tc>.cuda) with the following:

    dim3 grid(32, 320);
    dim3 block(320);
    tc_kernel<<<grid, block>>>( /* arguments with correct shapes; output buffer zeroed out */ );

However, this seems to produce incorrect values compared to a reference implementation. Am I missing anything? Is there other necessary setup for a TC kernel to work standalone?

ftynse commented 6 years ago

TC removes blocks and threads that do nothing, so .mapToThreads(320) does not mean the kernel will actually be launched with 320 threads per block. Moreover, the generated code assumes it will be executed with one specific number of threads and blocks, which is what allows it to remove otherwise-necessary conditions from the code.
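For illustration only (a hypothetical kernel, not actual TC output), this is what a kernel specialized to an exact launch configuration looks like once the guard is gone:

    __global__ void add_specialized(float* out, const float* a, const float* b) {
      // Specialized for one exact launch configuration: the mapper assumes
      // gridDim.x * blockDim.x covers the data exactly, so the usual
      // `if (i < N) { ... }` guard has been dropped.
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      out[i] = a[i] + b[i];  // wrong or out-of-bounds accesses if launched with a different config
    }

Launching such a kernel with more threads than it was specialized for reads and writes out of bounds; launching it with fewer leaves some outputs untouched, which is consistent with the incorrect values you are seeing.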

Try tc.GlobalDebugInit(["--debug_tc_mapper=true", "--logtosdterr"]) and see what it prints around the words "tightened launch bounds". Those are the grid and block sizes effectively used to launch the kernel.
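Once you have those numbers, a standalone harness can look roughly like the sketch below. The kernel body, its signature, and the grid/block values are placeholders (my assumptions, not TC's actual output); paste in the real __global__ function from the generated .cuda file and the sizes reported next to "tightened launch bounds":

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder standing in for the generated kernel; replace it with the
    // real __global__ function from the generated .cuda file.
    __global__ void tc_kernel(float* O, const float* A, const float* B) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      O[i] = A[i] + B[i];
    }

    int main() {
      // Example values only -- use the tightened grid/block from the debug log.
      dim3 grid(13, 1, 1);
      dim3 block(64, 1, 1);
      const size_t n = size_t(grid.x) * block.x;  // elements touched by this toy kernel

      float *O, *A, *B;
      cudaMalloc(&O, n * sizeof(float));
      cudaMalloc(&A, n * sizeof(float));
      cudaMalloc(&B, n * sizeof(float));
      cudaMemset(O, 0, n * sizeof(float));  // zero the output, as in the original post
      cudaMemset(A, 0, n * sizeof(float));
      cudaMemset(B, 0, n * sizeof(float));

      tc_kernel<<<grid, block>>>(O, A, B);
      cudaError_t err = cudaGetLastError();  // catches invalid launch configurations
      if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));
      cudaDeviceSynchronize();

      cudaFree(O); cudaFree(A); cudaFree(B);
      return 0;
    }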

concretevitamin commented 6 years ago

@ftynse Thanks. I'm using the conda-installed version of TC, commit git_version: "8e112e9dccda62c30ef29208a827e783b9a7f156" where --logtosdterr is not available. Is there a workaround? Fundamentally, is there a way to figure out the launch config from already-tuned <hash>.{cuda,options} files?

Also, an orthogonal question. Let's say I previously had tuned a kernel with these cached output files:

/tmp/<hash>.cuda
/tmp/<hash>.options

If I want to start the autotuning process off of this already-tuned kernel, do I pass layer.autotune(..., cache='/tmp/<hash>')? I'm seeing 100x worse "best" timing when I do this.

nicolasvasilache commented 6 years ago

@concretevitamin the commit you mention is pretty ancient; any chance you could build from source using the new build system (see the new build instructions)? This way you would have an up-to-date version of TC and get fixes as they come. If that is too inconvenient, you can also wait until we push a new TC conda package, but that will take a few more days.

Regarding the caching and iterating, we have been using the approach successfully from C++. There may be something lurking on the python side that we missed so a repro would always be useful. Note that we deprecated the cuda cache and only keep the topK best options (defaults to 10).

nicolasvasilache commented 6 years ago

@concretevitamin in particular, if you only want to use TC from Python and don't care about C++ development or benchmarks, then #470 should be pretty easy to follow.

ftynse commented 6 years ago

where --logtosdterr is not available.

Well, I've made a typo and it should be --logtostderr.

Fundamentally, is there a way to figure out the launch config from already-tuned .{cuda,options} files?

No. I would not have suggested looking at the debug output had there been such a way.

skimo-openhub commented 6 years ago

On Sun, Jun 03, 2018 at 11:09:09PM -0700, ftynse wrote:

Fundamentally, is there a way to figure out the launch config from already-tuned .{cuda,options} files?

No. I would not have suggested to look at the debug output had there been such a way.

Hmm... isn't the point that we should store this information somewhere?

skimo

ftynse commented 6 years ago

Hmm... isn't the point that we should store this information somewhere?

If we had stored the generated code in the actual codebase, then the answer would have been yes. Codegen returns the launch bounds; now it's a matter of exposing the codegen call itself to Python. The caller can do whatever it wants with the results.

concretevitamin commented 6 years ago

@ftynse @nicolasvasilache I will give building from source a try.

Regarding whether or not correct launch bounds should be stored on disk after auto-tuning: it seems obvious it should be stored, otherwise how can one reuse the tuned kernels across sessions? An analogy I can think of is having successfully trained a NN but without storing the weights :)

ftynse commented 6 years ago

Well, this is not how the TC tuner was designed. It does not produce CUDA, but mapping options. Storing CUDA code is merely a side effect of running the kernel. I think we actually killed that storage completely in the master branch.

If you need the kernel and its launch bounds, give those options to the TC compiler and it will produce the desired result. The Python interface seems to be missing the proper call for this, which has to be addressed. Nothing more.

Picking up your analogy, the autotuner is more like comparing different NNs on test error. You keep the best architecture, but not necessarily the test set.

