This is a medium- to long-term proposal to modify our heuristics system to enable some machine-specific information to be used to choose heuristic parameters.
The goal is to have a Python script that is run once on the user's machine whenever nvfuser is updated, creating a file that holds a collection of measured named variables: things like "matmul.splitk.elements_reduced_per_second", "normalization.additional_registers", etc. These are things that are very difficult to model accurately and are much more reliably set through experimentation, which is what the script would do.
Background
Currently most of our heuristics do not directly use measurements of machine performance. Instead, they are rule-based and are developed through trial and error on actual machines. Special casing for different architectures in heuristic code is necessary at times based on constraints in hardware or software, but AFAIK we don't currently write explicit parametrized timing models for any of our heuristics. To do so, we need some model parameters that are not trivial to derive by querying the CUDA device properties.
We currently have a plugin interface for the matmul heuristic #2049. This is a matmul scheduler-specific approach that allows a user to provide a C++ library at runtime that will take a problem description and previous heuristic and return an updated heuristic. The heuristic is a kernel configuration struct that can be converted to a MatmulParams object. Technically a user could use this to recreate something like the system described in this proposal, but it would be specific to the matmul scheduler, and we should facilitate doing this more easily for all our schedulers.
Proposal
I propose that we create a container class MachineInfo that behaves very much like std::unordered_map<std::string, PolymorphicValue> (except we really only need bool, int64_t, and double so we could probably wrap a smaller DynamicType). This would hold measured information that can be referred to in scheduler heuristics. For example, inside a normalization heuristic we might be interested in the bandwidth from global to shared memory at a given alignment, so we would query it by a string key.
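As a sketch of what that container and lookup could look like (everything here is hypothetical: the real class would wrap a restricted DynamicType rather than std::variant, and the key name and getOr helper are made up for illustration):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <variant>

// Hypothetical sketch of the proposed MachineInfo container. std::variant
// stands in for a DynamicType restricted to bool/int64_t/double.
class MachineInfo {
 public:
  using Value = std::variant<bool, int64_t, double>;

  void set(const std::string& key, Value v) {
    map_[key] = v;
  }

  // Forgiving lookup: return the measured value if present, otherwise fall
  // back to a caller-provided default (the real system would also warn that
  // measuring the value is preferable).
  template <typename T>
  T getOr(const std::string& key, T fallback) const {
    auto it = map_.find(key);
    if (it == map_.end()) {
      return fallback;
    }
    return std::get<T>(it->second);
  }

 private:
  std::unordered_map<std::string, Value> map_;
};

// Inside a normalization heuristic, a query might then look like:
//   double bw = machine_info.getOr<double>(
//       "normalization.gmem_to_smem_bandwidth.align16", default_bw);
```

The default-returning lookup is what makes the design "forgiving": a heuristic never fails because a key is missing, it just falls back to a conservative estimate.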
In order to populate a value like this, we would curate a collection of Fusions to execute and measure their timings in order to estimate the variables of interest in a regression approach. I propose that we do that in a Python script that we provide in tools.
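The regression itself can be very simple. For a single throughput-style variable such as "matmul.splitk.elements_reduced_per_second", a least-squares fit of measured time against problem size over the curated trial Fusions would suffice. A minimal sketch (synthetic data; the function name is made up here):

```cpp
#include <utility>
#include <vector>

// Fit time = elements / throughput (a line through the origin) to a set of
// (elements, seconds) trial measurements by least squares:
//   throughput = sum(elements^2) / sum(elements * seconds)
double estimateThroughput(
    const std::vector<std::pair<double, double>>& trials) {
  double num = 0.0;
  double den = 0.0;
  for (const auto& [elements, seconds] : trials) {
    num += elements * elements;
    den += elements * seconds;
  }
  return num / den;
}
```

More involved variables (e.g. a fixed launch-overhead term plus a bandwidth term) would just add columns to the regression, but the shape of the computation stays the same.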
The MachineInfo object would be serialized and loaded automatically. It could be edited by the user manually but most often it would be filled by a measurement script that would run a bunch of trial problems in order to estimate the values. We should save the nvfuser and other software versions along with device information in the serialized file and we should warn users when it is out of date, as the measured values might change based on changes to our codebase or to the installed NVRTC or driver versions.
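The serialized file might look something like the following (a hypothetical schema; the field and key names are made up, the point is only that software versions and device information travel alongside the measured values so staleness can be detected):

```json
{
  "nvfuser_version": "0.2.1",
  "nvrtc_version": "12.4",
  "driver_version": "550.54",
  "device_name": "NVIDIA H100 80GB HBM3",
  "values": {
    "matmul.splitk.elements_reduced_per_second": 1.9e10,
    "normalization.additional_registers": 8
  }
}
```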
Note that this design is meant to be lightweight and "forgiving". If values are missing we should provide some reasonable defaults but warn the user that measuring them is best. We should not be sticklers about the schema because that adds friction in developing heuristics, and because we want to enable the use of user-defined data in custom heuristics.
This would not affect the canSchedule static methods: we would only provide this object to the SchedulerEntry subclass constructor, e.g. to the InnerPersistentKernelScheduler ctor.
Details
Python interop
I think it makes sense for us to subclass py::object for our HeuristicParams subclasses as well as for the machine profile class so that we have a standard way to interoperate with Python. This will let us write a Python script to acquire the machine profile; timing can be done from Python by the following steps:
Create a FusionDefinition for a particular problem
Provide a set of heuristic parameters as a python object. This will need to be plumbed in but can be done generally so that heuristics for any scheduler can be passed in and multiple heuristic params can be passed if the kernel is segmented.
Use the profiling interface to collect kernel times
Storing the device profiles
The "profile" is serializable to a file that we should store in the user's home directory ($XDG_DATA_HOME on Linux?). We can also override that location with an env var, pass it as an argument to fd.execute, etc.
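One possible resolution order: an explicit argument, then an environment variable, then $XDG_DATA_HOME, then a default under $HOME. A sketch of that policy (the env var name NVFUSER_MACHINE_INFO and the path layout are assumptions, not existing nvfuser behavior; sources are passed in as strings so the policy is easy to test, with empty meaning "not set"):

```cpp
#include <string>

// Hypothetical lookup order for the machine profile file location.
std::string resolveProfilePath(
    const std::string& explicit_path,  // e.g. an argument to fd.execute
    const std::string& env_override,   // e.g. $NVFUSER_MACHINE_INFO
    const std::string& xdg_data_home,  // $XDG_DATA_HOME
    const std::string& home) {         // $HOME
  if (!explicit_path.empty()) {
    return explicit_path;
  }
  if (!env_override.empty()) {
    return env_override;
  }
  if (!xdg_data_home.empty()) {
    return xdg_data_home + "/nvfuser/machine_info.json";
  }
  // XDG spec default when $XDG_DATA_HOME is unset.
  return home + "/.local/share/nvfuser/machine_info.json";
}
```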
Default profiles
For common datacenter devices, we could prepackage a set of profiles and include them whenever we cut a release. These will still not capture all environment details like NVRTC specifics, so we should still print a warning if we don't detect an exact match.
Related ideas
Since this proposal includes a mechanism for passing problem descriptions and heuristic params to and from Python, it also opens the door to writing heuristics directly in Python. This has been explored a little bit in #2106 for the matmul case, but this would give us a standard way to override or augment any of the built-in C++ heuristics from Python. Then if users wish to experiment with their own heuristic, or even implement some form of autotuning, they could simply do so by overriding the heuristic from Python.
Sort of orthogonal, but should a machine parameter be a scalar (named?) Val? We discussed ideas of representing heuristics as Vals so that they could be symbolically manipulated and evaluated.