NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Invalidate entries in the fusion cache when full #1710

Open tfogal opened 7 months ago

tfogal commented 7 months ago

When a user constructs too many fusions today, the fusion cache overflows and nvFuser throws up its hands and gives up. Any user could theoretically hit this, but we acutely feel the pain in CI, where an extensive test suite has grown the fusion cache to its maximum size. The temporary workaround of growing the cache (#1702) is not a true solution.

The salient elements of the cache are a pair of linked data structures: a trie of the fusion elements themselves (i.e. the ops) and a paired vector<FusionSchedule*> (FusionCache::fusions_) that holds the actual cached elements. Each node in the trie corresponds to the set of ops "up to" that node. For example, if there are two entries in the cache, corresponding to mul-add-reduction and mul-add-division, then the trie is conceptually something like this:

           reduction
          /
mul - add 
          \
           division

where there are FusionSchedules attached to the trie at the reduction and division nodes (in the implementation, these are actually RecordType::End nodes). Every such End node will then have an associated "index" (TrieNode::fusion_id) that indexes into the FusionCache::fusions_ array.

A more straightforward implementation would be a hash table holding the set of fusions. A hash table may have additional overheads due to collisions, however: if a fusion hashes to, say, bucket 42, we cannot assume that the entry in bucket 42 is the one we need; multiple fusions can hash to the same bucket, so we still have to compare each op in our fusion against each op of the stored entry. Such a check is O(n) in the fusion's list of ops. Traversing the trie is also O(n) in the list of ops, but as soon as we reach the end we know definitively whether we have a match or the fusion is not cached.
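To make this concrete, here is a minimal sketch of the two linked structures and the trie lookup. The names (`TrieNode`, `Cache`, a string key per op) are hypothetical and only loosely mirror `TrieNode::fusion_id` and `FusionCache::fusions_`; this is not the actual nvFuser implementation.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <vector>

struct FusionSchedules { /* the cached, schedulable fusion; opaque here */ };

struct TrieNode {
  // One child per possible "next op", keyed here by a serialized op record.
  std::map<std::string, std::unique_ptr<TrieNode>> children;
  // Set only on terminal (End) nodes: an index into Cache::fusions below.
  std::optional<std::size_t> fusion_id;
};

struct Cache {
  TrieNode root;
  std::vector<std::unique_ptr<FusionSchedules>> fusions;
};

// Trie lookup walks one edge per op, so it is O(n) in the op list, and
// reaching an End node is already an exact match; no re-check is needed.
// A hash table also touches all n ops to compute the hash, but on a hit it
// still has to compare op-by-op against the stored fusion to rule out a
// collision before it can trust the entry.
FusionSchedules* lookup(Cache& cache, const std::vector<std::string>& ops) {
  TrieNode* node = &cache.root;
  for (const auto& op : ops) {
    auto it = node->children.find(op);
    if (it == node->children.end()) {
      return nullptr;  // definitely not cached
    }
    node = it->second.get();
  }
  if (!node->fusion_id) {
    return nullptr;  // a prefix of some cached fusion, but no End node here
  }
  return cache.fusions[*node->fusion_id].get();
}
```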

Some things to investigate:

tfogal commented 7 months ago

Assigning to @rdspring1. Not intending to imply he should do everything here, but because we talked about this I want to be sure he sees + edits the above and ensures I did not forget anything.

rdspring1 commented 7 months ago

@tfogal Your summary looks good!

Here are some other items from my notes:

  1. We should create a test for this in python_test/test_python_frontend.py, since debugging the fusion cache through the CI is cumbersome. The test would artificially lower the cache limit to three and check how the cache behaves once that limit is reached.
  2. When we remove a fusion from the Trie, we need to prune dead record functors if they no longer have any children (see the sketch after this list).
  3. If we keep the vector<FusionSchedule*> structure, we need to decouple the vector's indices from the fusion's id.
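A rough sketch of how items 2 and 3 could fit together, again with hypothetical names and no claim of matching the real FusionCache API: eviction erases the cached fusion, clears the End node, prunes now-dead trie nodes from the leaf back toward the root, and keeps fusion ids stable by storing cached entries in a map rather than reusing vector slots.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct FusionSchedules { /* opaque cached schedules */ };

struct TrieNode {
  std::map<std::string, std::unique_ptr<TrieNode>> children;
  std::optional<std::size_t> fusion_id;  // set only on End nodes
};

struct Cache {
  TrieNode root;
  // Item 3: keying cached entries by a stable fusion id (instead of a plain
  // vector index) means evicting one fusion never shifts anyone else's id.
  std::map<std::size_t, std::unique_ptr<FusionSchedules>> fusions;
};

// Item 2: after evicting, walk the evicted fusion's path and prune childless
// nodes that no longer carry a fusion, from the leaf back toward the root.
void evict(Cache& cache, const std::vector<std::string>& ops, std::size_t id) {
  cache.fusions.erase(id);

  std::vector<std::pair<TrieNode*, std::string>> path;  // (parent, edge taken)
  TrieNode* node = &cache.root;
  for (const auto& op : ops) {
    auto it = node->children.find(op);
    if (it == node->children.end()) {
      return;  // fusion not present in the trie; nothing to prune
    }
    path.emplace_back(node, op);
    node = it->second.get();
  }
  node->fusion_id.reset();  // the End node no longer names a cached fusion

  for (auto it = path.rbegin(); it != path.rend(); ++it) {
    TrieNode* child = it->first->children.at(it->second).get();
    if (child->children.empty() && !child->fusion_id) {
      it->first->children.erase(it->second);  // dead record node: remove it
    } else {
      break;  // this ancestor still serves other fusions; stop pruning
    }
  }
}
```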
tfogal commented 7 months ago

> Here are some other items from my notes:

Thanks Ryan! I edited these into the main issue, to make it easy for someone to pick this up w/o following all our comments.

kevinstephano commented 7 months ago

We did not implement an LRU cache eviction policy since it was not needed when we first started using the Python interface and we had other things to do. Therefore, it was designed to simply assert when the number of node entries reaches the max. The max was set at 8192 to pass our tests. 8192, in reality, is huge for a single model. The cases where we get in trouble are when we run back-to-back tests and the cache does not get reset.
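For reference, an LRU eviction policy along the lines Kevin mentions could look something like the generic sketch below. This is not nvFuser code, just a standard list-plus-hash-map LRU keyed by fusion id; the returned victim would be the entry to invalidate (and whose trie branch to prune, as sketched earlier).

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <unordered_map>

// Generic list-plus-hash-map LRU, tracking fusion ids by recency of use.
class LruPolicy {
 public:
  explicit LruPolicy(std::size_t max_entries) : max_entries_(max_entries) {}

  // Record that `fusion_id` was just used (hit or insert). If the cache is
  // now over capacity, returns the least recently used id to evict.
  std::optional<std::size_t> touch(std::size_t fusion_id) {
    auto it = index_.find(fusion_id);
    if (it != index_.end()) {
      order_.erase(it->second);  // already present: move it to the front
    }
    order_.push_front(fusion_id);
    index_[fusion_id] = order_.begin();

    if (order_.size() <= max_entries_) {
      return std::nullopt;
    }
    std::size_t victim = order_.back();  // least recently used entry
    index_.erase(victim);
    order_.pop_back();
    return victim;
  }

 private:
  std::size_t max_entries_;
  std::list<std::size_t> order_;  // most recently used at the front
  std::unordered_map<std::size_t, std::list<std::size_t>::iterator> index_;
};
```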

kevinstephano commented 7 months ago

Note that a lot of the testing is done at the C++ level: https://github.com/NVIDIA/Fuser/tree/main/csrc/python_frontend/test