NVlabs / timeloop

Timeloop performs modeling, mapping and code-generation for tensor algebra workloads on various accelerator architectures.
https://timeloop.csail.mit.edu/
BSD 3-Clause "New" or "Revised" License
303 stars 99 forks source link

Ruby >32GB memory usage, out of memory #260

Closed tanner-andrulis closed 1 month ago

tanner-andrulis commented 2 months ago

In the timeloop-accelergy-exercises repo, try the following: python3 run_example_designs.py --architecture simba_like --problem msft_phi_1_5/096.yaml

This uses 10s of gigabytes on my machine, eventually leading to an out-of-memory error. If I edit the intermediate files to change the mapspace to uber, it works fine. I think the problem is in the ResidualFactors calculation.

angshuman-parashar commented 2 months ago

I believe Ruby does an initial pruning step that's O(|mapspace|) in time-complexity. Perhaps it's also using an O(|mapspace|) data structure?

@MarkHoreni could you comment? Is this fundamental or a possible bug?

MarkHoreni commented 2 months ago

This is fundamental; the SIMBA_like architecture has an extra level with M compared to Eyeriss. With a dimension size of 6144, even with pruning, this still explodes. I have some workarounds currently architecture-specific for larger tensor sizes like this, but I have not worked much on generalizing it yet (but I have a feeling it can be done easily; it just would take time).

tanner-andrulis commented 2 months ago

It is my understanding (please correct if wrong) that given a dimension of 6144, Ruby will try every factor from 1..6144.

However, my intuition is that most of these factors will not be useful. For example, the following two loops seem to me like they'd have equivalent stats, except that the second will require a greater storage capacity:

for P in [0..2):
  for P in [0..3077):
for P in [0..2):
  for P in [0..3078,3076):

My questions are:

MarkHoreni commented 2 months ago

So there are two pruning steps; one step prunes out those useless factors, as you just described, as shown in numeric.cpp. The math is actually a bit stronger than just ceil(D/N) because not every residual factor is useful for every product factor, and the lowest residual factor possible is usually the optimal one. You see this around L399.

The problem that's being faced is that on the next step, even more pruning must be done because you can have each of these factors at any level, creating a large cross product. This can be additionally pruned by subselecting product factors based on the nth root and then permuting them across each level.

So, for example, if I have a factor of 3078, I know that that's the only one that can be implemented. That set gets added and permuted across all the tiling levels. I also know that to have a new set of product factors, I have to take the square root of 3078. If I were to just divide by 2, you get the explosion you're seeing, and this is true all the way up until you take the square root (An intuitive way to think about this is that what is the minimum possible number to multiply have in order to get between D, and 2D, because anything greater than 2D can be simplified). You see these conditions around L436.

tanner-andrulis commented 2 months ago

Thanks a lot for the detailed responses Mark. As a half-solution for these types of problems, do you think it'd be possible to turn off Ruby's imperfect factorization search while preserving user-defined imperfect constraints? As in, the user can set an imperfect constraint, but beyond that the mapspace would only search over prime factors.

For example, if the problem dimension is N=23 and the user writes a constraint of P=4 with residual 3, then the mapspace would only search over perfect factors that remain, so we'd get 3 and 2.

angshuman-parashar commented 2 months ago

Thanks a lot for the detailed responses Mark. As a half-solution for these types of problems, do you think it'd be possible to turn off Ruby's imperfect factorization search while preserving user-defined imperfect constraints? As in, the user can set an imperfect constraint, but beyond that the mapspace would only search over prime factors.

For example, if the problem dimension is N=23 and the user writes a constraint of P=4 with residual 3, then the mapspace would only search over perfect factors that remain, so we'd get 3 and 2.

I wonder if this would be easier to implement in the Uber mapspace rather than Ruby.

MarkHoreni commented 2 months ago

Thanks a lot for the detailed responses Mark. As a half-solution for these types of problems, do you think it'd be possible to turn off Ruby's imperfect factorization search while preserving user-defined imperfect constraints? As in, the user can set an imperfect constraint, but beyond that the mapspace would only search over prime factors. For example, if the problem dimension is N=23 and the user writes a constraint of P=4 with residual 3, then the mapspace would only search over perfect factors that remain, so we'd get 3 and 2.

I wonder if this would be easier to implement in the Uber mapspace rather than Ruby.

I agree with this. Also, user-defined imperfect factors were never implemented even in Ruby.

tanner-andrulis commented 2 months ago

I have implemented user-defined imperfect factors in Ruby, so most of the work would already be done if we go that route. I'm seeing that Ruby searches first for perfect factors then imperfect ones... Do you think it'd be feasible for there to be a toggle for the second step?

MarkHoreni commented 2 months ago

Yes, I think that would be an okay idea because you would essentially skip the residual loop entirely and pass through all of the imperfect factors. However, since imperfect factors are possible in the entire mapping, it would still have to iterate through non-prime factors, but everything not user-defined can be set to product factor = remainder factor.

tanner-andrulis commented 2 months ago

Thanks a lot! I'm working on doing some implementing, will get back to you if I have more questions. I appreciate it.

tanner-andrulis commented 1 month ago

@MarkHoreni Thank you for the advice. I've implemented imperfect factors in Uber.