NVIDIA / TensorRT-Incubator

Experimental projects related to TensorRT
81 stars 12 forks source link

[Tripy] Return the shape immediately if it is statically known instead of producing a trace operator. #379

Open slyubomirsky opened 1 week ago

slyubomirsky commented 1 week ago

Addresses issue #360.

slyubomirsky commented 1 week ago

When informally benchmarking, I did not find a large difference between the current implementation and this change, however. Perhaps the issue is how the benchmark was measured. Informal benchmark:

The function to test:

def func(t1, t2):
    return tp.Tensor(list(t1.shape) != list(t2.shape), dtype=tp.bool)

I wrote it this way because the _is_compatible helper in Parameter returns a Python bool, but tp.compile can only compile functions returning a Tripy tensor.

Comparison:

c = tp.compile(func, args=[tp.InputInfo(shape=[1, 3, 3], dtype=tp.int32), tp.InputInfo(shape=[1, 3, 3], dtype=tp.int32)])
t1 = tp.Tensor([[[1, 2, 3], [4, 5, 6], [7, 8, 9]]])
t2 = tp.Tensor([[[1, 2, 3], [4, 5, 6], [7, 8, 9]]])

def time_run():
    start = time.perf_counter_ns()
    for _ in range(1000):
        c(t1, t2)
    end = time.perf_counter_ns()
    return (end - start) / 1.0e6

After warming up, both complete in about 25 microseconds. Should we do this comparison without compiling instead? I could also vary the input values.

pranavm-nvidia commented 1 week ago

@slyubomirsky most of the impact will be on the compile time I think.

slyubomirsky commented 1 week ago

Ah, okay. I'll profile compile time then.

slyubomirsky commented 6 days ago

FWIW, some cursory benchmarking also did not reveal much change in the compile time either (a difference about 100 microseconds over 1000 trials), though it might be a matter of the function that is compiled.

The function whose compilation I timed (the return value needed to be a tensor for it to be valid to submit to tp.compile):

def func():
      return tp.concatenate(tp.Tensor([1, 2, 3, 4].shape, dim=0)

If there's a specific benchmark we have in mind, that might better to measure.

pranavm-nvidia commented 4 days ago

@slyubomirsky I think a benchmark where it could make a difference is setting parameters on modules without involving any Tripy operations, e.g.:

module.param = tp.Parameter([1, 2, 3])

Previously, this would have triggered a compilation in order to compare the shape of the new parameter with that of the old one. With your changes, the compilation should be entirely bypassed.

slyubomirsky commented 4 days ago

Comparing the parameter update time directly does yield the expected difference 🙂 100x faster due to not compiling.