Closed shino16 closed 2 weeks ago
`test_inplace_to_arg_return_value` in `thunder/tests/test_inplace_functionalization.py` currently fails if you give `executors=(thunder.tests.framework.TorchCompileExecutor,)`.
Having `prims.copy_(t0, a)` without any output in the trace should not be considered a valid program to be executed. The return symbol could take fake inputs that are never returned in Python execution, only to build a dependency in the DAG. I expect the computation trace to be something like:
def computation(a):
  # a: "cpu i64[]"
  t1 = TorchCompile0(a)
    # t0 = ltorch.add(a, 1, alpha=None)  # t0: "cpu i64[]"
      # t0 = prims.add(a, 1)  # t0: "cpu i64[]"
    # t1 = prims.copy_(t0, a)
  return None  # t1 <--- "t1" is a hidden output used only for building dataflow graph
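
To illustrate why a hidden output would help, here is a minimal sketch of dependency discovery by matching outputs to arguments. It is not Thunder's actual `Graph` code: `Sym` and `dependency_edges` are hypothetical names, and bound symbols are reduced to plain tuples of variable names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:
    """Hypothetical stand-in for a bound symbol: a name plus flat args and outs."""
    name: str
    args: tuple = ()
    outs: tuple = ()


def dependency_edges(syms):
    """Return (producer, consumer) name pairs found by matching outs to args."""
    producer_of = {out: s.name for s in syms for out in s.outs}
    edges = set()
    for s in syms:
        for arg in s.args:
            producer = producer_of.get(arg)
            if producer is not None and producer != s.name:
                edges.add((producer, s.name))
    return edges


# copy_ has no output, so nothing ties it to `return a`.
no_output = [
    Sym("add", args=("a",), outs=("t0",)),
    Sym("copy_", args=("t0", "a"), outs=()),
    Sym("return", args=("a",)),
]
edges_without = dependency_edges(no_output)

# copy_ produces t1 and `return` takes t1 as a hidden argument.
hidden_output = [
    Sym("add", args=("a",), outs=("t0",)),
    Sym("copy_", args=("t0", "a"), outs=("t1",)),
    Sym("return", args=("t1",)),
]
edges_with = dependency_edges(hidden_output)
```

In the first list a topological sort is free to place `return` before `copy_`; in the second, the `copy_` to `return` edge rules that out.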
Hi! It would be possible for the functionalizer to return the outputs of `copy_`s as a 'hidden' argument of `return`. Can I ask the reason why you think having `copy_` with no output (and handling it as a special case) is a bad idea?
Using `copy_` (and any other in-place operation) with no output is not a bad idea in general. In PyTorch eager mode, nothing can reorder operations, but in Thunder any trace transformation pass is free to do so. All in-place operations in the trace should keep the relative ordering prescribed by the initial user script.
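
As a small illustration of why ordering matters, the sketch below uses a plain dictionary in place of tensors (an assumption made to keep it dependency-free). The two hypothetical steps stand in for an in-place update such as `a.add_(1)` and a later read such as `b = a * 2`.

```python
def add_one_inplace(state):
    # Stands in for an in-place update such as a.add_(1).
    state["a"] += 1


def read_double(state):
    # Stands in for a later read such as b = a * 2.
    state["b"] = state["a"] * 2


def run(ops, a=1):
    """Apply the steps in the given order to a fresh state."""
    state = {"a": a}
    for op in ops:
        op(state)
    return state


original = run([add_one_inplace, read_double])   # order written by the user
reordered = run([read_double, add_one_inplace])  # order after a careless reordering
```

Both runs end with the same `a`, but `b` differs, so a pass that reorders freely must be told about the dependency between the in-place write and the read.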
Handling `copy_` as a special case in `data_dependent_partition.py` might not be ideal because:

1. It is not just `copy_` but `prims.copy_`, `torchex.copy_`, and maybe something else, and all of these similar operations would need to be put in `data_dependent_partition.py`.
2. Any external OperatorExecutor should be able to claim `copy_` if it wants to, and with the current approach its operation would need to be added to `data_dependent_partition.py`. External extensions should not be required to modify Thunder's internals.
3. Expressing the dependency explicitly, rather than special-casing `copy_`, makes the code more self-documenting and easier to understand.

What do you think? I'm open to further discussion of alternative ideas, or of specific use cases you have in mind that might benefit from a different approach.
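
The contrast between the two approaches can be sketched roughly as follows. Everything here is hypothetical: `KNOWN_COPY_OPS`, both helper functions, and the `myex.copy_` name are illustrative and not part of Thunder.

```python
# Hypothetical allowlist that a partitioner might hard-code.
KNOWN_COPY_OPS = {"prims.copy_", "torchex.copy_"}


def must_precede_return_by_name(op_name):
    """Name-based special case: only ops on the allowlist are recognized."""
    return op_name in KNOWN_COPY_OPS


def must_precede_return_by_dataflow(op_outs, return_args):
    """Dataflow-based rule: an op precedes return if return consumes its output."""
    return any(out in return_args for out in op_outs)


# An external executor's implementation (hypothetical name).
external_op = "myex.copy_"

missed_by_name = not must_precede_return_by_name(external_op)
caught_by_dataflow = must_precede_return_by_dataflow(
    op_outs=("t1",), return_args=("t1",)
)
```

The name-based check needs a new entry for every implementation, while the dataflow rule never looks at the operation's name.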
The first two reasons particularly made strong sense to me. If we're not relating copies and `return` in the sorting code, then a reasonable choice would be to do so as soon as functionalization generates `prims.copy_`.

I'm working on implementing this!
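
One way to picture "relating copies and return at generation time" is the toy trace builder below. It is a sketch under assumptions, not the functionalizer's real interface: `ToyTrace`, its methods, and the `copy_out0` naming are all hypothetical.

```python
class ToyTrace:
    """Hypothetical trace builder that ties every emitted copy_ to return."""

    def __init__(self):
        self.syms = []                # (name, args, outs) tuples in program order
        self.hidden_return_args = []  # copy_ outputs that return must depend on

    def emit(self, name, args, outs=()):
        self.syms.append((name, tuple(args), tuple(outs)))

    def emit_copy(self, src, dst):
        # Give the copy a fresh output and remember it for the return symbol.
        out = f"copy_out{len(self.hidden_return_args)}"
        self.emit("copy_", (src, dst), (out,))
        self.hidden_return_args.append(out)
        return out

    def emit_return(self, *values):
        # The hidden args only create graph edges; they are not user-visible results.
        self.emit("return", (*values, *self.hidden_return_args))


trace = ToyTrace()
trace.emit("add", ("a", 1), ("t0",))
trace.emit_copy("t0", "a")
trace.emit_return()
```

Because the copy's output is recorded at the moment the copy is emitted, later passes need no knowledge of which operations are copies.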
🐛 Bug

`fuse_bound_symbols` (toposort on bsyms) puts `return` before `copy_` on arguments. This causes an AssertionError when the `torchcompile` or `cudagraphex` executor is applied to in-place operations.

Code sample
Before reaching this line, `TorchCompileExecution` generates

Returning `a` at the end of `f` does not fix this.

Cause
Inside `fuse_bound_symbols`, `Graph.__init__` constructs a dependency graph and applies a topological sort to reorder bound symbols. The dependency is found via `bsym.flat_args` and `bsym.flat_outs`, and `prims.copy_` has no output. Hence the algorithm does not spot the dependency between `copy_(t0, a)` and `return a`.

Although `nvfuser` uses this function too, it gets around this by forcefully moving the return statement after all the other bound symbols (here).

Related issue
#229 suggested adding `TorchCompileExecutor` to the tests along with `TorchExecutor` and `nvFuserExecutor`.