The current mega container we are using is: `gitlab-master.nvidia.com/kstephano/pytorch:torchbenchPerf_cg_dynamo_3`
PyTorch:

```
commit b18584c0fa3e559e2008c98bb9635d4ed87a52df (HEAD -> torchbenchPerf)
Merge: ad3cc5506e f56fc052c0
Author: Kevin Stephano <kevin.stephano@gmail.com>
Date:   Fri Oct 14 17:31:01 2022 +0000

    Merge branch 'torchbenchPerf' of https://github.com/csarofeen/pytorch into torchbenchPerf
```
which is effectively the current head of torchbenchPerf (no file diff):
```
| * commit f56fc052c09737ff31c0f58398243c42fec60165 (origin/torchbenchPerf)
| | Author: jjsjann123 <jiej@nvidia.com>
| | Date:   Fri Oct 14 10:16:09 2022 -0700
| |
| |     removing debug print
```
Note that we'd want to cherry-pick two more PRs for performance reasons: transpose changes (https://github.com/pytorch/pytorch/pull/86967) and silu_backward changes (https://github.com/pytorch/pytorch/commit/fc3afc840784106b173c87c95b1ee96a4018bb3d).
The container has Ivan's torchdynamo:
```
commit 1e72e26d29d7f07f13befa134785d494b6c7e830 (HEAD, test/nvfuser-cudagraphify)
Author: Ivan Yashchuk <ivan.yashchuk@aalto.fi>
Date:   Wed Oct 12 16:29:37 2022 +0300

    Use align_inputs from inductor
```
The conv/bias decomposition is there but not checked in; with a code diff, it's effectively:
```
commit 4f4ffba4a227b332a19121eeffc0b8b490e6d22c (jiej/wip)
Merge: 5491aa4d 1e72e26d
Author: jjsjann123 <alex.jann2012@gmail.com>
Date:   Thu Oct 13 16:41:59 2022 -0700

    Merge remote-tracking branch 'ivan/nvfuser-cudagraphify' into conv2d_decomp
```
I have Ivan's PR cherry-picked. This is the new branch head:
```
commit b3f4b3032deceabefbef681e56d99b5a6daa9da4 (csarofeen/torchbenchPerf)
Author: jjsjann123 <jiej@nvidia.com>
Date:   Fri Oct 14 15:04:15 2022 -0700

    cherry-picking PR https://github.com/pytorch/pytorch/pull/86967

commit 5863911099541f2a16d4699d6815850dae693cb1 (ivan/nvprims-transpose-partitioner)
Merge: 73669a71003 fc3afc84078
Author: Ivan Yashchuk <IvanYashchuk@users.noreply.github.com>
Date:   Fri Oct 14 22:58:55 2022 +0300

    Merge branch 'master' into nvprims-transpose-partitioner
```
Note some side issues that we've been discussing; we'd want to track these with upstream folks so we'll have better debuggability in future perf analysis:
- A `Set` is used, which could give us a fusion partition that ends up in the wrong order. I tend to think this one is safe, since the partitioner is only supposed to be queried on whether a given node is within a partition, but this is implementation specific and I should double check it before coming to a conclusion. Alternatively, it should be easy to switch it to a `Dict`, which is deterministic (see the sketch after this list).
- Upstream has switched backends for benchmark runs: https://github.com/pytorch/pytorch/pull/88437; linking it here to track our progress of upstreaming fixes.
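On the `Set` vs `Dict` point above, here is a minimal Python sketch (not the actual partitioner code; all names are made up for illustration) of why iterating a set can give a non-deterministic order while a dict keyed by node preserves insertion order, and why pure membership queries are unaffected either way:

```python
# Hypothetical illustration only: bookkeeping partition membership in a
# set vs. in a dict keyed by node.

nodes = [f"node_{i}" for i in range(5)]

# Set: iteration order follows hashing, not insertion order, so any code
# that iterates the container to rebuild the partition may see the nodes
# in an arbitrary order.
partition_set = set()
for n in nodes:
    partition_set.add(n)

# Dict: keys iterate in insertion order (Python 3.7+), so rebuilding the
# partition from the container is deterministic.
partition_dict = {}
for n in nodes:
    partition_dict[n] = None

print(list(partition_set))   # order not guaranteed
print(list(partition_dict))  # ['node_0', 'node_1', 'node_2', 'node_3', 'node_4']

# Pure membership queries ("is this node in the partition?") do not depend
# on iteration order, which is why the set may still be safe.
assert "node_3" in partition_set and "node_3" in partition_dict
```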
Note to self: the convolution(_backward) bias decomposition does not seem to help with our benchmark (https://github.com/pytorch/torchdynamo/pull/1645), so we won't be pushing this upstream.
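For context, a minimal sketch of the equivalence the conv/bias decomposition relies on (an illustration only, not the actual decomposition code): a biased convolution matches a bias-free convolution followed by a broadcasted add, which is what lets the add be fused separately.

```python
import torch
import torch.nn.functional as F

# Illustration only: conv2d with bias vs. bias-free conv2d plus a
# broadcasted add along the channel dimension.
x = torch.randn(2, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
b = torch.randn(4)

fused = F.conv2d(x, w, b, padding=1)
decomposed = F.conv2d(x, w, None, padding=1) + b.view(1, -1, 1, 1)

torch.testing.assert_close(fused, decomposed)
```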
### 🐛 Describe the bug
Just a common place to track all issues that we run into with our benchmark.
### Functional Issues
- `RuntimeError: rhs_i >= 0 && lhs_i >= 0 INTERNAL ASSERT FAILED` #2053 (fixed in `bug_fixes` branch, although not yet merged)
- `C++ exception with description "it != replay_CasP.getReplay().end() INTERNAL ASSERT FAILED` #2064 (not view related, also repros on devel)
- `false INTERNAL ASSERT FAILED ... compute_at_map.cpp ... Concrete ID failed to cover all root IDs. IDs ...` #2066
- `Reducing a tensor once it's gone under transformations is not permitted at this time. Please set reductions before calling split/merge/computeAt` #2067
- `!hasSelfMapping() INTERNAL ASSERT FAILED ... Unsupported domain mapping detected ...` #2068

New bugs round 2, with repros under the code base in https://github.com/csarofeen/pytorch/issues/2065#issuecomment-1279422075:

- `Vectorized accesses cannot be inline with computation, they are only supported with a Set operation.TensorView:` #2074
- `Validation failed on tensor ... The expanded dim and the dim before it can not be contiguous.` #2075
- `index_map.find(root_dom[i]) != index_map.end() ... Couldn't find root mapping` #2076 (fixed in #2089, but not yet merged)
- `Illegal Cast value from DataType: __half to DataType: bool` #2077
- `producer->getMemoryType() == MemoryType::Global INTERNAL ASSERT FAILED` #2080
- `error: no operator "xxx" matches these operands ... operand types are: CudaCodeGen::__half ...` #2088 (note that the issue is still real, but with the nvprims.native_batch_norm dtype promotion update we no longer run into this in benchmarks)
- `Missing Cast Op` #2087
- `Rfactor replay recieved an axis outside the number of dims in the tensor` #2094
- `thread_predicates_.find(tv_inp) != thread_predicates_.end() INTERNAL ASSERT FAILED ... Thread predicate map was not initialized, couldn't find...` #2110
- `expected scalar type Float but found Half` #2115
- `root_vals.find(inp) != root_vals.end()` #2116

Needs more triage to determine who should fix:

- Misaligned Address Error #2079

### Performance Issues
- `_to_copy()` operations #2098 (see the sketch below)
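As background on the `_to_copy()` item above, a minimal sketch (assuming `make_fx` tracing; not taken from the benchmark itself) of how a plain dtype cast typically shows up as `aten._to_copy` in a traced aten graph:

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    # A simple dtype cast followed by pointwise math.
    return x.to(torch.float32) + 1.0

# Tracing to aten ops; the cast typically appears as
# torch.ops.aten._to_copy.default in the printed graph.
gm = make_fx(f)(torch.randn(4, dtype=torch.half))
print(gm.graph)
```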
### Versions

`torchbenchPerf` branch.

Note that all issues are reproduced at the commit above. We might cherry-pick more commits with fixes; any new issue reported here should include its commit if it differs.