The current mega container we are using is: `gitlab-master.nvidia.com/kstephano/pytorch:torchbenchPerf_cg_dynamo_3`
PyTorch:

```
commit b18584c0fa3e559e2008c98bb9635d4ed87a52df (HEAD -> torchbenchPerf)
Merge: ad3cc5506e f56fc052c0
Author: Kevin Stephano <kevin.stephano@gmail.com>
Date:   Fri Oct 14 17:31:01 2022 +0000

    Merge branch 'torchbenchPerf' of https://github.com/csarofeen/pytorch into torchbenchPerf
```
which is effectively the current head of torchbenchPerf (no file diff):
```
| * commit f56fc052c09737ff31c0f58398243c42fec60165 (origin/torchbenchPerf)
| | Author: jjsjann123 <jiej@nvidia.com>
| | Date:   Fri Oct 14 10:16:09 2022 -0700
| |
| |     removing debug print
```
Note that we'd want to cherry-pick two more PRs for performance reasons: transpose changes (https://github.com/pytorch/pytorch/pull/86967) and silu_backward changes (https://github.com/pytorch/pytorch/commit/fc3afc840784106b173c87c95b1ee96a4018bb3d).
The container has Ivan's torchdynamo:
```
commit 1e72e26d29d7f07f13befa134785d494b6c7e830 (HEAD, test/nvfuser-cudagraphify)
Author: Ivan Yashchuk <ivan.yashchuk@aalto.fi>
Date:   Wed Oct 12 16:29:37 2022 +0300

    Use align_inputs from inductor
```
The conv/bias decomposition is there but not checked in; with a code diff, it's effectively:
```
commit 4f4ffba4a227b332a19121eeffc0b8b490e6d22c (jiej/wip)
Merge: 5491aa4d 1e72e26d
Author: jjsjann123 <alex.jann2012@gmail.com>
Date:   Thu Oct 13 16:41:59 2022 -0700

    Merge remote-tracking branch 'ivan/nvfuser-cudagraphify' into conv2d_decomp
```
I have Ivan's PR cherry-picked. This is the new branch head:
```
commit b3f4b3032deceabefbef681e56d99b5a6daa9da4 (csarofeen/torchbenchPerf)
Author: jjsjann123 <jiej@nvidia.com>
Date:   Fri Oct 14 15:04:15 2022 -0700

    cherry-picking PR https://github.com/pytorch/pytorch/pull/86967

commit 5863911099541f2a16d4699d6815850dae693cb1 (ivan/nvprims-transpose-partitioner)
Merge: 73669a71003 fc3afc84078
Author: Ivan Yashchuk <IvanYashchuk@users.noreply.github.com>
Date:   Fri Oct 14 22:58:55 2022 +0300

    Merge branch 'master' into nvprims-transpose-partitioner
```
Note some side issues that we've been discussing; we'd want to track these with upstream folks so we'll have better debuggability in future perf analysis:
- A `Set` is used, which could give us a fusion partition that ends up in the wrong order. I tend to think this one is safe, since the partitioner is only supposed to be queried on whether a given node is within a partition, but this is implementation specific and I should double check it before coming to a conclusion. Alternatively, it should be easy to switch it to a `Dict`, which is deterministic (see the sketch after this list).
- Upstream has switched backends for benchmark runs: https://github.com/pytorch/pytorch/pull/88437; linking it here to track our progress of upstreaming fixes.
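On the `Set` vs `Dict` point above, here is a minimal Python sketch (not the actual partitioner code; all names are made up for illustration) of why iterating a set can give a non-deterministic order while a dict keyed by node preserves insertion order, and why pure membership queries are unaffected either way:

```python
# Hypothetical illustration only: bookkeeping partition membership in a
# set vs. in a dict keyed by node.

nodes = [f"node_{i}" for i in range(5)]

# Set: iteration order follows hashing, not insertion order, so any code
# that iterates the container to rebuild the partition may see the nodes
# in an arbitrary order.
partition_set = set()
for n in nodes:
    partition_set.add(n)

# Dict: keys iterate in insertion order (Python 3.7+), so rebuilding the
# partition from the container is deterministic.
partition_dict = {}
for n in nodes:
    partition_dict[n] = None

print(list(partition_set))   # order not guaranteed
print(list(partition_dict))  # ['node_0', 'node_1', 'node_2', 'node_3', 'node_4']

# Pure membership queries ("is this node in the partition?") do not depend
# on iteration order, which is why the set may still be safe.
assert "node_3" in partition_set and "node_3" in partition_dict
```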
Note to self: the convolution(_backward) bias decomposition does not seem to help with our benchmark (https://github.com/pytorch/torchdynamo/pull/1645), so we won't be pushing this upstream.
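For context, a minimal sketch of the equivalence the conv/bias decomposition relies on (an illustration only, not the actual decomposition code): a biased convolution matches a bias-free convolution followed by a broadcasted add, which is what lets the add be fused separately.

```python
import torch
import torch.nn.functional as F

# Illustration only: conv2d with bias vs. bias-free conv2d plus a
# broadcasted add along the channel dimension.
x = torch.randn(2, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
b = torch.randn(4)

fused = F.conv2d(x, w, b, padding=1)
decomposed = F.conv2d(x, w, None, padding=1) + b.view(1, -1, 1, 1)

torch.testing.assert_close(fused, decomposed)
```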
### 🐛 Describe the bug
Just a common place to track all issues that we run into with our benchmark.
### Functional Issues
- `RuntimeError: rhs_i >= 0 && lhs_i >= 0 INTERNAL ASSERT FAILED` #2053 (fixed in `bug_fixes` branch, although not yet merged)
- `C++ exception with description "it != replay_CasP.getReplay().end() INTERNAL ASSERT FAILED` #2064 (not view related, also repros on devel)
- `false INTERNAL ASSERT FAILED ... compute_at_map.cpp ... Concrete ID failed to cover all root IDs. IDs ...` #2066
- `Reducing a tensor once it's gone under transformations is not permitted at this time. Please set reductions before calling split/merge/computeAt` #2067
- `!hasSelfMapping() INTERNAL ASSERT FAILED ... Unsupported domain mapping detected ...` #2068

New bugs round 2, with repros under the code base in https://github.com/csarofeen/pytorch/issues/2065#issuecomment-1279422075:

- `Vectorized accesses cannot be inline with computation, they are only supported with a Set operation.TensorView:` #2074
- `Validation failed on tensor ... The expanded dim and the dim before it can not be contiguous.` #2075
- `index_map.find(root_dom[i]) != index_map.end() ... Couldn't find root mapping` #2076 (fixed in #2089, but not yet merged)
- `Illegal Cast value from DataType: __half to DataType: bool` #2077
- `producer->getMemoryType() == MemoryType::Global INTERNAL ASSERT FAILED` #2080
- `error: no operator "xxx" matches these operands ... operand types are: CudaCodeGen::__half ...` #2088 (note that the issue is still real, but with the nvprims.native_batch_norm dtype promotion update we no longer run into this in benchmarks)
- `Missing Cast Op` #2087
- `Rfactor replay recieved an axis outside the number of dims in the tensor` #2094
- `thread_predicates_.find(tv_inp) != thread_predicates_.end() INTERNAL ASSERT FAILED ... Thread predicate map was not initialized, couldn't find...` #2110
- `expected scalar type Float but found Half` #2115
- `root_vals.find(inp) != root_vals.end()` #2116

Needs more triage to determine who should fix:

- Misaligned Address Error #2079

### Performance Issues
- `_to_copy()` operations #2098 (see the sketch below)
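As background on the `_to_copy()` item above, a minimal sketch (assuming `make_fx` tracing; not taken from the benchmark itself) of how a plain dtype cast typically shows up as `aten._to_copy` in a traced aten graph:

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    # A simple dtype cast followed by pointwise math.
    return x.to(torch.float32) + 1.0

# Tracing to aten ops; the cast typically appears as
# torch.ops.aten._to_copy.default in the printed graph.
gm = make_fx(f)(torch.randn(4, dtype=torch.half))
print(gm.graph)
```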
### Versions

`torchbenchPerf` branch.

Note that all issues are reproduced at the commit above. We might cherry-pick more commits with fixes; any new issue reported here should include its commit if it differs.