Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

X86 ChromeOS devices with kernel 4.4 fails to boot after 4b0aa5724feaa89a9538dcab97e018110b0e4bc3 #46437

Closed Quuxplusone closed 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR47468
Status RESOLVED FIXED
Importance P enhancement
Reported by Jian Cai (caij2003@gmail.com)
Reported on 2020-09-08 17:56:10 -0700
Last modified on 2020-09-22 02:45:14 -0700
Version trunk
Hardware PC All
CC caij2003@gmail.com, cmtice@google.com, craig.topper@gmail.com, dblaikie@gmail.com, dmajor@bugmail.cc, hans@chromium.org, jyknight@google.com, llozano@chromium.org, llvm-bugs@lists.llvm.org, manojgupta@google.com, natechancellor@gmail.com, ndesaulniers@google.com, srhines@google.com
Fixed by commit(s)
Attachments free_user_ns__user_namespace.ll (35937 bytes, text/plain)
repro.ll (1150 bytes, text/plain)
Blocks PR4068, PR46725
Blocked by
See also
While building with recent LLVM versions, we found X86 ChromeOS images with
kernel 4.4 all failed to boot. We bisected it to the following commit.

commit 4b0aa5724feaa89a9538dcab97e018110b0e4bc3
Author: James Y Knight <jyknight@google.com>
Date:   Fri May 15 23:43:30 2020 -0400

Change the INLINEASM_BR MachineInstr to be a non-terminating instruction.

Link: https://crbug.com/1123683
Quuxplusone commented 4 years ago

60433c63acb71935111304d71e41b7ee982398f8 (Remove TwoAddressInstructionPass::sink3AddrInstruction. ) was supposed to be a fix but as per https://crbug.com/1123683 , the boot issue is still present after it.

Quuxplusone commented 4 years ago

Are you saying that the issue described in https://github.com/ClangBuiltLinux/linux/issues/1085 was not actually fixed?

Or is this a distinct issue that didn't exhibit there for some reason?

Quuxplusone commented 4 years ago

Right, I applied https://reviews.llvm.org/rG60433c63acb71935111304d71e41b7ee982398f8 on http://crrev.com/c/2372837/14 and the issue still reproduced. Another later version LLVM that included 0433c63acb71935111304d71e41b7ee982398f8 also had the same boot failure http://crrev.com/c/2361864/2.

Quuxplusone commented 4 years ago

The second link doesn't have any test results? Or do you mean you manually tested it at that revision?

Can you give some more detail? I'm not familiar with chrome's build system.

Perhaphs you could follow the same steps Nick used in the bug I linked to, to find what changed in the assembly output?

Quuxplusone commented 4 years ago

< The second link doesn't have any test results? Or do you mean you manually tested it at that revision?

< Can you give some more detail? I'm not familiar with chrome's build system.

Sorry an example of the failure can be found as follows. Search for "System rolled back" in the log.

https://00e9e64bac44fd8b1f073164b070e39c7d3b67d7d5b582cb85-apidata.googleusercontent.com/download/storage/v1/b/chromeos-autotest-results/o/swarming-4e5b5cee0a6c6a10%2Fprovision_chromeos6-row6-rack4-host11%2Fdebug%2Fautoserv.DEBUG?qk=AD5uMEs2QtbXhnrDyvEWABvDxkI31TZKCEntN_pMZC_YxVXitalTlHGSfO1zzo_oipdCAkEu61HjA6U3J8F7Ic7xGE9xfpc1eZ6PPWPbKnylwnx7qVIENybBBiRRXlmPaIBwFzWsapdcd5efVuhJeTND3ElAb_1zKOnAkN9aKrE6n1tnMXzA3iVNRSX_DsLyEfufkGd8MyE9J6Yy2Jyqx9WFyHleb-_e-Mzpj9B40zsrihD5YUX1Qd_Pwaoi_r3yCatzHTBDLMncK4Kd7pWYcVrI6i3cwsJONChaqC7iWFkCD9i572ctlCLZ3ARCpDKNVjENXtpV9awJGVihZrHEXS4rR-SsTO7dfqODQa7GxovexnOfTb9xqMDhjCuoZ4pqY6Vfcaw9vgfqm3dYFop2bhZWDiGTZOPJHLPc4MLUEuW0T7mxg59bQQxl47caJYP6579czEGIlGgojj6qn_GdLs2QwRwNOICUMcqYGmDGaCQZRdnw59xDLUHD9qWrR8lr-QORk86qNRnIdPlwa3eXQI7eXSm2MmvGyCTX-YS5Z_PSp2-qSirYAv5fIcOetZL12vV-0rjurU-TN3GJ9zxBuJqzbwZSvyVXQ8q5WwNWkia7a-oM_yRShQGBCnzN9xBKS_7B8a_m8TNdRSycpzKzlt6MNGAmGYz-oYDMmtEwRSysgoW8qRNZ_7pc8UHCnZ9NyrPKOzLjkk96pBcdQ1677n5rRRcuE68SGd3mHNUN8C5AGG6k-un_Av5iNMCX3sMeRHRn3eL27nFaHZGaTZ_PRqhLCdZd-B0i95_YKJnW1NkGa5QMBtnn1Um9XnFs4ZDVgf0Z0apiQ8aDPaRnrzWNTpwCiYgEL_PiSMTtQKXJ8Rz4zCOr0aLBSK2v51yxSkJPkser3lBFmsq3RtOyArhuZ4KuyhZJosDi1Wpi-Ti9DnzLq6zwRqt6Bms&isca=1

< Perhaphs you could follow the same steps Nick used in the bug I linked to, to find what changed in the assembly output?

SG. I'm working with Nick on this.

Quuxplusone commented 4 years ago

Is there any news on this?

Quuxplusone commented 4 years ago

We are still actively debugging this. We have found that the devices actually boot the first time but reboot when system UI is restarted for any reason e.g. tests. And the problem indeed starts from the mentioned LLVM patch.

Lack of access to a physical device has unfortunately made debugging really slow.

Quuxplusone commented 4 years ago

As it's currently looking, I don't think we can hold the release for this.

Quuxplusone commented 4 years ago

We were able to get physical access to the device.

We have bisected down to the object file affected.

I'm working through pinpointing where things are going wrong now.

Quuxplusone commented 4 years ago

Specifically, the linux-4.4.y branch of the stable tree of the Linux kernel with CONFIG_USER_NS enabled; kernel/user_namespace.o seems to the the culprit of an object file bisection performed by Jian. Still digging in what's broken about it.

Quuxplusone commented 4 years ago

Comparing the disassemblies between:

  1. commit 78c69a00a4cff786e0
  2. commit 78c69a00a4cff786e0 + cherry pick 4b0aa5724feaa + cherry pick 78c69a00a4cff

I can see a few differences, which look pretty obviously wrong to me, for example the relatively small fn free_user_ns:

example:

-     2ab: 4c 89 f3                             movq    %r14, %rbx
-     2ae: f0                                   lock
-     2af: 41 ff 8e c0 00 00 00                 decl    192(%r14)
-     2b6: 74 d2                                je      0x28a <free_user_ns+0xa>
+     2ab: f0                                   lock
+     2ac: 41 ff 8e c0 00 00 00                 decl    192(%r14)
+     2b3: 74 d5                                je      0x28a <free_user_ns+0xa>
+     2b5: 4c 89 f3                             movq    %r14, %rbx

so the mov was occurring unconditionally before the dec, now it's conditional (oops) and after the dec (double oops)

Since free_user_ns is relatively small, should make the reproducer concise and getting it easy.

Quuxplusone commented 4 years ago

Attached free_user_ns__user_namespace.ll (35937 bytes, text/plain): free_user_ns__user_namespace.ll

Quuxplusone commented 4 years ago

Attached repro.ll (1150 bytes, text/plain): repro.ll

Quuxplusone commented 4 years ago
fwiw, dumping the output of llc -O2 -print-before-all repro.ll -o - to a file
and comparing that before vs after the changes in question, things start to go
visibly wrong after

Eliminate PHI nodes for register allocation

pass.  We have:

-  $rbx = MOV64rr $r14 ...
-  INLINEASM_BR ...
+  INLINEASM_BR ...
+  $rbx = MOV64rr killed $r14

(so we were doing mov then inline asm blob; after we have those reversed, which
is bad)
Quuxplusone commented 4 years ago
The bug is in llvm::findPHICopyInsertPoint.

We're correctly hitting the special case for placement in a block with asm-goto
or invoke. Then, the code places the copy after the last use or def of the
register. Unfortunately, placing after the last use is wrong -- inlineasm_br
uses the register, in this example. Yet, the whole point of the special-case is
to find a place to insert the copy _before_ that instruction.

I believe just placing the copy after last DEF, not after the last use or def,
will be a correct fix. E.g.:

--- a/llvm/lib/CodeGen/PHIEliminationUtils.cpp
+++ b/llvm/lib/CodeGen/PHIEliminationUtils.cpp
@@ -34,7 +34,7 @@ llvm::findPHICopyInsertPoint(MachineBasicBlock* MBB,
MachineBasicBlock* SuccMBB,
   // Discover any defs/uses in this basic block.
   SmallPtrSet<MachineInstr*, 8> DefUsesInMBB;
   MachineRegisterInfo& MRI = MBB->getParent()->getRegInfo();
-  for (MachineInstr &RI : MRI.reg_instructions(SrcReg)) {
+  for (MachineInstr &RI : MRI.def_instructions(SrcReg)) {
     if (RI.getParent() == MBB)
       DefUsesInMBB.insert(&RI);
   }

However, that will have the effect of placing the copy earlier than necessary
in the block, which might increase register pressure. I'm going to try to make
a slightly better fix.
Quuxplusone commented 4 years ago

Thanks for the fix. I've verified locally it fixed the issue in a test device.

Quuxplusone commented 4 years ago

I've got the "Slightly better fix" posted for review now: https://reviews.llvm.org/D87865

Please verify this works for you as well, thanks!

Quuxplusone commented 4 years ago
Thanks!

Jian and Manoj are testing this on CrOS kernels.
Nathan is testing this on various linux kernel distro's configs.
I'm testing this with kernel patches that use asm goto w/ outputs, and our 65
various kernel tree and configs: https://travis-
ci.com/github/ClangBuiltLinux/continuous-integration.

It will take me at least an hour and a half to two hours to get through all of
those.

Hans, I suspect we can have the patch ready and well qualified by tonight (US
time, maybe next 6-8 hours).
Quuxplusone commented 4 years ago

I verified https://reviews.llvm.org/D87865 fixed the issue locally.

Quuxplusone commented 4 years ago

The verification I've done so far is limited to one test device. Will start a more complete test with the patch.

Quuxplusone commented 4 years ago

All of my tests with this on ToT LLVM look good. Going to rerun them again with this on top of release/11.x.

Quuxplusone commented 4 years ago

All of my kernel build+boot tests with release/11.x + https://reviews.llvm.org/D87865 passed.

Quuxplusone commented 4 years ago

Fix committed to master as f7a53d82c0902147909f28a9295a9d00b4b27d38, which cherry-picks cleanly to 11.x.

@Hans: shall I push to 11.x?

Quuxplusone commented 4 years ago
Manoj reports the CrOS tests look good:

https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/2417398/3#message-18cdbb29241e515a94ffc182889a7c9c2b335ec7

Nathan reports clang-11+patch and ToT+patch look good:

tot: https://paste.myself5.de/raw/aciqosofos
clang-11: https://del.dog/raw/xapegagywo
(tot showed one failure, but that's due to a missed cherry pick for an
unrelated breakage upstream in kernel sources)

The fixed was merged on master in:
https://reviews.llvm.org/rGf7a53d82c0902147909f28a9295a9d00b4b27d38

Let's get this cherry-picked over to release/11.x and close this out. Hans?
Quuxplusone commented 4 years ago
Manoj notes that the fix will depend on

commit a0385bd7acd6 ("[llvm] Add contains(KeyType) -> bool methods to
SmallPtrSet")

which doesn't look like it's in release/11.x to me.

So we'll likely want to cherry pick the following 2 patches to close this out:

1:
commit a0385bd7acd6 ("[llvm] Add contains(KeyType) -> bool methods to
SmallPtrSet")

2:
commit f7a53d82c090 ("PR47468: Fix findPHICopyInsertPoint, so that copies
aren't incorrectly inserted after an INLINEASM_BR.")
Quuxplusone commented 4 years ago

Cherry-picked to 11.x as 410b0dc84bbdafabe3a2c3eedd96e50340a6e0d0 and 6250d4944539f67d6a605928e97c087fe306a79e. Thanks everyone!