Open nickalcock opened 2 years ago
Yes, it would be good to compare your binaries with known-good ones. diffoscope can often show what's wrong.
Do you have root on some box? Then you could also check whether the chroot and qemu modes work.
P.S. also feel free to visit #bootstrappable on libera.chat if you want more interactive help.
Exactly. I'll see if I can generate good binaries via qemu, though I'd be surprised if that would work: I do wonder if the problem is file sort order or something, i.e. down to the underlying filesystem, xfs versus qemu? hmm that's easy to test, will do. (I have root across this local network, so that should be good enough. It looks like qemu mode doesn't do anything dangerous or crazy. Well, more dangerous and crazy than this project as a whole :) )
Confirmed that it only goes wrong under --bwrap. Still trying to figure out where --qemu mode writes to so I can diffoscope the artifacts: it's not writing to tmp or sysc/tmp or even sysc/tmp/disk.img even with the tmpfs mounting forcibly disabled.
qemu mode runs in tmpfs during the sysa stage and later on the virtual disk in the sysc stage, so getting artifacts out is a bit tricky. You would have to transfer them to sysc first... Perhaps it would be easier if I publish a good file somewhere.
What about chroot mode? Does that give you the correct checksum?
But it might indeed be related to the underlying filesystem...
A bit more info: not creating a tmpfs in the bwrap stage makes the error go away! (to be replaced by another error, which isn't too surprising after I did that). So this must be a difference in the behaviour of tmpfs between the sysa qemu image (which for me is based on a 5.10.0 defconfig kernel) and the host kernel (5.16.19 tmpfs, 64-bit). I'll arrange to copy the file in question off the tmpfs before deleting it... let's see.
Failure confirmed intermittent, happening about 50% of the time. The difference is that my faulty copy of libtcc1.a has four more null bytes at the end. This seems to be pure padding: it's not represented in the size of the archive's lone element at all. This almost has to be something up with tcc 0.9.26's tcc_tool_ar, I'd think.
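One way to check for this kind of stray padding without a hex editor is to walk the archive's member headers and see what is left over after the last one. The following is only a minimal sketch, assuming the common Unix ar layout (an 8-byte global magic, 60-byte member headers, members aligned to even offsets); `make_member` is a hypothetical helper used just to build demo input, and nothing here comes from tcc or live-bootstrap.

```python
AR_MAGIC = b"!<arch>\n"   # global header of a Unix ar archive
HEADER_LEN = 60            # fixed size of each member header


def make_member(name: str, payload: bytes) -> bytes:
    """Hypothetical helper: build one ar member, for demonstration only."""
    # Fields: name(16) mtime(12) uid(6) gid(6) mode(8) size(10) magic(2)
    header = f"{name:<16}{0:<12}{0:<6}{0:<6}{'644':<8}{len(payload):<10}`\n"
    padding = b"\n" if len(payload) % 2 else b""
    return header.encode("ascii") + payload + padding


def trailing_bytes(data: bytes) -> bytes:
    """Return whatever follows the last well-formed member of an ar archive."""
    if not data.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    pos = len(AR_MAGIC)
    while pos + HEADER_LEN <= len(data):
        header = data[pos:pos + HEADER_LEN]
        if header[58:60] != b"`\n":
            break  # not a member header: everything from here on is stray
        size = int(header[48:58].decode("ascii").strip())
        pos += HEADER_LEN + size
        pos += pos % 2  # member data is padded to an even offset
    return data[pos:]


if __name__ == "__main__":
    good = AR_MAGIC + make_member("libtcc1.o/", b"\x7fELF")
    bad = good + b"\0" * 4  # simulate the four extra null bytes
    print(len(trailing_bytes(good)), len(trailing_bytes(bad)))
```

Running `trailing_bytes(open("libtcc1.a", "rb").read())` on the good and the faulty copy would show whether the extra bytes really sit outside every member, as described above.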
Hmm, that is strange. I was expecting something like an ordering issue, not padding.
Perhaps another useful data point would be to check whether that happens with all libtcc1.a stages. Unfortunately, we only checksum the last one, but the mescc->tcc-0.9.26 step actually involves 5 rebuilds.
Good idea: I can use the same "stuff a cp into sysc_image/tmp" kludge I used for this to smuggle all five out in both the failing build and the qemu build. I'll look at it once this stupid cold has gone away :(
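Once the intermediate archives have been copied out of both builds, comparing them stage by stage is simple. This is a rough sketch only: the directory layout and file names are assumptions (live-bootstrap does not name the intermediate copies this way), and `compare_stages` is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def compare_stages(good_dir, bad_dir, pattern="*.a") -> dict:
    """Map each file name in good_dir to True if bad_dir holds an identical copy."""
    results = {}
    for good in sorted(Path(good_dir).glob(pattern)):
        bad = Path(bad_dir) / good.name
        results[good.name] = bad.exists() and sha256_of(good) == sha256_of(bad)
    return results


if __name__ == "__main__":
    # Assumed layout: one directory per build, one smuggled-out copy per rebuild.
    for name, same in compare_stages("qemu-artifacts", "bwrap-artifacts").items():
        print(name, "match" if same else "DIFFERS")
```

The first stage reported as differing would be the place to point diffoscope at.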
@nickalcock does this still reproduce?
Seen with trunk in my first attempt to do a live-bootstrap with this package. 64-bit x86-64 box, building with bwrap via:
PATH=/usr/src/live-bootstrap/bwrap:$PATH ./rootfs.py --bwrap
(I have to point PATH through a directory that contains a non-setuid bwrap because the setuid one refuses to allow CAP_SETPCAP wrapping.)
Here's the end of the bootstrap process, including at least one thing that had a correctly-validated checksum:
I don't know where to start debugging this because I don't have an instance that works to work from. Clearly codegen is broken, but where? (I can provide the probably-broken binaries to anyone who wants them.)