bedrocklinux / bedrocklinux-userland

This tracks development for the things such as scripts and (defaults for) config files for Bedrock Linux
https://bedrocklinux.org
GNU General Public License v2.0

brl status says bedrock is broken and i cant repair it or fetch on arch 0.7.23 Poki #237

Closed asmitir closed 2 years ago

asmitir commented 2 years ago

After doing brl status, it shows that bedrock is broken. i can't fetch, update or repair, it doesnt display any errors.

how could i fix that? please help

paradigm commented 2 years ago

It looks like you've spammed this support request all over the place: here, LQ, multiple reddit threads, and possibly other places I've missed. If you don't mind deleting all but one, I can help you wherever that one place is. This happened to be the first one I saw, so I picked here arbitrarily; we can move to another if you prefer.

I've not seen this issue before and don't have an obvious lead to follow. Let's start with a broad dragnet. Run

/bedrock/bin/brl report /tmp/log

then provide me the contents of /tmp/log. I'll walk through them and see if anything sticks out.

asmitir commented 2 years ago

hello, thanks for the answer. about the multiple requests, sure lets proceed here.

here are the contents of /tmp/log: https://pastebin.com/ffjf8ryF

paradigm commented 2 years ago

Some background on what brl status is complaining about here:

Bedrock makes heavy use of mount points to redirect filesystem requests to ensure they go to the right place. Mount points can have a property called "shared" (or perhaps "shared subtree") in which new mount and unmount operations within that mount get propagated to other mount points; see here and here if you're curious about the specifics, but for debugging this it probably doesn't matter beyond understanding that there's an attribute set on mount points.

Bedrock is picky about where this shared attribute is set. Your brl status output indicates four mount points have it set that brl status thinks should not. From your brl report log, it also looks to me like this attribute is indeed set on those mount points for some reason, and your system is properly configured to not want it set there; it doesn't look like brl status is hitting a false positive.
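For readers who want to see the attribute for themselves, here is a minimal sketch that lists mount points tagged as shared. It assumes a Linux system where /proc/self/mountinfo is available; the function name list_shared_mounts is made up for this illustration and is not part of Bedrock, and this is not how brl status performs its own check.

```shell
#!/bin/sh
# Hypothetical helper (not part of Bedrock): print the mount point of every
# entry in a mountinfo-format file that carries a "shared:N" tag.
# In mountinfo, field 5 is the mount point and the optional fields start at
# field 7 and run until a lone "-" separator.
list_shared_mounts() {
    awk '{
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:[0-9]+$/) {
                print $5
                break
            }
        }
    }' "${1:-/proc/self/mountinfo}"
}

# With no argument this inspects the mounts visible to the current process.
list_shared_mounts
```

Passing a file name instead reads a saved copy of mountinfo, which can be handy when comparing against a log captured earlier.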

Normally brl repair would be able to remove this shared attribute on such mount points and make brl status happy. Let's enable debug on brl repair and see where it's going astray. With root permissions, open /bedrock/libexec/brl-repair. On the very first blank line, right under the copyright comment block, add set -x. With a bit of context that area should look like:

# Repairs broken strata
set -x
. /bedrock/share/common-code

Then run

brl repair 2>&1 | tee /tmp/log2

and provide me /tmp/log2.
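As a side note on why the command uses 2>&1: set -x writes its trace to standard error, so without the redirect tee would only capture normal output. A small sketch, unrelated to Bedrock itself and using a made-up function name trace_demo, shows the difference:

```shell
#!/bin/sh
# Hypothetical demo: run a tiny traced script and merge stderr into stdout,
# the same way "brl repair 2>&1 | tee ..." does.
trace_demo() {
    sh -c 'set -x; echo hello' 2>&1
}

# The trace line (prefixed with "+ ") appears only because of 2>&1.
trace_demo

# Without the redirect, a pipe or command substitution sees just "hello".
plain="$(sh -c 'set -x; echo hello' 2>/dev/null)"
echo "stdout only: $plain"
```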

asmitir commented 2 years ago

ok here is the result of /tmp/log2: https://pastebin.com/JV4jNKuK

and here is how it looks like: https://imgur.com/e8sa4Uz.png

paradigm commented 2 years ago

Your set -x placement looks good. However, I dropped a term in my previous instructions. I need

brl repair bedrock 2>&1 | tee /tmp/log2

with the bedrock stratum specified. Give that a go and I'll take a look.

asmitir commented 2 years ago

another error occurred this time :/ https://pastebin.com/RQEW4swk

running as sudo.

paradigm commented 2 years ago

Before we forget, remove the set -x line you added earlier. It's no longer needed.

Did the debug brl repair run exit (so you get another prompt) or did it hang (so you have to ctrl-c to make it stop)? My bet is it hung.

Some Bedrock commands could fight with each other if they run at the same time. To make sure this doesn't happen, Bedrock uses a lock file to force newly launched Bedrock commands to wait until any running ones have finished. It does this via the flock command. Looking back at your brl report log, you had four flock commands queued up.

What I think is happening here is something is grabbing the lock and not releasing it. This means any other commands that need the lock just sit there waiting indefinitely. I've never seen this happen before.
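To make the locking behaviour concrete, here is a rough sketch of the general flock pattern: one process takes an exclusive lock on a file descriptor, and a second attempt cannot get it until the first lets go. It assumes the util-linux flock command is installed; the temporary lock file and the function name flock_demo are made up for the illustration, and this is not Bedrock's actual locking code.

```shell
#!/bin/sh
# Hypothetical demo of flock on a file descriptor, using a throwaway lock file.
flock_demo() {
    lockfile="$(mktemp)"

    # Holder: open the lock file on fd 9, take an exclusive lock, keep it ~2s.
    (
        exec 9>"$lockfile"
        flock -x 9
        sleep 2
    ) &
    holder=$!
    sleep 1   # crude wait so the holder has the lock before we contend

    # Contender: -n makes flock fail immediately instead of waiting in line.
    if ( exec 9>"$lockfile"; flock -n -x 9 ); then
        echo "while held: acquired"
    else
        echo "while held: busy"
    fi

    # Once the holder exits, its descriptor closes and the lock is released.
    wait "$holder"
    if ( exec 9>"$lockfile"; flock -n -x 9 ); then
        echo "after release: acquired"
    else
        echo "after release: busy"
    fi
    rm -f "$lockfile"
}

flock_demo
```

Without -n, the contender would simply block until the lock is free, which matches the hang described above when nothing ever releases it.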

Try rebooting to get the locks in an expected configuration. Once you've rebooted and before running any other Bedrock commands try running

/bedrock/libexec/busybox ps -o 'pid,ppid,comm,args' 2>&1 | tee /tmp/log3

and give me the output of /tmp/log3. With luck we'll see a Bedrock command sitting there holding the lock that shouldn't be, which will give us a lead to figure out why that's happening.

asmitir commented 2 years ago

Before we forget, remove that set -x set made earlier. It's no longer needed.

done.

Did the debug brl repair run exit (so you get another prompt) or did it hang (so you have to ctrl-c to make it stop)? My bet is it hung.

you guessed it lol.

the log of /bedrock/libexec/busybox ps -o 'pid,ppid,comm,args' 2>&1 | tee /tmp/log3:

https://pastebin.com/An05uKzq

paradigm commented 2 years ago

I see a brl repair call (pid 4495) that appears to be blocked waiting for the lock (via flock with pid 4495). This isn't surprising, as Bedrock configures systemd to run a brl repair on boot to undo some systemd changes. In fact, brl status is probably complaining because this brl repair never gets past the lock step to where it can apply these changes. However, I don't see any other Bedrock process that could be holding the lock brl repair is waiting for. The other Bedrock processes listed - etcfs and crossfs - don't lock.

I offer you two options:

  1. We can keep debugging if you want. Next step would be to try to use lsof to find what is holding the lock. I can provide instructions accordingly.
  2. If you are tired of debugging and just want to get things working, we can probably work around the issue by deleting the lock at boot. You can just run (as root) rm /bedrock/var/lock. New Bedrock processes will re-create it in an unlocked state and should work. I can help you automate this at the right time if you don't want to run it manually every boot.

Let me know what you'd prefer.

asmitir commented 2 years ago

im tempted to choose the second option, but im interested in fixing it now hahaha.

btw, do you have any idea of why this is happening? how can i work around this next time im installing bedrock? Im using my old arch machine, so it isnt a clear install, this might be the reason, since i tested bedrock before and got the exact same thing on another machine running arch.

paradigm commented 2 years ago

im tempted to choose the second option, but im interested in fixing it now hahaha.

I'm happy to go either way, just let me know which you prefer.

btw, do you have any idea of why this is happening? how can i work around this next time im installing bedrock?

No idea. I've never seen this happen before.

Im using my old arch machine, so it isnt a clear install, this might be the reason, since i tested bedrock before and got the exact same thing on another machine running arch.

Am I interpreting you correctly in saying you've tried Bedrock by hijacking Arch twice, and this happened both times? It might be something specific with your pre-hijack Arch setup, although I'm at a loss for what specifically that would be. If it wasn't a minimal/fresh install, you could try installing Arch (or some other distro) and hijacking it before doing any setup. Then go and do your normal setup and, step-by-step, check if the issue reproduces (which may require a reboot to actually check properly) to see if you can find what you're doing that's triggering the issue. If you do figure it out this way, do let me know.


In the next major Bedrock version - 0.8.X - the corresponding code will be a bit lower level. I might be able to have it tell us which process holds the lock when it tries to get a lock, which we can then use to debug this if it happens there. It'll be difficult to add that into the existing 0.7.X code, sadly.

asmitir commented 2 years ago

I'm happy to go either way, just let me know which you prefer.

sure, if you can help me proceed with the debugging it would be much appreciated.

and this happened both times?

yea, two different arch machines.

If you do figure it out this way, do let me know

ill try doing this in a vm another time, if i find any thing interesting ill proceed by posting on the reddit page whats causing this problem so in the future other people dont come across this problem as well.

im looking forward to 0.8!

paradigm commented 2 years ago

sure, if you can help me proceed with the debugging it would be much appreciated.

Open /bedrock/share/common-code with root permissions. In it, find the lock() function. At the very end of the function is this line:

    flock ${nonblock:-} -x 9

Let's put some debug output before and after it so the end of the function looks like this:

    echo "$$ locking ${dir}" >> /tmp/lock-log
    flock ${nonblock:-} -x 9
    echo "$$ acquired lock ${dir}" >> /tmp/lock-log

then reboot. Hopefully once you've booted and logged in, /tmp/lock-log will let us know the last process ID which acquired the lock, as well as any that are still waiting in line for it.
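Reading such a log by eye works, but a small summary can help when it gets long. The sketch below assumes the two line formats produced by the echo lines above ("PID locking DIR" and "PID acquired lock DIR"); the function name summarize_lock_log is made up for this illustration. Since releases are not logged, the last PID to acquire the lock is only the most likely holder, not a certain one.

```shell
#!/bin/sh
# Hypothetical helper: summarize a lock log written by the debug echo lines.
# Reports the last PID that logged "acquired lock" and every PID that logged
# "locking" without a matching "acquired lock" afterwards.
summarize_lock_log() {
    awk '
        $2 == "locking"                  { waiting[$1] = 1; order[++n] = $1 }
        $2 == "acquired" && $3 == "lock" { delete waiting[$1]; holder = $1 }
        END {
            print "last to acquire: " (holder == "" ? "none" : holder)
            for (i = 1; i <= n; i++) {
                if (order[i] in waiting) {
                    print "still waiting: " order[i]
                    delete waiting[order[i]]
                }
            }
        }
    ' "${1:-/tmp/lock-log}"
}

# Only summarize if the log exists on this machine.
[ -f /tmp/lock-log ] && summarize_lock_log /tmp/lock-log
```

The PIDs it prints can then be matched against the pid column of the ps output.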

After the reboot but before running any other Bedrock commands, provide me both /tmp/lock-log and another

/bedrock/libexec/busybox ps -o 'pid,ppid,comm,args' 2>&1 | tee /tmp/log4

run output so I can compare the process ID in lock-log against process names/arguments.

asmitir commented 2 years ago

contents of /tmp/lock-log https://pastebin.com/S4suk6A5

and /tmp/log4 https://pastebin.com/gGHHwH2a

paradigm commented 2 years ago

At the end we see PID 4500 waiting to acquire the lock. From your ps output, that's the brl repair and flock as expected.

The first two lines show PID 2897 has acquired the lock. I can't find it in your ps output, though. When a process and its children die, the lock should be released automatically. The fact the PID no longer exists means the lock should have been freed. My only guess is that whatever acquired the lock created child processes before it died, and those children are still alive.
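That guess can be reproduced in miniature. In the sketch below, a short-lived parent takes the lock and exits right away, but a background child that inherited the descriptor keeps the lock held until it exits too. It assumes the util-linux flock command; the temporary lock file and the function name orphan_lock_demo are made up, and it only illustrates the mechanism, not what actually happened on this system.

```shell
#!/bin/sh
# Hypothetical demo: a lock outliving the process that took it, because a
# child process inherited the locked file descriptor.
orphan_lock_demo() {
    lockfile="$(mktemp)"

    # "Parent": take the lock on fd 9, start a background child, exit at once.
    # The child (sleep) inherits fd 9 and therefore keeps the lock alive.
    (
        exec 9>"$lockfile"
        flock -x 9
        sleep 3 >/dev/null 2>&1 &
    )

    sleep 1
    if ( exec 9>"$lockfile"; flock -n -x 9 ); then
        echo "parent gone, child alive: free"
    else
        echo "parent gone, child alive: still locked"
    fi

    sleep 3   # by now the child has exited and closed its copy of fd 9
    if ( exec 9>"$lockfile"; flock -n -x 9 ); then
        echo "child gone: free"
    else
        echo "child gone: still locked"
    fi
    rm -f "$lockfile"
}

orphan_lock_demo
```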

The lock is tracked as a file descriptor for /bedrock/var/lock, and lsof tells us which programs have which file descriptors open. Let's try this:

lsof 2>&1 | tee /tmp/log5

Hit me with log5.
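A narrower alternative to a full lsof dump is to look only for processes that have one particular file open. The sketch below does that by walking /proc directly; it assumes a Linux /proc filesystem and should be run as root to see other users' processes. The function name find_fd_holders is made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the PID of every process with an open file
# descriptor pointing at the given path, by reading /proc/PID/fd symlinks.
find_fd_holders() {
    target="$1"
    for fd in /proc/[0-9]*/fd/*; do
        # readlink fails for descriptors we may not inspect or that vanished.
        link="$(readlink "$fd" 2>/dev/null)" || continue
        if [ "$link" = "$target" ]; then
            pid="${fd#/proc/}"
            echo "${pid%%/*}"
        fi
    done | sort -un
}

# Prints nothing if no visible process has the Bedrock lock file open.
find_fd_holders /bedrock/var/lock
```

The resulting PIDs can be looked up in the ps output to see which command each one belongs to.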

asmitir commented 2 years ago

heres the log5, file is very big doe.

https://files.catbox.moe/7ie3me

paradigm commented 2 years ago

None of those 113899 open file entries were the Bedrock lock file. While we were going back and forth here, I tried hijacking Arch and couldn't reproduce the issue. The only idea I have left to debug this is for you to try hijacking a minimal, fresh Arch install in a VM, then configuring the system while tediously rebooting and checking brl status between each step to see what's causing the issue. I understand that's a pain; don't feel obligated to do so if you don't have the patience, and no rush on it if you do decide to give it a go.

As a work-around, you could probably make a systemd unit file with

ExecStart=rm /bedrock/var/lock

which should remove the lock and ensure any following Bedrock code uses a fresh lock.
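For anyone wanting a starting point, here is one way such a unit might be written out. Everything beyond the rm of /bedrock/var/lock is an assumption for illustration: the unit file name, the Description, Type=oneshot, the -f flag (so the unit does not fail when the lock is absent), and the WantedBy target. In particular, how this should be ordered relative to the brl repair that Bedrock runs at boot is not settled in this thread. The function name write_unlock_unit is also made up.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal systemd unit that removes the Bedrock
# lock file. All unit settings other than the rm target are assumptions.
write_unlock_unit() {
    cat > "$1" <<'EOF'
[Unit]
Description=Remove stale Bedrock lock file (workaround)

[Service]
Type=oneshot
ExecStart=/bin/rm -f /bedrock/var/lock

[Install]
WantedBy=multi-user.target
EOF
}

# Write to a scratch location for review rather than installing it directly.
write_unlock_unit "${TMPDIR:-/tmp}/bedrock-unlock.service"
```

After reviewing the file, it would need to be copied into /etc/systemd/system as root and enabled with systemctl enable before it has any effect.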

asmitir commented 2 years ago

the lock is gone, so yea. reckon it works.

i will try testing what is causing it soon. thanks a lot for the support! i appreciate it.

paradigm commented 2 years ago

Happy to help, good luck!