CaseyBakey / chaosp

Customized Hybrid AOSP (CHAOSP)

Porting CHAOSP to Android 10 #5

Closed CaseyBakey closed 3 years ago

CaseyBakey commented 4 years ago

Since the other issue is getting rather large, here is another one, to summarize the missing points, the problems, and the progress on porting CHAOSP to Android 10. It could help @ubergeek77 and others to help me :p

To be clear, here is what makes CHAOSP possible:

- RattlesnakeOS (https://github.com/dan-v/rattlesnakeos-stack), and particularly its build template (https://github.com/dan-v/rattlesnakeos-stack/blob/10.0/templates/build_template.go). Starting from this project (whose goal is to build AOSP on AWS) and hacking it to build AOSP locally may not be the best approach, but it works right now. An alternative would be to move toward https://github.com/hashbang/aosp-build, for example.
- android-prepare-vendor (https://github.com/anestisb/android-prepare-vendor). This project is already used by RattlesnakeOS. Its goal is to recover the missing software pieces (some drivers and other components are not open-sourced on our Nexus/Pixel devices) from the Google factory images, so that we can boot a functional device. For each device, there is a config file that references the missing binary blobs to extract from the images and add to the RattlesnakeOS/CHAOSP build tree. These files differ between major Android versions (they are keyed by API level), so the config files need to be updated to follow device updates. For many months, Android 10/Q/API 29 was missing from these config files, delaying the migration of RattlesnakeOS, and thus CHAOSP, to Android 10. Last point: for each device and each API level, there are two different sets of files: the "naked" configuration, which is the "minimal" set of binary blobs needed for a functional device, and the "full" configuration, which is more complete and is necessary if you want to add the Google Apps. For now, only the "naked" configurations are available, which prevents us from adding Google Apps to our build.
- OpenGapps (https://opengapps.org/ and https://github.com/opengapps/aosp_build). This project is needed if you want to add Google Apps to your AOSP build. It was also lagging behind on Android 10, but it now seems to be updated, at least for the minimal packages (nano and pico). But it's not possible to use it right now since we're missing the android-prepare-vendor "full" configuration for API 29 for the different devices.
- Magisk (https://github.com/topjohnwu/Magisk) to root our device. This project is always up to date thanks to @topjohnwu. He has supported Android 10 since the various public previews of Q. So clearly not blocking here ;)

Right now, we're missing the "full" configuration from android-prepare-vendor, for API 29 (Android 10), for the different devices. We won't have Google Apps without this. The rest (Android 10, Magisk) is nearly fine. I could use the help of @typeproto187 here since he managed to create the "full" configuration for taimen on API 29.

Here are the big changes in Android 10 which are making our lives more difficult:

- dynamic partitions (https://source.android.com/devices/tech/ota/dynamic_partitions), which implies...
- ...userland fastboot (https://source.android.com/devices/bootloader/fastbootd), which implies...
- ...boot control HAL (https://source.android.com/devices/tech/ota/ab/ab_implement#bootcontrol)
- modifications (again!) to system-as-root, particularly the [boot ramdisk](https://source.android.com/devices/bootloader/system-as-root#ramdisk) and the [partition layout](https://source.android.com/devices/bootloader/system-as-root#partition-layouts-abdevices)

So, to summarize, on devices where all the new features of Android 10 were retrofitted/implemented (Pixel 3/3XL/3a/3aXL and Pixel 4/4XL), the boot image now contains a ramdisk which is used:

- to boot in recovery
- to boot in fastbootd (the disguised recovery mode, responding to fastboot and not adb)
- to boot in normal mode

During a normal boot in Android 9:

```
1. "/init" from system.img is called
2. not sure: "/init" from system.img is called a second time with another argument
3. init.rc scripts are executed and Android boots
```

During a normal boot in Android 10:

```
1. "/init" from this ramdisk (boot.img) is called
2. "/system/bin/init" from system.img is called with argument "selinux_setup"
3. "/system/bin/init" from system.img is called with argument "second_stage"
4. init.rc scripts are executed and Android boots
```

So this ramdisk is still the place to put Magisk.

Here are some experiments:

When we're adding Magisk (in the boot.img) to a factory image, and when we're flashing it (./flash-all.sh), at some point the device has to reboot in fastbootd (the recovery-style fastboot). If Magisk is not added correctly, the device can't boot in this fastbootd mode.

If we add it correctly (different paths than on Android 9), the device is able to boot in fastbootd and the flash of the factory image can continue. It seems we're also able to boot in recovery mode.

But after that, the device can't boot normally. It seems to be related to the boot control HAL (https://github.com/topjohnwu/Magisk/issues/2214).

So, by modifying this boot.img ramdisk, it seems that:

It would seem logical that, if this boot slot isn't marked as successful, it shouldn't boot into either of these modes. But the actual behavior seems to be different from what we expect...

Here are some ideas:

After digging into some code and checking @topjohnwu's solution (forcing the slot as successful with a static binary version of bootctl):

That's it for now, as I won't be able to work on CHAOSP for the next two days.

CaseyBakey commented 4 years ago

Little update: I managed to add a new function in the recovery to be able to mark the current slot as successful. In fact, I replaced the "Wipe data/factory reset" function to be on the safe side :p

Because, yes, you can be unable to boot the slot A (for example) "normally" (in boot mode), but you can still be able to boot in recovery with such an unsuccessful slot.

The build is ongoing. I'll report here whether I can toggle such a slot from recovery and boot normally afterwards.

Btw @ubergeek77, I did spot what you called the "AVB stuff before the release/signing" part. avbtool seems to be called, only for the boot partition(!), during the build_aosp part.

I need to check what this tool is really doing to determine if add_magisk is really called too late. Spoiler: it seems...

ubergeek77 commented 4 years ago

Hey great work! This sounds awesome.

But just a heads up: the engineers at Google no longer seem to have functioning brains, and boot.img is now unified in that it is responsible for booting system, rescue, AND recovery. So, if you ever get in a failing boot state for whatever reason, and the bootloader is refusing to boot because it has decided an otherwise perfectly bootable boot.img is "invalid," you will be unable to boot into recovery.

The option you added to recovery is a fantastic addition and certainly a good replacement for the data wipe option (that honestly shouldn't be there anyway), but don't think it will get you out of a sticky situation - sadly, it will not.

Now, what will definitely be interesting is whether or not the boot option you added would prevent this factory image situation if you used it before flashing.

Edit: re-read your comment; just to clarify what I said above, about your comment here:

> Because, yes, you can be unable to boot the slot A (for example) "normally" (in boot mode), but you can still be able to boot in recovery with such an unsuccessful slot.

This is only true if you try to boot recovery after flashing a factory image that "would be" considered invalid, but hasn't been marked yet. I'm 99% certain that the "trigger" for what makes the bootloader think a slot is invalid is Stage 2 of the factory image flash, where it boots into this new "userspace fastboot" to continue the flash. If you've ever reached the bootloader screen and seen the message "invalid boot slot" or "failed to boot boot.img," it's too late: the bootloader has marked both slots as unbootable, so you will be unable to even boot recovery.

So, if you tried booting into recovery before this happens, you'd probably be fine. But another good option would be to somehow always keep one of your boot slots without Magisk, but with the recovery option you added. That way your risk of a brick is basically zero, and you can always recover from a failed boot state.

CaseyBakey commented 4 years ago

To be clear, here is what has been tested (keep in mind I'm still adding Magisk my way, and maybe too late):

So it may just be that add_magisk is called too late, and your solution is now the only viable way to achieve that. Will try this afternoon.

But, if I remember correctly, in your tests, you didn't manage to finish the flash part in fastbootd when flashing with Magisk included. But I do. So maybe it's something in between :p

CaseyBakey commented 4 years ago

So, the error when the device doesn't boot and reboots to the bootloader is "reboot bootloader"...

Looking at the fastboot variables using `fastboot getvar all`, I can see that this slot (A) is NOT marked as successful.

So I booted into recovery (and managed to, despite this slot NOT being marked as successful), and used my new recovery function.

I then restarted to the bootloader to check the variables again. This time slot A is marked as successful (so my code worked! :p).

When I then try to boot normally, same error, I'm back to the bootloader with "reboot bootloader".

And the slot is still successful oO
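The slot check described above can be scripted. Here is a minimal sketch, assuming the bootloader reports per-slot state as lines of the form `(bootloader) slot-successful:a:yes`; the exact output format varies between devices, and both the `SAMPLE` text and the `parse_slot_state` helper are hypothetical illustrations, not captured output or part of any tool.

```python
import re

# Assumed line shape: "(bootloader) slot-successful:a:yes".
# Real bootloaders may format this differently.
SLOT_LINE = re.compile(
    r"^(?:\(bootloader\)\s*)?"
    r"(slot-(?:successful|unbootable|retry-count)):([ab]):\s*(\S+)\s*$"
)


def parse_slot_state(getvar_output: str) -> dict:
    """Collect the per-slot variables from `fastboot getvar all` text."""
    slots = {}
    for line in getvar_output.splitlines():
        match = SLOT_LINE.match(line.strip())
        if match:
            key, slot, value = match.groups()
            slots.setdefault(slot, {})[key] = value
    return slots


# Hypothetical sample, written by hand for illustration only.
SAMPLE = """\
(bootloader) slot-successful:a:no
(bootloader) slot-unbootable:a:no
(bootloader) slot-retry-count:a:3
(bootloader) current-slot:a
"""

if __name__ == "__main__":
    for slot, state in parse_slot_state(SAMPLE).items():
        print(slot, state)
```

To feed it real data, the text would have to be captured from `fastboot getvar all` first; fastboot usually writes this to stderr, so redirecting with `2>&1` is probably needed.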

CaseyBakey commented 4 years ago

Just did my first build with YOUR (@ubergeek77) way of adding Magisk.

And I encounter the same errors as you:

After that, tried:

So I'm getting same results as you.

We can conclude that with Android 10, Magisk HAS to be added earlier than before because the AVB footer is computed and appended BEFORE the "release" part.

I still don't get why it's not possible to boot into fastbootd when Magisk is added YOUR way, while it succeeds when added MY way.

I'm gonna diff the boot.img images to try to understand, since there should be nowhere else to find a difference between YOUR way and MY way.

CaseyBakey commented 4 years ago

Ok, I diffed the two boot images and:

So I corrected my script to do it (again!) my way, and here are the results:

So it's a full win! And we can still add Magisk after "build_aosp" and before "release"! The AVB footer must be recomputed somehow during "release".

And the last test:

So, clearly, the data wipe occurring by default while flashing factory images is causing this dead-lock/non-bootable situation.

Btw, even if the slot is marked unbootable and not successful, it's possible to mark it bootable again by using "fastboot set_active a" for example. The slot will then be marked as bootable, and thus it's possible to go to the recovery and trigger my code to mark A successful. I can then reboot to Android! => WIN again ;)

Btw, I committed all of these modifications to the 10-testing branch if you want to take a look.

Next step:

CaseyBakey commented 4 years ago

Hi there, I tried a few different things to be able to rm -rf /data/adb/ from recovery, but the /data/ partition doesn't seem to be mounted by default, even though calling the load_volume_table function from my custom code shows /dev/block/bootdevice/by-name/userdata as mounted.

TWRP is able to mount /data/ and wipe/format/etc. but since TWRP is a whole different project, I can't base the whole understanding of my problem on it. Anyway, here is a TWRP commit that could help, specifying some mount flags.

Since the headers/code needed to umount already seem to be included in the recovery codebase, as seen here, I should be able to issue a "raw" mount command with these flags.

I won't spend a lot more time on this idea/enhancement. If I don't manage to do it in the next few days, I'll open a feature request on Magisk for a feature that auto-removes /data/adb/ when there are too many failed boots. It shouldn't be too hard to implement.

ubergeek77 commented 4 years ago

Thanks a lot for your findings @CaseyBakey. This is all very helpful to me. I'm just coming back to ROM development, so I'll definitely be using your findings for my builds.

There is one important thing I'd like to point out, however. Regarding:

> Btw, even if the slot is marked unbootable and not successful, it's possible to mark it bootable again by using "fastboot set_active a" for example. The slot will then be marked as bootable, and thus it's possible to go to the recovery and trigger my code to mark A successful. I can then reboot to Android! => WIN again ;)

This is true, but if the bootloader is locked, set_active cannot be used, meaning this isn't a recovery method in this instance. I don't know what your goals for this are, but my goal is to be able to have Magisk along with a locked bootloader.

As we've learned, a locked bootloader is certainly possible. But, as we've also learned, Google's engineers made the braindead decision to tie recovery and rescue mode to the boot.img responsible for booting the system. Thus, if the bootloader is ever in a state where it rejects boot.img for whatever reason - Magisk or otherwise - recovery and rescue are both inaccessible, meaning it's impossible to flash a new image or otherwise recover from this situation, causing a brick. I know I've been quite vocal against Google in this regard, but I'm frustrated; I don't know how they thought this was a good idea. Yes, it adds difficulty for us in this situation, but it's not unreasonable to suggest a normal, non-power-user with a locked bootloader could encounter this problem. In the past, a corrupted boot image could be fixed from recovery, but now even that option is unavailable for the average user. I don't even see what point recovery serves now in Android 10 due to how it shares a boot image, and how it is unusable in the same situations that a user would need to use it in the first place.

In any case, your patches and findings will at the very least make this easier to actively avoid, even if caution is necessary.

As far as a solution goes, apart from just adding Magisk (which we've now achieved), I think a good "brick-prevention" measure would be to always make sure the inactive boot slot doesn't have Magisk. If I'm understanding the bootloader behavior correctly, this means the bootloader would never reject that "clean" slot. So, even if the system was factory reset (by you or by an unauthorized party in possession of your device), and the active boot slot was rendered unbootable, the user would still be able to boot into recovery in order to mark that boot slot active, or otherwise flash an image in order to recover.

I do wonder, though, if including your "mark boot slot active" patch adds a theoretical attack vector to the device - would that patch allow a malicious actor to forcibly write their own boot.img, which doesn't match the system verity signature, and then force the system to bypass the verity check using your patch and boot it anyway? I think that's worth some consideration.

Regardless, the best solution for everyone would be to get TopJohnWu to reconsider his apathetic stance to the boot control HAL situation, and recognize it as a bug to be fixed, as it certainly is one.

CaseyBakey commented 4 years ago

> This is true, but if the bootloader is locked, set_active cannot be used, meaning this isn't a recovery method in this instance

Yep, I'm well aware of that. But since we're talking about flashing a factory image, the bootloader is already unlocked. My findings just allow you to "recover" in case of flashing a Magisk-patched factory image (if you forgot to take out the "-w" argument in the last line of the flash-all.sh, else no need for my tricks).

> I don't know what your goals for this are, but my goal is to be able to have Magisk along with a locked bootloader.

It's mine also ;)

> As we've learned, a locked bootloader is certainly possible. But, as we've also learned, Google's engineers made the braindead decision to tie recovery and rescue mode to the boot.img responsible for booting the system.

This dates back to Android 9, with A/B devices implementing system-as-root. For such devices, there was no recovery partition; instead, the recovery ramdisk was embedded in the boot image. With Android 10, they just added more functions to this ramdisk (normal boot and the userland fastboot). I still don't get why A-only system-as-root devices have a real separate recovery partition, while A/B devices don't but still manage to "waste" way more space by having 2 copies of BOOT/SYSTEM/VENDOR/etc. Was a separate recovery partition a waste they couldn't afford? Saving such a small amount of space makes no sense to me.

> As far as a solution goes, apart from just adding Magisk (which we've now achieved), I think a good "brick-prevention" measure would be to always make sure the inactive boot slot doesn't have Magisk.

For me, when we're playing with a custom ROM and a locked bootloader, the only "brick-prevention" measure is to keep the "Allow OEM unlock" toggle ON in the developer settings. If you're bricked, you just go to the bootloader and issue `fastboot flashing unlock`, which will wipe /data but will also allow you to reflash something else.

> I do wonder, though, if including your "mark boot slot active" patch adds a theoretical attack vector to the device - would that patch allow a malicious actor to forcibly write their own boot.img, which doesn't match the system verity signature, and then force the system to bypass the verity check using your patch and boot it anyway? I think that's worth some consideration.

My patch marks the boot slot successful, not active. And I don't see any theoretical attack vector with this. We're just "forcing" the bootloader to reconsider a slot previously marked as unsuccessful. It doesn't change a thing about how AVB/verity works. If the slot marked as active/successful is booted but is not correctly signed (attacker stuff, for example), it just won't boot.

> Regardless, the best solution for everyone would be to get TopJohnWu to reconsider his apathetic stance to the boot control HAL situation, and recognize it as a bug to be fixed, as it certainly is one.

From my understanding, it's absolutely not a Magisk bug, but a boot control HAL "feature" which isn't totally understood yet, and that pisses us off. His solution is a nice workaround, and in fact I based my recovery patch on it.

Final words:

The wipe is triggered by the "-w" argument, which is included by default in ./flash-all.sh. Just take it out.
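Taking out the flag can be done by hand, but here is a small sketch of doing it programmatically. It assumes the script's last line has the usual `fastboot -w update <image zip>` shape; the `strip_wipe_flag` helper and the file names in the usage example are hypothetical.

```python
def strip_wipe_flag(script_text: str) -> str:
    """Return the script text with '-w' dropped from 'fastboot ... update' lines.

    Only lines whose first word is 'fastboot' and that contain both 'update'
    and '-w' are touched. A touched line is re-joined with single spaces,
    so its original spacing is not preserved.
    """
    result = []
    for line in script_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == "fastboot" and "update" in tokens and "-w" in tokens:
            line = " ".join(token for token in tokens if token != "-w")
        result.append(line)
    trailing_newline = "\n" if script_text.endswith("\n") else ""
    return "\n".join(result) + trailing_newline


if __name__ == "__main__":
    # Hypothetical script content, for illustration only.
    example = (
        "#!/bin/sh\n"
        "fastboot flash bootloader bootloader-example.img\n"
        "fastboot -w update image-example.zip\n"
    )
    print(strip_wipe_flag(example))
```

Without `-w`, the existing /data is kept across the flash, which is what avoids the non-bootable situation described in this thread.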

alaviss commented 4 years ago

> From my understanding, it's absolutely not a Magisk bug, but a boot control HAL "feature" which isn't totally understood yet, and that pisses us off.

It's a Magisk "bug" from what I can tell. Reading the Magisk v20.4 source shows that magiskd will trigger a reboot if /data/adb is not available at boot (which it obviously won't be on first boot). This reboot happens too early, which eats up the "boot retries count" until the boot slot is considered "unbootable". I verified the theory by making /system/bin/reboot reject the call if it comes from magiskd with this patch, and afterwards you can see this in the boot log:

09-21 03:01:45.823  1081  1081 I reboot  : Reboot requested by PID 782 (magiskd)
09-21 03:01:45.823  1081  1081 I reboot  : Reboot request denied

The system boots normally afterwards. Magisk will then be functional after the initial setup from Magisk Manager.

The only part that bothers me is how marking the slot successful allows Magisk to work after just one reboot.

But all of this info will probably be obsolete, given that this commit https://github.com/topjohnwu/Magisk/commit/fc1844b4dff14d589626dd770c64d5da892e1e0c rewrote the code to no longer trigger a reboot. I'll try to see if the commit fixes the issue by making a build with Magisk Canary soon.
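The retry-count theory can be illustrated with a toy model. This is only a sketch of the accounting as described in this comment, not the actual bootloader or boot control HAL logic: the `Slot` class, the `attempt_boot` function and the starting budget of 3 retries are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    retries_left: int = 3   # arbitrary budget, not a real device value
    successful: bool = False
    unbootable: bool = False


def attempt_boot(slot: Slot, reboots_early: bool) -> bool:
    """Simulate one boot attempt; return True if the boot completed."""
    if slot.unbootable:
        return False
    if not slot.successful:
        # A slot not yet marked successful spends one retry per attempt.
        if slot.retries_left == 0:
            slot.unbootable = True
            return False
        slot.retries_left -= 1
    if reboots_early:
        # The reboot happens before the slot could be marked successful.
        return False
    slot.successful = True
    return True


if __name__ == "__main__":
    slot = Slot()
    for attempt in range(1, 5):
        booted = attempt_boot(slot, reboots_early=True)
        print(attempt, booted, slot)
```

In this model, a daemon that reboots early on every attempt drains the budget and the slot ends up unbootable, while a single completed boot marks the slot successful and stops the countdown.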

alaviss commented 4 years ago

> I'll try to see if the commit fixes the issue by making a build with Magisk Canary soon.

Yep, Magisk Canary no longer exhibits this issue.

alaviss commented 4 years ago

It appears to me that dm-verity will be disabled for system on legacy SAR devices (<= Pixel 2 XL) when Magisk is in use.

This is due to the fact that dm-verity is configured for system via the dm= parameter on boot (avbtool calculate_kernel_cmdline can be used to see what this will be configured to) only if an initramfs is not used (see this and this).

Furthermore Magisk will attempt to mount system_<slot> directly, thus bypassing the dm-verity generated device (if any).

I've tried to make the kernel configure the devices passed via dm= at boot; however, for some reason this breaks Magisk and will cause a device restart before any logging facility is available :( I can still boot to recovery, so I'm sure that the kernel is not broken.

For newer devices using 2SI, dm-verity should no longer be a concern as first stage init will set it up for Magisk.

alaviss commented 4 years ago

> breaks Magisk and will cause a device restart before any logging facility is available

This is a direct consequence of Magisk trying to mount system_<slot>, as it's not possible to do so once the verity device is configured. I made a patch to solve this by making Magisk consider vroot as a possible mount target.

CaseyBakey commented 4 years ago

Nice spot for the Magisk reboot that led to an unbootable slot! That'll be automatically fixed for CHAOSP when @topjohnwu releases Magisk v20.5, I guess ;-)

That also explains why taking out the "-w"/wipe data from the flash-all.sh script also avoids making the slot unbootable (no wipe -> /data/adb/ is thus there on first boot after flash).

For the legacy SAR devices, I'm sorry but I can't help: I only have a Pixel 3 as a daily driver, and I'm testing stuff on a Pixel 3a before updating my Pixel 3 to Android 10 (lacking 1 year of updates...).

Right now I can:

alaviss commented 4 years ago

> add opengapps to the ROM but I'm getting a bootloop because of this bug (opengapps/aosp_build#269) or maybe because of an android-prepare-vendor bug, I don't know yet.

My guess is that euicc and euiccpixel are installed in (/system)/product, but the whitelist is in /system. Moving the whitelist to (/system)/product/etc/permissions should solve the issue.
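The mismatch suspected here can be expressed as a simple path check. This is only a sketch of the guess above, assuming an app's whitelist must live on the same partition as the app and treating /system/product and /product as the same partition; the helper names and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def partition_of(path: str) -> str:
    """Return 'system' or 'product' for a path, folding /system/product into 'product'."""
    parts = PurePosixPath(path).parts  # e.g. ('/', 'system', 'product', ...)
    if len(parts) > 2 and parts[1] == "system" and parts[2] == "product":
        return "product"
    if len(parts) > 1 and parts[1] in ("system", "product"):
        return parts[1]
    raise ValueError(f"unrecognised partition for {path}")


def whitelist_matches_app(apk_path: str, whitelist_path: str) -> bool:
    """True when the whitelist sits on the same partition as the app."""
    return partition_of(apk_path) == partition_of(whitelist_path)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    apk = "/system/product/priv-app/EuiccExample/EuiccExample.apk"
    for whitelist in (
        "/system/etc/permissions/privapp-permissions-example.xml",
        "/system/product/etc/permissions/privapp-permissions-example.xml",
    ):
        print(whitelist, whitelist_matches_app(apk, whitelist))
```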

CaseyBakey commented 4 years ago

Said this way, it sounds quite logical ah ah. I'm gonna try it tomorrow and I'll tell you!

CaseyBakey commented 4 years ago

@alaviss you were right! Thanks a lot for this! I had been stuck on this for a long time! Look at the simple fix it needed: https://github.com/AOSPAlliance/android-prepare-vendor/pull/39

CaseyBakey commented 3 years ago

Closing this since AOSP 10 was OK, but now AOSP 11 is also OK ;)