TF-RMM / tf-rmm

Reference implementation of Arm-CCA RMM specification
BSD 3-Clause "New" or "Revised" License

Live Migration: Instruction_Abort when executing restored VM #27

Open bows7ring opened 3 months ago

bows7ring commented 3 months ago

Recently, while developing Realm VM live migration, we encountered an instruction_abort issue.

The specific scenario is as follows...

When importing the Realm VM on the destination platform:

  1. We first used smc_rtt_init_ripas to set the entire RAM area of the Realm VM as unassigned RAM.
  2. Then, we established the IPA mapping using the smc_data_create interface. (Currently, our temporary solution does not consider efficiency, and we traverse all gfns in the kvm memslot.)
    • delegate the dst_granule.
    • smc_data_create.
    • if smc_data_create fails with RMI_ERROR_RTT, we create the missing RTT and retry; a rough sketch of this loop follows the list below. (I've omitted the parts related to Qemu, describing only the operations in RMM here.)
  3. We implemented a register import interface to load REC information from QemuFile.
  4. After the RAM and registers are imported, upon entering smc_rec_enter, the vCPU executes the first instruction pointed to by the PC, which results in an instruction_abort.
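
For reference, a rough sketch of the delegate / data_create / retry loop from item 2 is below. The wrapper names (rmi_granule_delegate, rmi_data_create, rmi_rtt_create, alloc_free_granule) and the error handling are hypothetical stand-ins for the actual RMI invocations used in this port, not the real code:

```c
/* Hypothetical host-side helpers; each wraps the corresponding RMI SMC. */
extern int rmi_granule_delegate(unsigned long pa);
extern int rmi_rtt_create(unsigned long rd, unsigned long rtt_pa,
                          unsigned long ipa, int level);
extern int rmi_data_create(unsigned long rd, unsigned long data_pa,
                           unsigned long ipa, unsigned long src_pa);
extern unsigned long alloc_free_granule(void);   /* free, undelegated NS granule */

#define RMI_SUCCESS    0
#define RMI_ERROR_RTT  4   /* placeholder value, for illustration only */

/* Restore one exported guest page to its original IPA on the destination. */
static int restore_page(unsigned long rd, unsigned long ipa,
                        unsigned long src_pa)
{
    unsigned long data_pa = alloc_free_granule();
    int ret;

    /* 1. Move the destination granule into the Realm PAS. */
    ret = rmi_granule_delegate(data_pa);
    if (ret != RMI_SUCCESS)
        return ret;

    /*
     * 2. Create the DATA granule at the original IPA. If the RTT for that
     *    IPA does not exist yet, delegate a granule for the table, create
     *    the RTT, and retry.
     */
    while ((ret = rmi_data_create(rd, data_pa, ipa, src_pa)) ==
           RMI_ERROR_RTT) {
        unsigned long rtt_pa = alloc_free_granule();
        int missing_level = 3;  /* in practice, taken from the error return */

        ret = rmi_granule_delegate(rtt_pa);
        if (ret != RMI_SUCCESS)
            return ret;
        ret = rmi_rtt_create(rd, rtt_pa, ipa, missing_level);
        if (ret != RMI_SUCCESS)
            return ret;
    }
    return ret;
}
```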

Environment:

● Simulation platform is FVP, ShrinkWrap cca-3-world.
● All components (QemuVMM, KVM and RMM) in this cca-3-world environment have new code added, but we have kept the original interfaces unchanged.
● This bug might be impractical to reproduce, so I'll try my best to describe it.

ShrinkWrap Log:

[image: ShrinkWrap log output]

Discussions:

The RMM spec describes the cause of instruction abort as follows: [image: excerpt from the RMM specification]

However, for valid S2TTEs the RIPAS and HIPAS states are not checked; see this explanation from issue #21:

We re-use some bits in pte (namely bits 5 and 6) for storing the RIPAS state when the pte is invalid. When the pte is valid, TF-RMM assumes that RIPAS is always RIPAS_RAM and hence we do not refer to these bits for a valid pte.

By the way, if we don't populate the Realm VM's memory, only load the REC registers, and start running, the Realm VM will enter an endless loop because the RMM chooses to handle the instruction_abort itself. However, with the memory populated, the RMM forwards the instruction_abort to KVM, and the system panics. Therefore, I guess the memory import is at least partially correct...

The logic for handling inst_abort in the relevant code: [image: RMM instruction-abort handling code]
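
Putting the spec excerpt and the quote from issue #21 together, the routing decision presumably looks something like the sketch below. All names here (s2tte_is_valid, ripas_of, the enums) are hypothetical stand-ins, not the actual TF-RMM code shown in the screenshot; it only restates the behaviour described in this thread:

```c
#include <stdbool.h>

/* Hypothetical types/helpers standing in for the real TF-RMM ones. */
enum ripas { RIPAS_EMPTY, RIPAS_RAM };
enum abort_route { INJECT_TO_REALM, REPORT_TO_HOST };

extern bool s2tte_is_valid(unsigned long s2tte);
extern enum ripas ripas_of(unsigned long s2tte);

/*
 * Routing of an instruction abort on a Protected IPA, as described above:
 * a fetch from a RIPAS_EMPTY page is injected back into the Realm, while a
 * fetch from an unassigned RIPAS_RAM page is reported to the host.
 */
static enum abort_route route_inst_abort(unsigned long s2tte)
{
    if (s2tte_is_valid(s2tte)) {
        /*
         * A valid S2TTE is assumed to be RIPAS_RAM (issue #21), so RIPAS
         * and HIPAS are not consulted here; a fault on a valid mapping
         * (e.g. the GPF in this issue) is reported to the host.
         */
        return REPORT_TO_HOST;
    }

    if (ripas_of(s2tte) == RIPAS_EMPTY)
        return INJECT_TO_REALM;   /* the endless-loop case without populate */

    return REPORT_TO_HOST;        /* RIPAS_RAM but not yet assigned */
}
```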

Conclusion:

We are not concerned about privacy and performance at this stage; we only wish to verify whether the VM can restart successfully on the dst platform after populating all the plaintext-exported guest pages back to their original IPAs.

Our questions can be summarized into two:

  1. Since RIPAS is not checked, why and how does the CCA hardware trigger the instruction_abort?
  2. Did we mess up anything in the import of Realm pages and REC registers?

We sincerely appreciate your ongoing assistance. If you need more information or have any suggestions, please let me know.

suzukikp commented 3 months ago

Hi,

Have you "restored" all of the RAM regions ? With RMM-1.0, we cannot "load any content into Realm memory" after the ACTIVATE step. Since you are restoring a Realm VM from a previous state, you need to make sure all of the RAM region is POPULATED (not just INIT_RIPAS) before ACTIVATE. Future RMM spec might add support for "Paging" which could let the VMM load "previously" captured content into an ACTIVE Realm, with some guarantees from RMM on the contents.

Looking at the logs:

The ESR=0x820000a5 => EC => Instruction Abort from a lower Exception level, IFSC="Granule Protection Fault on translation table walk or hardware update of translation table, level 1." ?

This should never have happened, as the RMM must ensure that the "Granule" mapped in Protected space is in the Realm PAS.
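
For reference, a minimal standalone decoder for the ESR_EL2 fields discussed in this thread; the bit positions follow the Arm ARM, and the value decoded in main() is the one from the log above:

```c
#include <stdint.h>
#include <stdio.h>

/* Decode the ESR_EL2 fields relevant to this thread (Arm ARM bit layout). */
static void decode_esr(uint64_t esr)
{
    unsigned ec    = (esr >> 26) & 0x3f;  /* Exception Class               */
    unsigned il    = (esr >> 25) & 0x1;   /* Instruction Length            */
    unsigned s1ptw = (esr >> 7)  & 0x1;   /* Fault on stage 1 table walk   */
    unsigned fsc   = esr & 0x3f;          /* Instruction Fault Status Code */

    printf("ESR=0x%llx EC=0x%02x IL=%u S1PTW=%u FSC=0x%02x\n",
           (unsigned long long)esr, ec, il, s1ptw, fsc);
}

int main(void)
{
    /* EC=0x20 (inst. abort, lower EL), S1PTW=1, FSC=0x25 (GPF on TT walk, level 1) */
    decode_esr(0x820000a5);
    return 0;
}
```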

djordje-kovacevic commented 3 months ago

By the way, if we don't populate the Realm VM's memory, only load the REC registers, and start running, the Realm VM will enter an endless loop because the RMM chooses to handle the instruction_abort itself.

I assume that this is because DRAM region is left with ripas=empty state. Instruction fetch from ripas_empty page is (by the RMM spec.) reported to the Realm (the exception is injected), not to the Host. The Realm's exception handler also resides in ripas_empty DRAM so the Realm gets into the endless loop of generating instruction aborts from the first instruction in its exception handler.

djordje-kovacevic commented 3 months ago

with the memory populated, RMM forwards the instruction_abort to KVM, and the system panics. Therefore, I guess the memory import is at least partially correct...

I agree with Suzuki's analysis, but it is quite difficult to figure out why the abort has happened, so I am thinking about how to simplify the case...

As the reported abort is "Instruction Abort from a lower Exception level, IFSC = Granule Protection Fault on translation table walk or hardware update of translation table, level 1", can you run (and migrate) a Realm that runs with the stage 1 MMU disabled? A simple while(forever); ... will do.

(1) If the test passes OK, we'll know definitely that it is something about the stage 1 MMU (likely an EL1 sys reg misconfiguration, so e.g. you may focus your search on the REC migration).

(2) If it still fails, there is something more serious, but it would be easier to figure it out on a simpler test.

bows7ring commented 3 months ago

@suzukikp ,

Thanks for your help!

Have you "restored" all of the RAM regions ? With RMM-1.0, we cannot "load any content into Realm memory" after the ACTIVATE step. Since you are restoring a Realm VM from a previous state, you need to make sure all of the RAM region is POPULATED (not just INIT_RIPAS) before ACTIVATE. Future RMM spec might add support for "Paging" which could let the VMM load "previously" captured content into an ACTIVE Realm, with some guarantees from RMM on the contents.

RMM does have a REALM_STATUS check, and I've modified the Realm's INIT and ACTIVATE code in Qemu to make it work with the migration code.

```c
static void rme_vm_state_change(void *opaque, bool running, RunState state)
{
    // .....

    /*
     * When booting an RME VM from a QemuFile snapshot,
     * it goes like this:
     */

    // Init RIPAS for the entire RAM region
    kvm_vm_ioctl(kvm_state, KVM_CAP_ARM_RME_INIT_IPA_REALM);

    // `cgs_migration()` goes like this:
    kvm_vm_ioctl(kvm_state, CCA_MIGRATION_LOAD_RAM);
    // populate every guest page in the kvm memslot using `smc_data_create`

    // load REC registers
    kvm_vm_ioctl(kvm_state, CCA_MIGRATION_REC_LOAD, &input);

    // set Realm status to `ACTIVE`
    kvm_vm_ioctl(kvm_state, KVM_CAP_ARM_RME_ACTIVATE_REALM);

    // .....
}
```

Over the past two hours, I rechecked the populate problem you mentioned, and I am sure that I have populated all the pages at the correct IPAs.

I did an experiment in RMM: after data_create completed at a certain IPA, I did a software page walk and logged the RAM contents at the PA backing that IPA, and the result was correct:

For example, the Guest Kernel Image is mapped to 0x40200000, which will not be populated during the normal VM "restore" process. The content on this address is the same before and after the migration: 4d 5a 40 fa 27 f4 6b 14 00 ... .

Therefore, I think the EXPORT and IMPORT (init and populate) of RAM pages should be correct.


The ESR=0x820000a5 => EC => Instruction Abort from a lower Exception level, IFSC="Granule Protection Fault on translation table walk or hardware update of translation table, level 1." ? This should never have happened, as the RMM must ensure that the "Granule" mapped in Protected space is in the Realm PAS..

As for this, I don't know how to do further testing at the moment.

bows7ring commented 3 months ago

@djordje-kovacevic ,

Thanks! Your explanation of the "abort-injection loop" makes sense; the pages' RIPAS would be empty without RIPAS_INIT and data_create.


I'll check REC's sys-regs as you suggested.

By the way, these Realm VMs were being migrated in the early stages of kernel boot (simply because I don't have the patience to wait for it to finish); I don't know if this has any impact.

can you run (and migrate) the Realm that runs with stage 1 MMU disabled? A simple while(forever); ... will do;

I will redo some experiments according to the while(true); condition you mentioned.

bows7ring commented 3 months ago

@djordje-kovacevic ,

Thanks for your advice. I migrated a single-core VM running a while(true) loop, but nothing has changed.

Can we now confirm that the sys-regs are correct?

(2) If it still fails, there is something more serious, but it would be easier to figure it out on a simpler test.

If indeed we have encountered this situation, how should we proceed with debugging?

djordje-kovacevic commented 3 months ago

Please can you confirm: 1) Did you run the Realm with the stage 1 MMU disabled? With "while (forever)", I meant that the whole Realm does the absolute minimum, e.g.:

start: B start

2) If so, what exception is reported in the log?

bows7ring commented 3 months ago

Okay, sorry, I didn't think it through before and misunderstood the meaning of "disabling the MMU in a while(true) loop".

I will modify the guest kernel booting assembly and retry.

bows7ring commented 3 months ago

I think it worked.

I modified the guest kernel's boot assembly at /linux-cca/arch/arm64/kernel/head.S so the kernel enters a loop at primary_entry before enabling the MMU.

[image: modified head.S with a loop at primary_entry]

Now, when the restored VM crashes, the new ESR is 82000001. According to bits[5:0], the FSC is "Address size fault, level 1", which points to an EL1 sys reg misconfiguration.

Next, I will start checking the initialization of the sysregs. Besides checking if the values of the EL1 registers are the same before and after migration, do you have any other suggestions?
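
One way to do that comparison systematically is to dump the REC's EL1 context on the source and destination and diff the two dumps field by field. The register set and helper below are assumptions for illustration (adjust them to whatever your REC export format actually carries); the names are hypothetical:

```c
#include <stdint.h>
#include <stdio.h>

/* EL1 context typically carried across a REC save/restore; adjust as needed. */
struct el1_sysregs {
    uint64_t sctlr_el1, tcr_el1, ttbr0_el1, ttbr1_el1;
    uint64_t mair_el1, amair_el1, vbar_el1, cpacr_el1;
    uint64_t spsr_el1, elr_el1, sp_el1, sp_el0;
    uint64_t tpidr_el0, tpidrro_el0, tpidr_el1, contextidr_el1;
    uint64_t esr_el1, far_el1, par_el1;
};

static const char *const names[] = {
    "sctlr_el1", "tcr_el1", "ttbr0_el1", "ttbr1_el1",
    "mair_el1", "amair_el1", "vbar_el1", "cpacr_el1",
    "spsr_el1", "elr_el1", "sp_el1", "sp_el0",
    "tpidr_el0", "tpidrro_el0", "tpidr_el1", "contextidr_el1",
    "esr_el1", "far_el1", "par_el1",
};

/* Print every register that differs between the exported and imported REC. */
static void diff_el1(const struct el1_sysregs *src, const struct el1_sysregs *dst)
{
    const uint64_t *a = (const uint64_t *)src;
    const uint64_t *b = (const uint64_t *)dst;
    size_t n = sizeof(*src) / sizeof(uint64_t);

    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            printf("%-16s src=0x%016llx dst=0x%016llx\n", names[i],
                   (unsigned long long)a[i], (unsigned long long)b[i]);
}
```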

bows7ring commented 3 months ago

Update:

Now, in most cases, the ESR is 82000025.

However, in some abnormal cases, the ESR might be 82000001, 82000005, or 82000009, and I am unable to reproduce these results (maybe repeated tests and shutting down during panics could corrupt the rootfs? Could that be an explanation?).

82000025 is reasonable: compared to 820000a5 when the stage-1 MMU is enabled, bit[7] has changed from 1 to 0.

  0b0  Fault not on a stage 2 translation for a stage 1 translation table walk.
  0b1  Fault on the stage 2 translation of an access for a stage 1 translation table walk.

So, according to 82000025 we can rule out the stage-1 fault cases, and the abort type is still a GPF.

bows7ring commented 2 months ago

Update:

We found that the number of host cores may have an impact on the migration result.

The Realm VM can be migrated successfully when host smp=1 and guest smp=1 (all of the migrated Realm VMs are single-core here).

However, if we use a host QEMU with 2 cores, it encounters a GPF with esr=820000a5, just like on FVP (the previous tests were using ShrinkWrap's default cca-3world.yaml, which also has 2 CPUs).

[image: log showing the GPF with a 2-core host]

  1. Why does a multi-core host affect the guest migration? Or, more specifically, will "memory delegation" run differently on a multi-core host QEMU?
  2. I didn't think about tlbi of the GPT, because I thought it is done in TF-A during RMI_GRANULE_DELEGATE.
  3. We still have a bug with the Realm vtimer; I'll describe it in the next issue.
suzukikp commented 2 months ago

The ESR indicates the following:

EC = Inst. Abort. IFSC = 0x25 => 0b100101 - Granule Protection Fault on translation table walk or hardware update of translation table, level 1.

S1PTW=1 => Fault on S1 page table walk.

Have you made sure that the "RAM" was restored properly, without any errors?

Are you able to provide more information / collect RMM logs to clearly pinpoint what the race condition looks like? Without proper logs, it is hard to predict what has gone wrong.

Given your case shows a GPF and a valid HPFAR=(41dd70) => IPA=0x41dd70000, are you able to collect the relevant calls that dealt with that IPA (DATA_LOAD = PA for the IPA, DELEGATE calls for the PA, and also any DATA_DESTROY calls that could have been made)?
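
As a sketch of what that collection could look like, the hooks below log every call that touches the faulting IPA (or, once it is known, the PA backing it). rmm_log and the hook points are hypothetical; the idea is simply to call them from the corresponding RMI handlers:

```c
#include <stdint.h>

/* Faulting IPA from HPFAR; fill in WATCH_PA once a first run reveals it. */
#define WATCH_IPA   0x41dd70000ULL
#define WATCH_PA    0x0ULL            /* unknown yet: 0 disables the PA watch */
#define GRANULE(x)  ((x) & ~0xfffULL)

extern void rmm_log(const char *fmt, ...);   /* assumed logging primitive */

/* Call these from the respective RMI handlers (hypothetical hook points). */
static void trace_data_create(uint64_t ipa, uint64_t data_pa)
{
    if (GRANULE(ipa) == GRANULE(WATCH_IPA))
        rmm_log("DATA_CREATE ipa=0x%llx pa=0x%llx\n", ipa, data_pa);
}

static void trace_data_destroy(uint64_t ipa, uint64_t data_pa)
{
    if (GRANULE(ipa) == GRANULE(WATCH_IPA))
        rmm_log("DATA_DESTROY ipa=0x%llx pa=0x%llx\n", ipa, data_pa);
}

static void trace_delegate(const char *op, uint64_t pa)
{
    if (WATCH_PA && GRANULE(pa) == GRANULE(WATCH_PA))
        rmm_log("%s pa=0x%llx\n", op, pa);
}
```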

suzukikp commented 2 months ago

If your emulation platform supports "trace" it would be helpful to collect trace information about the TLB operations too.

soby-mathew commented 2 months ago

One reason SMP may affect the behaviour of software is that this is when the effects of caches, and of incoherency between the contents of memory and the caches, start to become more pronounced. Is the FAR value a valid address for the Realm VM?