ARM-software / tf-issues

Issue tracking for the ARM Trusted Firmware project

PSCI coherency with cache off #605

Open glneo opened 6 years ago

glneo commented 6 years ago

Hello all,

This issue is a continuation of my pull request here: https://github.com/ARM-software/arm-trusted-firmware/pull/1451 and so is also related to this issue: https://github.com/ARM-software/tf-issues/issues/451

The pull request was closed until I could root cause the locking issue I was having. I think I may now know the cause.

When a PSCI cpu_on call is made we take a lock: https://github.com/ARM-software/arm-trusted-firmware/blob/master/lib/psci/psci_on.c#L63. Next, on the starting core's boot path we execute this line with caches still off: https://github.com/ARM-software/arm-trusted-firmware/blob/master/lib/psci/psci_common.c#L291. The data .parent_node is on the same cache-line as the lock. As these accesses have mismatched shareability attributes, according to the ARMv8 ARM (latest version, C.a) we need to perform the following:

• Each PE that accesses the Location with a cacheable attribute performs a clean and invalidate of the Location before and after accessing that Location.
• A DMB barrier with scope that covers the full shareability of the accesses is placed between any accesses to the same memory Location that use different attributes.

We do neither of these from either PE, and so coherency is not guaranteed.
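
For reference, the data layout in question looks roughly like this (a sketch only; field names follow TF-A's psci_private.h but may not be exact):

```c
/*
 * Rough sketch of the structure in question (lib/psci/psci_private.h);
 * field names are approximate and the real definition may differ slightly.
 */
typedef struct cpu_pwr_domain_node {
	u_register_t mpidr;

	/* Read by the starting core with its caches still OFF (psci_common.c) */
	unsigned int parent_node;

	/* Taken by the calling core with its caches ON (psci_on.c) */
	spinlock_t cpu_lock;
} cpu_pd_node_t;

/*
 * With a typical 64-byte cache line, parent_node and cpu_lock fall into the
 * same cache-line granule, so the cache-off read and the cached lock access
 * are mismatched-attribute accesses to the same Location in the ARM ARM's
 * terms.
 */
```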

Solutions include:

Andrew

soby-mathew commented 6 years ago

Hi Andrew, The cpu_pwr_domain_node is only written to when all the CPUs are in the same coherency domain and have the same memory attributes. As you have pointed out, the starting core's boot path reads the parent node from the same data structure before entering the coherency domain. Since it is a read, it is unlikely that this will result in incoherent data.

Now to the problem of the spin_lock being accessed by a running PE during this time. This is a different location from parent_node, but due to cache-line granularity the two may exist in the same cache line. By the point the starting CPU accesses the spin_lock, it is in the same coherency domain and has the same memory attributes as the running PE. So it is unlikely that the spin_lock will fail in its exclusivity here.

To help me understand your problem better: is there some data corruption happening, or is it the case that the starting CPU is accessing the cm_context before it has been initialized properly?

As an experiment, you could move the spin_lock out of the cpu_pwr_domain_node structure and make it a global array in psci_on.c. The cpu_on and on_finish code has to be modified to access this global spin_lock array variable. This will also remove the mixed-attribute accesses you have referred to. Let us know whether that has any effect on your issue.
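
Roughly what I mean (an untested sketch; the array and helper names are just for illustration):

```c
/*
 * Untested sketch of the suggested experiment: a standalone lock array in
 * lib/psci/psci_on.c, indexed by core position, instead of the cpu_lock
 * embedded in cpu_pwr_domain_node.
 */
#include <platform_def.h>	/* PLATFORM_CORE_COUNT */
#include <spinlock.h>

static spinlock_t psci_cpu_on_lock[PLATFORM_CORE_COUNT];

static inline void psci_lock_cpu_on(unsigned int target_idx)
{
	spin_lock(&psci_cpu_on_lock[target_idx]);
}

static inline void psci_unlock_cpu_on(unsigned int target_idx)
{
	spin_unlock(&psci_cpu_on_lock[target_idx]);
}

/* psci_cpu_on_start() and psci_cpu_on_finish() would then call these
 * helpers instead of psci_spin_lock_cpu()/psci_spin_unlock_cpu(). */
```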

glneo commented 6 years ago

Since it is a read, it is unlikely that this will result in incoherent data.

This is what I thought too, but it turns out some platforms still lose coherency.

So it is unlikely that the spin_lock will fail in its exclusivity here.

Again, I thought the same and was wrong. Due to the previous loss of coherency on this cache-line, even data shared after the second PE has entered coherency cannot be guaranteed to be exclusive.

The fix you suggest does prevent this one instance of the issue, but only kicks the issue down the road. By the ARM ARM, all accesses need cache maintenance, and all PEs need barriers. What I'm recommending is an audit of all accesses (read and write) made by the second PE before it turns caching on, to ensure every location accessed is in a cache line that is marked non-cacheable for all cores.

soby-mathew commented 6 years ago

The fix you suggest does prevent this one instance of the issue, but only kicks the issue down the road.

That suggests the problem is not due to these accesses.

By the ARM ARM, all accesses need cache maintenance, and all PEs need barriers. What I'm recommending is an audit of all accesses (read and write) made by the second PE before it turns caching on, to ensure every location accessed is in a cache line that is marked non-cacheable for all cores.

You haven't described the issue you are facing and hence I am struggling to understand the problem.

The specific section you have quoted is for cacheable accesses to a location by PEs with differing shareability attributes. This is not the case for the instance you have pointed out (both the cached and non-cached accesses have the same shareability attribute when accessing cpu_power_domain). There is a read of data adjacent to the spin_lock being accessed by another CPU, but this is not an issue because this adjacent data is initialized once during boot and remains static throughout runtime.

You seem to be suggesting that you don't rely on the spin locks to work if there has been a non-cached access to that cache granule, even though the locks themselves are only accessed in full coherency. The CPU can make speculative accesses once the MMU has marked out regions, and these accesses cannot be determined from software, which would imply that all of the caches have to be invalidated before using spin-locks. That doesn't seem right to me.

The PSCI code has been reviewed intensely and we have accounted for all the accesses made with differing attributes. That doesn't mean it is bug free, but we would need to better understand the problem you are facing.

danh-arm commented 6 years ago

Hi. I'd just like to check a few things:

  1. Are we talking about the upstream TI K3 platform here? I notice K3 is a 4 cluster platform with 2 Cortex-A53s on each cluster. If not, please could you describe the CPU topology?

  2. In the issue described above, is the target CPU on a different cluster to the CPU where the PSCI CPU_ON call originates?

  3. When powering on the first CPU in a cluster on your platform, are you sure the HW automatically guarantees that the new CPU/cluster is cache coherent with the other running CPUs in the system? At least on Arm CCI products prior to CCI-550 (i.e. prior to Arm DynamIQ platforms), it is necessary for software to explicitly enable coherency for the cluster, for example see https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/arm/common/arm_cci.c#L38. Could something equivalent be needed on your platform?
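
For reference on point 3, this is roughly what the explicit coherency entry looks like on CCI-based Arm platforms (based on the arm_cci.c code linked above; whether a K3 equivalent is needed depends on your interconnect):

```c
/*
 * Explicit cluster coherency entry on a CCI-400/CCI-500 style interconnect,
 * as done in plat/arm/common/arm_cci.c. It is typically called early in the
 * warm boot path, before the incoming CPU relies on coherent data.
 */
#include <arch.h>		/* MPIDR_AFFLVL1_VAL() */
#include <arch_helpers.h>	/* read_mpidr_el1() */
#include <cci.h>		/* cci_enable_snoop_dvm_reqs() */

void plat_arm_interconnect_enter_coherency(void)
{
	/* Enable snoop and DVM requests for this CPU's cluster */
	cci_enable_snoop_dvm_reqs(MPIDR_AFFLVL1_VAL(read_mpidr_el1()));
}
```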

glneo commented 6 years ago

@soby-mathew

both the cached and non-cached accesses have the same shareability attribute

This is what we are not sure about. To be a bit lower level, any non-cacheable read or write will generate a non-snooping transaction [0]. And according to the bus documents [1]:

ReadNoSnoop is a read transaction that is used in a region of memory that is not Shareable with other masters.

This would indicate that non-cached is non-shared. Looking back at the ARM ARM, I don't see it explicitly stated what the shareability of non-cached transactions is supposed to be (maybe I missed it, it is a rather large document...). If you can find that, it would help us out (having all the relevant sections collected together would be nice).

@danh-arm

  1. Yes, K3
  2. Yes
  3. No, we are brought into coherency by the cluster power up sequence.

[0] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0500d/ch06s05s02.html
[1] IHI0022E_amba_axi_and_ace_protocol_spec.pdf (needs special access)

soby-mathew commented 6 years ago

This is what we are not sure about. To be a bit lower level, any non-cacheable read or write will generate a non-snooping transaction [0]. And according to the bus documents [1]:

I am not sure how this gets translated on the bus. It seems we would need to experiment with a few things to understand the issue better.

I went through the TI K3 platform port and have a few comments below:

HW_ASSISTED_COHERENCY

Could you set this flag to 0 as this is meant only for DynamIQ-like systems, which means:

  1. CPUs in different clusters are coherent without explicit interconnect programming.
  2. Cache maintenance is done in hardware during power down.
  3. CPU power management is arbitrated and done by hardware.

It seems only point 1 is applicable to TI K3 and hence this flag is not applicable.

USE_COHERENT_MEM

Please enable this flag as it allocates a page in RAM with device memory attributes for the bakery locks in PSCI. This is the preferred option as the other alternative (i.e. bakery locks in normal memory) is much more complicated and relies on cache behaviour, which may muddle the issue.

WARMBOOT_ENABLE_DCACHE_EARLY

Please enable this flag as it is meant for the case when explicit interconnect programming is not required for coherency. This means that the cache can be turned on along with the MMU enable. When this flag is enabled, no shared data is accessed with mixed attributes by multiple PEs in the warm boot code path. During power-off, only the bakery locks and aff_info_state are accessed with mixed attributes (i.e. after the caches are turned OFF and flushed to memory). The bakery locks are meant to be used for mixed memory attribute access, and aff_info_state has sufficient cache maintenance operations in place to handle this.

After the above build options have been changed, please let us know whether the issue is still being reproduced. If so,

  1. Does the issue happen after a single powerup-powerdown cycle? How often does the issue happen?
  2. What is the data getting corrupted?

glneo commented 6 years ago

We have all 3 items for HW_ASSISTED_COHERENCY.

If that flag is going to start targeting DynamIQ platform specifics then it should be re-labeled as such. If it also implies WARMBOOT_ENABLE_DCACHE_EARLY then it should just imply that in the makefile and we can drop the double checking. Would you like a patch for that?

danh-arm commented 6 years ago

We have all 3 items for HW_ASSISTED_COHERENCY

OK, but it's possible this is not working as expected. HW_ASSISTED_COHERENCY is just an optimization and setting this to zero should work even on platforms where these HW features are available.

If that flag is going to start targeting DynamIQ platform specifics then it should be re-labeled as such

No, we deliberately tried to make this generic but it's possible there are different assumptions being made between DynamIQ platforms and other HW_ASSISTED_COHERENCY platforms like yours.

But let's try setting this to zero as Soby suggests to narrow the problem down.

If it also implies WARMBOOT_ENABLE_DCACHE_EARLY then it should just imply that in the makefile ...

Yes, HW_ASSISTED_COHERENCY implies WARMBOOT_ENABLE_DCACHE_EARLY. The latter only has any effect when HW_ASSISTED_COHERENCY is zero. No patch needed thanks.

It might be worth trying with both WARMBOOT_ENABLE_DCACHE_EARLY enabled and disabled, again to try to narrow the problem down more.

glneo commented 6 years ago

OK, but it's possible this is not working as expected. HW_ASSISTED_COHERENCY is just an optimization and setting this to zero should work even on platforms where these HW features are available.

This issue is happening when HW_ASSISTED_COHERENCY is zero. My initial post was not in reference to any specific platform: most platforms happen to start cores in such a way that the second core gets to the lock only after the first core has written the needed data, but K3 (and whatever platform was in use by https://github.com/ARM-software/tf-issues/issues/451) does not end up so lucky. I would like to solve this generically so it does not bite someone else.

No, we deliberately tried to make this generic but it's possible there are different assumptions being made between DynamIQ platforms and other HW_ASSISTED_COHERENCY platforms like yours.

Then make up your minds; @soby-mathew, in the comment right above yours, said:

HW_ASSISTED_COHERENCY: Could you set this flag to 0 as this is meant only for DynamIQ-like systems

And in the locking thread:

The HW_ASSISTED_COHERENCY flag is intended for use in Arm DynamIQ platforms (or similar), which are v8.2+.

We do not need to narrow the problem down. I have pointed out exactly what the problem is in the starting post. I'm not looking for workarounds for my platform; I would like to just fix this issue.

danh-arm commented 6 years ago

I would like to solve this generically so it does not bite someone else.

Of course, but it will be easier for us to diagnose if we have a specific platform we can study and a clear description of the symptoms. You are claiming there's a problem in the generic code, but we think there could also be a problem in the platform-specific code or the hardware itself. We were looking at the upstream K3 code but it sounds like you are using a modified version. If so, can you point us to this? It would be good to get an answer to Soby's previous question: "To help me understand your problem better: is there some data corruption happening, or is it the case that the starting CPU is accessing the cm_context before it has been initialized properly?"

Then make up your minds ...

There is no contradiction between what Soby and I said. The first quote says "DynamIQ-like systems" and the second quote says "DynamIQ platforms (or similar)". As far as we can tell, K3 is both "DynamIQ-like" and "similar to DynamIQ". If it turns out that HW_ASSISTED_COHERENCY only works for DynamIQ systems then we would need to either fix the code under this flag or rename the flag (as you suggest). But it seems this is a red herring anyway, since you say the problem occurs when HW_ASSISTED_COHERENCY is zero.

I have pointed out exactly what the problem is in the starting post

You have pointed out some generic code where you claim there is a problem. We don't accept that yet (for the reasons Soby stated above), which is why we're trying to get more information in order to help.

glneo commented 6 years ago

This is with K3; the only modification from upstream is turning HW_ASSISTED_COHERENCY off. It is on in upstream specifically to work around this issue, but we see other issues with coherency in TF-A later that I believe are related, so I would like to solve this first for sanity's sake.

To further describe the issue: yes, the starting CPU is accessing the cm_context before it has been initialized properly. The reason is that the starting CPU ignores the lock here: https://github.com/ARM-software/arm-trusted-firmware/blob/master/lib/psci/psci_on.c#L195. And the reason it ignores the lock is that the first core has this lock in its cache, but due to the cache-off access by the starting core to data in the same cache line, the second core never snoops the data from the first core, even after the second core has entered coherency. All this aligns with the ARM spec as far as we can tell. All accesses can break coherency and need to be accounted for, not just writes.

The only ways around this, as far as I can tell, are my solutions in my first post; most likely, any and all data accessed (read or write) by any core with cache off must be in its own cache line, and that line marked as non-cacheable for all cores.

soby-mathew commented 6 years ago

It is on in upstream specifically to work around this issue, but we see other issues with coherency in TF-A later that I believe are related,

That's worrying. It seems there are underlying problems which need to be understood.

And the reason it ignores the lock is that the first core has this lock in its cache, but due to the cache-off access by the starting core to data in the same cache line, the second core never snoops the data from the first core, even after the second core has entered coherency.

This is the first suggestion I made. If you move the spin_lock out of the cpu_power_domain structure, then you don't have the case of the second core accessing the cache line prior to entering coherency. But you already said that doesn't solve the problem.

The only ways around this, as far as I can tell, are my solutions in my first post; most likely, any and all data accessed (read or write) by any core with cache off must be in its own cache line, and that line marked as non-cacheable for all cores.

That solution will not fly. The ARM ARM makes suggestions on how to handle mixed-attribute accesses and we adhere to them. As I said earlier, when you enable WARMBOOT_ENABLE_DCACHE_EARLY = 1, there is no shared data being accessed with mixed attributes in the warm boot code path. But you still have issues.

Also the upstream code seems to send an sev for waking up the CPU. Are all the CPUs waiting in a holding pen to be sent an event? Also, the power off hook doesn't do anything.
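
For clarity, by a "holding pen" I mean something along these lines (an illustration only, not the actual K3 code; the mailbox name is hypothetical):

```c
/*
 * Illustrative holding-pen loop: secondary CPUs park here out of reset with
 * caches off until the primary publishes an entrypoint in a per-CPU mailbox
 * and issues sev().
 */
#include <stdint.h>
#include <arch_helpers.h>	/* wfe() */

/* Hypothetical per-CPU mailbox, readable by the secondaries with caches off */
extern volatile uintptr_t hold_pen_entrypoint[];

void secondary_hold_pen(unsigned int core_pos)
{
	while (hold_pen_entrypoint[core_pos] == 0U)
		wfe();

	/* Jump to the published warm boot entrypoint */
	((void (*)(void))hold_pen_entrypoint[core_pos])();
}
```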

You haven't tried the USE_COHERENT_MEM=1 build option. If you let us know your observations with this flag, it will be helpful.

glneo commented 6 years ago

That's worrying. It seems there are underlying problems which need to be understood.

I agree, that's why I started this issue :)

But you already said that doesn't solve the problem

That's not what I said; it does work around the issue, but just moving the lock out only makes the lock work, while the other data on that cache line is still out of coherency. The better fix would be to move the offending read data ".parent_node" to its own cache line, so all adjacent data stays coherent.
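
Something like this (an untested sketch; CACHE_WRITEBACK_GRANULE and __aligned are the existing TF-A macros, the struct follows psci_private.h approximately):

```c
/*
 * Untested sketch of the layout change I mean: give the field that is read
 * with caches off its own cache-line granule, so the lock's line is never
 * touched by a non-cacheable access. (The stronger proposal from my first
 * post is to also map such lines as non-cacheable for every core.)
 */
typedef struct cpu_pwr_domain_node {
	u_register_t mpidr;

	/* Read by the starting core with caches off: isolate it */
	unsigned int parent_node __aligned(CACHE_WRITEBACK_GRANULE);

	/* Only ever accessed with caches on */
	spinlock_t cpu_lock __aligned(CACHE_WRITEBACK_GRANULE);
} cpu_pd_node_t;
```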

when you enable WARMBOOT_ENABLE_DCACHE_EARLY = 1

That is not the case we are talking about here; I know it can be worked around, and we do that already. I'm worried that this may be the sign of a bigger problem. It still seems "access" is taken to include only writes by cores with caches off, but for us even reads will break coherency, so even more care must be taken when caches are off. This is allowed by the ARM ARM; it's just that it seems to not be needed on most platforms.

Also the upstream code seems to send an sev for waking up the CPU.

Our upstream is not yet complete; we are waiting another week or so to upstream the rest, until some ABI changes get made in our system firmware, so we don't break any already-upstream code. Most of the PSCI implementation will be filled out then; here is a sample that addresses your comment: https://github.com/glneo/arm-trusted-firmware/commit/8db4478a061091858f86ecf219a51be49409e4ca

danh-arm commented 6 years ago

Sorry for the silence on this. Soby is on holiday for 2 weeks and I don't have much bandwidth to help.

Regarding the earlier comment:

This would indicate that non-cached is non-shared. Looking back at the ARM ARM, I don't see it explicitly stated what the shareability of non-cached transactions is supposed to be (maybe I missed it, it is a rather large document...). If you can find that, it would help us out (having all the relevant sections collected together would be nice).

Section "B2.7.1 Normal memory" of the Arm ARM states:

"Because all data accesses to Non-cacheable locations are data coherent to all observers, Non-cacheable locations are always treated as Outer Shareable."

So the accesses from both the originating CPU (with caches on) and the target CPU (with caches off) should be Outer Shareable, which in turn means it should be possible to use software cache management, as stated by Soby before and described in the bullet points following this quote from section "B2.8 Mismatched memory attributes":

"If the mismatched attributes for a memory location all assign the same shareability attribute to a Location that has a cacheable attribute, any loss of uniprocessor semantics, ordering, or coherency within a shareability domain can be avoided by use of software cache management."

After looking at the A53 TRM, I see that a non-cacheable/outer-shareable load will still generate a ReadNoSnoop ACE transaction, the same as it would if this were a non-shareable load. This means the originating CPU has to do cache maintenance on any data it writes that the target CPU will access before enabling caches, otherwise the target CPU could read stale data from main memory.
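
In other words, the pattern on the originating CPU needs to be roughly the following (a sketch only; flush_dcache_range is the existing TF-A helper, the data structure here is purely illustrative):

```c
/*
 * Sketch of the pattern described above. 'boot_args' is illustrative and
 * stands in for whatever data the incoming CPU will read before it enables
 * its caches (e.g. the warm boot entrypoint/context).
 */
#include <stdint.h>
#include <arch_helpers.h>	/* flush_dcache_range() */

struct boot_args {
	uintptr_t entrypoint;
	uint64_t context_id;
};

static struct boot_args boot_args;

static void publish_boot_args(uintptr_t ep, uint64_t ctx)
{
	boot_args.entrypoint = ep;
	boot_args.context_id = ctx;

	/*
	 * Clean and invalidate to the point of coherency so the target CPU's
	 * non-snooping (cache-off) read sees the new values in memory rather
	 * than stale data.
	 */
	flush_dcache_range((uintptr_t)&boot_args, sizeof(boot_args));
}
```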

As far as I can see we do this correctly, but clearly something is going wrong...

danh-arm commented 6 years ago

Actually, I'm wondering if the read of ".parent_node" with caches off is populating a cache line with stale data from main memory (e.g. the spinlock), which is then read when caches are enabled. If so, we should invalidate the spinlock memory (and any other data shared in the cache line), or use a separate cache line as you previously suggested.

Could you try adding an inv_dcache_range just before acquiring the spinlock on the target CPU? If that works, we should check more carefully if there are any other cases where a cache line contains some data which is always coherent, and other data where coherency is managed by software.

danh-arm commented 6 years ago

Please ignore my last comment. Accesses when caches are off should not affect the cache at all. Also, invalidating a location with caches on will invalidate the location for all CPUs in the coherency domain, which is not what we want.

Following some offline discussion, it seems that on this hardware, the read of ".parent_node" with caches off prevents any future snoops to the same cache line after caches are enabled, which is not what we expect. As far as we can tell the code is architecturally compliant, so we'd rather not add special code to handle this unless there is no other workaround or such special code has low impact.

Let us know if you need more help.