ARMmbed / mbed-os

Arm Mbed OS is a platform operating system designed for the internet of things
https://mbed.com

DISCO_F746NG QSPI WriteEnable might Fail on IAR8 #10049

Closed offirko closed 5 years ago

offirko commented 5 years ago

Description

Following the https://jira.arm.com/browse/IOTSTOR-798 ticket.

When running storage tests on DISCO_F746NG with IAR8, the test features-storage-tests-kvstore-static_tests fails.

The same board and test pass on IAR7, as well as on GCC_ARM and ARM.

The test fails at this line: https://github.com/ARMmbed/mbed-os/blob/master/features/storage/TESTS/kvstore/static_tests/main.cpp#L296

Drilling down, the failure is in sending write_enable to the QSPI flash, which eventually fails on timeout: https://github.com/ARMmbed/mbed-os/blob/84e4decad045397b7b28e9ba228df64ff3ffbaec/targets/TARGET_STM/qspi_api.c#L301

After that, data cannot be written to the device until it is reset.

The test uses the kvstore file system to add key/value pairs which hold the values: "name_a", "name_b", "name_c", ..., "name_z"

For some strange reason, the combination of “name_o” followed by “name_p” causes the bug. Even if we skip all the previous entries and only set “name_o” followed by “name_p” it fails.
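A host-side sketch of how the failing pair could be narrowed down by setting only a sub-range of the names. Everything here is hypothetical: `set_fn`, `set_name_range` and `stub_set` are illustrative names, the same string is assumed to serve as both key and value, and the stub only mimics the reported symptom. On target, the callback would wrap the real KVStore call instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical setter signature; on target this would wrap the real KVStore call. */
typedef int (*set_fn)(const char *key, const char *value);

/* Set "name_<first>" .. "name_<last>" in order. Return the letter of the
 * first key whose setter fails, or 0 if every call succeeds. */
char set_name_range(char first, char last, set_fn set)
{
    assert(set != NULL && first >= 'a' && last <= 'z' && first <= last);
    for (char c = first; c <= last; ++c) {
        char key[8]; /* "name_" + one letter + NUL fits in 7 bytes */
        snprintf(key, sizeof key, "name_%c", c);
        if (set(key, key) != 0) { /* assumption: value equals key */
            return c;
        }
    }
    return 0;
}

/* Host-side stub that mimics the reported symptom only:
 * "name_p" fails when it directly follows "name_o". */
int stub_set(const char *key, const char *value)
{
    static char prev = 0;
    (void)value;
    int fail = (prev == 'o' && key[5] == 'p');
    prev = key[5];
    return fail ? -1 : 0;
}
```

Calling `set_name_range('o', 'p', ...)` corresponds to the reduced case described above, where all earlier entries are skipped.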

Issue request type

[ ] Question
[ ] Enhancement
[X] Bug

offirko commented 5 years ago

@jeromecoutant, @adustm: I'd appreciate your input. Thanks.

ciarmcom commented 5 years ago

Internal Jira reference: https://jira.arm.com/browse/MBOCUSTRIA-990

offirko commented 5 years ago

@TuomoHautamaki - my current analysis is that after several successful program commands to the QSPI flash, a new program command fails on Write Enable. All further program/read/erase commands fail with HAL_BUSY. We need ST and HAL support on this case.

offirko commented 5 years ago

@ARMmbed/mbed-os-maintainers - Please assign this issue to STM people

offirko commented 5 years ago

@VVESTM, @jeromecoutant, @adustm: the HAL QSPI gets stuck at a certain stage of the test at:

https://github.com/ARMmbed/mbed-os/blob/db8a018fece6f57fbddddf0039b93e43557ecaf1/targets/TARGET_STM/TARGET_STM32F7/device/stm32f7xx_hal_qspi.c#L696

Eventually the 5-second timeout expires and the error state is set:

https://github.com/ARMmbed/mbed-os/blob/db8a018fece6f57fbddddf0039b93e43557ecaf1/targets/TARGET_STM/TARGET_STM32F7/device/stm32f7xx_hal_qspi.c#L2167

The QSPI HAL is then stuck, and no read/program/erase commands can be made until the device is reset!
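The behaviour described here (a poll that times out, latches an error state, and makes every later command return busy) can be modelled on a host with a small simulation. This is only a sketch: the `sim_*` types and functions are hypothetical stand-ins, not the ST HAL API, and the "flag" is a simulated tick count rather than a hardware register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the HAL status and state types. */
typedef enum { SIM_OK, SIM_BUSY, SIM_TIMEOUT } sim_status_t;
typedef enum { SIM_STATE_READY, SIM_STATE_ERROR } sim_state_t;

typedef struct {
    sim_state_t state;      /* driver state, latched on failure */
    uint32_t ticks_to_flag; /* simulated ticks until the flag is raised */
} sim_qspi_t;

/* Poll for the simulated completion flag, giving up after timeout_ticks. */
static sim_status_t sim_wait_flag(sim_qspi_t *h, uint32_t timeout_ticks)
{
    assert(h != NULL);
    for (uint32_t tick = 0; tick <= timeout_ticks; ++tick) {
        if (tick >= h->ticks_to_flag) {
            return SIM_OK;
        }
    }
    h->state = SIM_STATE_ERROR; /* latched: no command path clears it */
    return SIM_TIMEOUT;
}

/* Issue a command: refused unless the driver is in the READY state. */
sim_status_t sim_command(sim_qspi_t *h, uint32_t timeout_ticks)
{
    assert(h != NULL);
    if (h->state != SIM_STATE_READY) {
        return SIM_BUSY;
    }
    return sim_wait_flag(h, timeout_ticks);
}

/* Stand-in for a device reset: the only way back to READY in this model. */
void sim_reset(sim_qspi_t *h)
{
    assert(h != NULL);
    h->state = SIM_STATE_READY;
}
```

In this model a single timeout is enough to make all subsequent commands fail with `SIM_BUSY` until `sim_reset` is called, which matches the symptom reported in this thread.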

0xc0170 commented 5 years ago

cc @ARMmbed/team-st-mcd

VVESTM commented 5 years ago

There is also something related to the toolchain. Can someone who knows IAR see what the issue might be? Could it be memory corruption? For information, the problem always occurs at the same place. If we rename a variable or change name_a to name_A, the problem moves or "disappears". The same happens if we remove optimizations in the compiler options.

VVESTM commented 5 years ago

Regarding optimizations, I made a test in the develop.json file. We do not see the problem if we remove optimizations on the C++ parts (-On option instead of -Oh):

    "IAR": {
        "common": [
            "--no_wrap_diagnostics", "-e",
            "--diag_suppress=Pa050,Pa084,Pa093,Pa082", "--enable_restrict",
            "-DMBED_TRAP_ERRORS_ENABLED=1"],
        "asm": [],
        "c": ["--vla", "--diag_suppress=Pe546", "-Oh"],
        "cxx": ["--guard_calls", "--no_static_destruction", "-On"],
        "ld": ["--skip_dynamic_initialization", "--threaded_lib"]
    }

Does this mean that the problem could be in the C++ part?

jeromecoutant commented 5 years ago

@kjbracey-arm @pan- Could you have a look at the questions we have around C++ and IAR? Thanks.

VVESTM commented 5 years ago

One more point. On @lmestm's side, the test passes. The difference is the IAR version:

Test passed: IAR ELF Linker V8.32.2.178/W32 for ARM (EWARM-CD-8322-19423.exe)
Test failing: IAR ELF Linker V8.32.3.193/W32 for ARM (EWARM-CD-8323-20228.exe)

offirko commented 5 years ago

I've noticed there's a known issue for this device in IAR: [EWARM-5402, EW26024] Missing FIFO definition for register SPI1->CR2 in the SVD file for ST STM32F746

http://supp.iar.com/FilesPublic/UPDINFO/013240/arm/doc/infocenter/ewarm.ENU.html

offirko commented 5 years ago

@VVESTM - please note the problem is reproduced in my environment using IAR ELF Linker V8.32.1.169/W32 for ARM. Also, I've used the "no optimization" cxx setup: "cxx": ["--guard_calls", "--no_static_destruction", "-On"],

With a bit of code variation, I reproduced the problem again, this time when trying to set "name_b".

offirko commented 5 years ago

CC: @screamerbg

cmonr commented 5 years ago

@ARMmbed/mbed-os-test @ARMmbed/mbed-os-core @ARMmbed/mbed-os-maintainers

Fyi: https://github.com/ARMmbed/mbed-os/issues/10049#issuecomment-475669701

offirko commented 5 years ago

@VVESTM - Disabling the data cache with a call to SCB_DisableDCache() at the beginning of the test case resolves the problem (the rest of the setup is default).

(could it after all be related to: https://github.com/ARMmbed/mbed-os/issues/9934#issuecomment-472454548 ?)

kjbracey commented 5 years ago

Although the STM32F7 is vulnerable to cache issues that other boards don't see, I don't believe there's any direct reason for this interface to be vulnerable. It's not being used as a bus-mastering interface like Ethernet, it just has a FIFO you access as programmed memory/mapped I/O, right? Should be no more problematic than the UART. (On the other hand #9934 is quite likely a cache issue).

So the optimisation and cache effects smell to me like a timing issue - maybe you're just slowing it down.

Alternatively, it could be that the cache change is a red herring, and that it's just the act of inserting the call that moves code around again. :/

It's possible there's a compiler bug, or some code triggering undefined behaviour only in this compiler, but we'd need to pin down a bit closer what's actually going wrong.

There must be one initial transfer that times out - for that transfer we'd want to see how the peripheral had been programmed. Did we program incorrect values? If so, where did those incorrect values come from? Is the hardware signalling something that we're missing? We're waiting for the TC flag - is it signalling TE?
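One way to answer the "is it signalling TE?" question is to snapshot the status register at the moment of the timeout and classify it, checking the error flag before the completion flag. The sketch below is a host-testable helper under stated assumptions: the bit positions used for TEF, TCF and BUSY are assumed and should be verified against the STM32F7 reference manual, and `classify_sr` and `xfer_verdict_t` are hypothetical names, not part of the HAL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed QUADSPI status register bit positions; verify against the
 * reference manual before relying on them. */
#define SR_TEF  (1u << 0) /* transfer error flag */
#define SR_TCF  (1u << 1) /* transfer complete flag */
#define SR_BUSY (1u << 5) /* operation in progress */

typedef enum {
    XFER_COMPLETE,
    XFER_ERROR,
    XFER_IN_PROGRESS,
    XFER_IDLE_NO_FLAG
} xfer_verdict_t;

/* Classify one snapshot of the status register. The error flag is checked
 * first, because a loop that waits on TC alone would never notice it. */
xfer_verdict_t classify_sr(const volatile uint32_t *sr_reg)
{
    assert(sr_reg != NULL);
    uint32_t sr = *sr_reg; /* read once so all checks see the same value */
    if (sr & SR_TEF) {
        return XFER_ERROR;
    }
    if (sr & SR_TCF) {
        return XFER_COMPLETE;
    }
    if (sr & SR_BUSY) {
        return XFER_IN_PROGRESS;
    }
    return XFER_IDLE_NO_FLAG;
}
```

Logging this verdict for the first transfer that times out would distinguish a transfer error from a peripheral that is still busy or one that never started.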

If there ever is a timeout, as was pointed out above, the state gets locked into "error", so it never works again. Is that reasonable? Is this supposed to be a reliable interface?

dannybenor commented 5 years ago

@VVESTM We see that this issue is reproducible but also fragile, meaning small changes to the test, like adding prints or playing with the cache, will "fix" the problem. We need your help investigating the root cause of why the QSPI gets stuck.

VVESTM commented 5 years ago

@dannybenor, I am working on this issue. I will come back when I have news.

jeromecoutant commented 5 years ago

ST_INTERNAL_REF 64387