BCMmodule / Hardware

Repository for BCM1 and BCS1 modules

Intermittent eMMC problems with BCM1 #4

Closed avian2 closed 7 years ago

avian2 commented 7 years ago

Hi

We have designed a custom board around the BCM1 module. 2 out of 3 prototypes we made so far are suffering from intermittent problems with the on-board eMMC. These problems manifest themselves when using eMMC (mmcblk1) as the Linux root filesystem, apparently at random. We see kernel messages like the following on the serial console. When this happens, all eMMC operations result in errors and the board does not recover on its own. Normal operation is only resumed after a power cycle.

[73039.454537] mmcblk1: timed out sending r/w cmd command, card status 0xe00
[73039.461571] blk_update_request: I/O error, dev mmcblk1, sector 3426864
[73039.469737] Aborting journal on device mmcblk1p1-8.
[73039.506790] mmcblk1: timed out sending r/w cmd command, card status 0x400e00
[73039.572589] mmcblk1: timed out sending r/w cmd command, card status 0x400e00
[73039.638407] mmcblk1: timed out sending r/w cmd command, card status 0x400e00
[73039.706269] mmcblk1: timed out sending r/w cmd command, card status 0x400e00
[73039.779561] mmcblk1: timed out sending r/w cmd command, card status 0x400e00
[73039.919481] mmcblk1: timed out sending r/w cmd command, card status 0x400e00
[73039.926706] blk_update_request: I/O error, dev mmcblk1, sector 3416064
[73039.933373] Buffer I/O error on dev mmcblk1p1, logical block 425984, lost sync page write
[73039.941734] JBD2: Error -5 detected when updating journal superblock for mmcblk1p1-8.
[73039.984706] mmcblk1: timed out sending r/w cmd command, card status 0x400e00
[73040.018083] mmcblk1: timed out sending r/w cmd command, card status 0x400e00
[73040.051329] mmcblk1: timed out sending r/w cmd command, card status 0x400e00
[73040.084700] mmcblk1: timed out sending r/w cmd command, card status 0x400e00
[73040.117974] mmcblk1: timed out sending r/w cmd command, card status 0x400e00
[73040.151030] mmcblk1: timed out sending r/w cmd command, card status 0x400e00
[73040.158163] blk_update_request: I/O error, dev mmcblk1, sector 8192
[73040.164521] Buffer I/O error on dev mmcblk1p1, logical block 0, lost sync page write
[73040.172608] EXT4-fs error (device mmcblk1p1): ext4_journal_check_start:56: Detected aborted journal
[73040.181860] EXT4-fs (mmcblk1p1): Remounting filesystem read-only
[73040.187935] EXT4-fs (mmcblk1p1): previous I/O error to superblock detected
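
For readers trying to interpret the "card status" words in the log above, here is a minimal decoding sketch. It assumes the value is a standard MMC R1 card status word and uses the bit positions from the R1_* definitions in the Linux kernel's include/linux/mmc/mmc.h. Only a subset of flags is decoded, the helper name decode_card_status is hypothetical, and reading bit 6 as EXCEPTION_EVENT assumes an eMMC 4.5 or later device.

```python
# Decode an MMC R1 card status word as printed by the Linux mmc block driver.
# Assumption: bit layout as in the kernel's R1_* macros (include/linux/mmc/mmc.h).
# Only a subset of the status flags is handled here.

CURRENT_STATES = ["idle", "ready", "ident", "stby", "tran", "data", "rcv", "prg", "dis"]

FLAGS = {
    22: "ILLEGAL_COMMAND",
    8: "READY_FOR_DATA",
    7: "SWITCH_ERROR",
    6: "EXCEPTION_EVENT",  # assumption: eMMC 4.5+ meaning of bit 6
}


def decode_card_status(status):
    """Return the CURRENT_STATE name and the set flags for a status word."""
    state_index = (status >> 9) & 0xF  # CURRENT_STATE occupies bits 12:9
    if state_index < len(CURRENT_STATES):
        state = CURRENT_STATES[state_index]
    else:
        state = "other"
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if status & (1 << bit)]
    return {"state": state, "flags": flags}


for word in (0xE00, 0x400E00, 0x400E40):
    print(hex(word), decode_card_status(word))
```

Under these assumptions, all three values seen in this thread decode to the "prg" (programming) state with READY_FOR_DATA clear, and 0x400e00 additionally has ILLEGAL_COMMAND set. That would be consistent with a card that is not leaving the programming state, but it does not by itself identify the cause.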

The affected boards will sometimes also not boot after a power cycle. In that case, only a series of CCCCC characters is seen on the serial console (no normal bootloader messages). For the two affected boards, this happens approximately once per day. The third board has been running continuously for many days without encountering this problem.

At the moment we cannot reliably reproduce the problem in testing. Loading the CPU, eMMC and SD card interfaces does not seem to trigger the problem. After a power cycle, extensively testing the eMMC flash with badblocks (read-only and r/w tests) does not find any failures.
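
Because the fault is hard to trigger on demand, one option is to collect evidence from saved console logs instead. The following is a rough sketch, not a tested tool: it scans log lines for the timeout message shown above and counts how often each device and status combination occurs, and when it was first seen. The function name summarize_timeouts and the regular expression are illustrative assumptions based only on the message format in this report.

```python
import re
from collections import Counter

# Matches the timeout message from the log above; the kernel timestamp
# prefix is optional because some captures do not include it.
TIMEOUT_RE = re.compile(
    r"(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?"
    r"(?P<dev>mmcblk\d+): timed out sending r/w cmd command, "
    r"card status (?P<status>0x[0-9a-fA-F]+)"
)


def summarize_timeouts(lines):
    """Count timeouts per (device, status) and note the first timestamp."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue
        key = (match.group("dev"), int(match.group("status"), 16))
        counts[key] += 1
        if key not in first_seen and match.group("ts") is not None:
            first_seen[key] = float(match.group("ts"))
    return counts, first_seen


sample = [
    "[73039.454537] mmcblk1: timed out sending r/w cmd command, card status 0xe00",
    "[73039.461571] blk_update_request: I/O error, dev mmcblk1, sector 3426864",
    "[73039.506790] mmcblk1: timed out sending r/w cmd command, card status 0x400e00",
    "[73039.572589] mmcblk1: timed out sending r/w cmd command, card status 0x400e00",
]
counts, first_seen = summarize_timeouts(sample)
for (dev, status), n in counts.items():
    print(dev, hex(status), n, first_seen.get((dev, status)))
```

The same function could be fed the lines of a captured serial console log, so that the first-seen timestamps can be compared against supply voltage measurements taken over the same period.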

We are using the 4.4.30-ti-r64 kernel and a userland based on the bone-debian-8.6-seeed-iot-armhf-2016-11-06-4gb.img image. We use a custom device tree. However the fact that these boards will occasionally not reach the bootloader stage suggests that this is not a problem with the Linux system.

We found similar problems to what we are experiencing described here (a custom AM335x-based system) and here (BeagleBone Black). The first discussion suggests that the cause is a hardware problem on the MMC bus, however they are using an SD card and the reported card status number is different.

We don't rule out a problem with our hardware design. However, the connection between the AM335x and the MMC flash runs entirely on the BCM1 module, and we currently don't see how a fault in our design could cause these symptoms. We suspected the power supply, but we see no immediately apparent problems. However, we couldn't find the datasheet for the specific Hynix eMMC flash IC used on the BCM1, so we can't say for sure that the supply voltage variations we see are within spec.

We would appreciate any help with this issue.

BCMmodule commented 7 years ago

Tomaž,

Thank you for your detailed description. Using a standard BeagleCore BCS1 or similar with the BCM1, we do not see these issues. To provide more qualified information, we would need to know the details of your custom baseboard and the software set you are running, and possibly even receive a prototype for testing.

By the way: we received a very similar request from one of your colleagues via email at almost the same time. The SK Hynix eMMC datasheet is not easy to obtain and we are not allowed to publish it here (SK hynix 1xnm_32Gb based 4GB_Rev1 0.pdf). We emailed it to your colleague on Friday.

Hope this helps.

avian2 commented 7 years ago

Sorry, I was not aware that @urbangregorc was already in contact with you over email. I'll leave further communication regarding this issue to him then.

Regarding the Hynix datasheet, my colleague has indeed received it, thank you. From our current measurements it appears that the power supply variations we see are within the specs for the eMMC chip.

BCMmodule commented 7 years ago

I have closed this issue, since communication has shifted to email. If necessary, please re-open.

cml12 commented 6 years ago

Hi,

We are in the same situation as avian2.

mmcblk1: error -110 sending status command, retrying
mmcblk1: timed out sending r/w cmd command, card status 0xe40
mmcblk1: timed out sending r/w cmd command, card status 0x400e40
mmcblk1: timed out sending r/w cmd command, card status 0x400e40
mmcblk1: timed out sending r/w cmd command, card status 0x400e40
mmcblk1: timed out sending r/w cmd command, card status 0x400e40
mmcblk1: timed out sending r/w cmd command, card status 0x400e40
mmc1: tried to reset card
blk_update_request: I/O error, dev mmcblk1, sector 0
Buffer I/O error on dev mmcblk1, logical block 0, lost async page write
mmcblk1: timed out sending r/w cmd command, card status 0x400e40
mmcblk1: timed out sending r/w cmd command, card status 0x400e40
mmcblk1: timed out sending r/w cmd command, card status 0x400e40
mmcblk1: timed out sending r/w cmd command, card status 0x400e40
mmcblk1: timed out sending r/w cmd command, card status 0x400e40
mmcblk1: timed out sending r/w cmd command, card status 0x400e40
mmc1: tried to reset card

Any help is appreciated!

BCMmodule commented 6 years ago

Hi cml12,

We can send you the SK Hynix eMMC datasheet by email if you send your contact details to core@beaglecore.com. Since we are not allowed to publish it here, this is the only way I can offer right now. Otherwise we will need more detail about the baseboard, etc., as stated in our answer above.

urbangregorc commented 6 years ago

Hey

In our case the eMMC problem was related to a poorly regulated output voltage (VDD_3V3B) from the TL5209, which is the supply voltage for the eMMC. When we observed VDD_3V3B over longer periods of time, we noticed that at some random point the voltage started to oscillate. After analysing the TL5209 and our custom-designed board we realized these two things:

So our solution was to connect a dummy resistor (330 Ohm should be more than enough) to VDD_3V3B, in order to properly load the TL5209 and keep its output voltage stable at all times.
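
As a quick sanity check on the resistor value, the extra load it places on the rail can be worked out from Ohm's law. This sketch assumes a nominal 3.3 V on VDD_3V3B (inferred from the rail name) and the 330 Ohm value mentioned above. The TL5209's actual minimum load requirement is not quoted in this thread and should be taken from its datasheet. The function name dummy_load is hypothetical.

```python
def dummy_load(v_out, r_load):
    """Return (current in A, dissipated power in W) for a resistor on a rail."""
    current = v_out / r_load   # Ohm's law: I = V / R
    power = v_out * current    # P = V * I
    return current, power


# Assumed values: nominal 3.3 V rail, 330 Ohm dummy resistor.
current_a, power_w = dummy_load(3.3, 330.0)
print(f"{current_a * 1e3:.1f} mA, {power_w * 1e3:.1f} mW")
```

With these assumed values the resistor draws about 10 mA and dissipates about 33 mW, which a common small resistor should tolerate, subject to checking its power rating.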

Hope this helps :)

stephanecharette commented 3 years ago

We have had a dozen BeagleBone Blacks installed for several years, and recently 3 of them started showing similar symptoms, with "C" characters printed on the console. These are plain BBBs: no modifications, no capes, no USB devices. When I try to reboot, the serial console shows:

(initramfs) reboot
[  154.646564] mmc1: cache flush error -110
[  154.655985] reboot: Restarting system
CCCCCCCC

Then it hangs there forever. When I disconnect them and power up, I get lots of these errors:

[   17.147031] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   17.204816] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   17.261987] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   17.319158] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   17.376335] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   17.433507] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   17.440609] Buffer I/O error on dev mmcblk0p1, logical block 0, async page read
[   17.498218] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   17.555404] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   17.612578] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   17.669748] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   17.726916] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   17.784086] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   17.791186] Buffer I/O error on dev mmcblk0p1, logical block 0, async page read
fsck.ext4: Attempt to read block from filesystem resulted in short read while trying to re-open BOOT

BOOT: ********** WARNING: Filesystem still has errors **********

fsck exited with status code 12
The root filesystem on /dev/mmcblk0p1 requires a manual fsck

Attempting to run fsck did not help:

BusyBox v1.22.1 (Ubuntu 1:1.22.0-15ubuntu1) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs) fsck /dev/mmcblk0p1
fsck from util-linux 2.27.1
[   49.620684] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   49.678084] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   49.735277] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   49.792451] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   49.849622] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   49.906847] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   49.913961] blk_update_request: I/O error, dev mmcblk0, sector 7552896
[   49.970796] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   50.027994] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   50.085174] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   50.142346] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   50.199517] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   50.256687] mmcblk0: timed out sending r/w cmd command, card status 0x400e40
[   50.263782] blk_update_request: I/O error, dev mmcblk0, sector 7552896
[   50.270355] Buffer I/O error on dev mmcblk0p1, logical block 943856, async page read
fsck: error 2 (No such file or directory) while executing fsck.ext2 for /dev/mmcblk0p1
(initramfs)

christianhalter commented 3 years ago

Hi,

I have the same problem as @avian2, but I'm not using a Beagle board. My board is an adaptation of an Olimex design, but that's not the point. I don't know whether the issue is power related, or even whether it can be solved by a software patch. It happens on very few boards and it goes away when I restart the board. It appears again only when I discharge all the capacitors and power on the board. I tried @urbangregorc's solution but it didn't work. Does anyone have another solution? Thanks in advance.

Update: I solved the issue by adding a weak pull-down on the eMMC's hardware reset pin. Hope this helps somebody!