Closed gsomlo closed 3 years ago
Stack trace for me after this patch on mor1kx
:
Branch: https://github.com/stffrdhrn/linux/commits/or1k-5.12-litex-net
diff --git a/drivers/mmc/host/litex_mmc.c b/drivers/mmc/host/litex_mmc.c
index 9bdc088dd100..1ef1fc9d7560 100644
--- a/drivers/mmc/host/litex_mmc.c
+++ b/drivers/mmc/host/litex_mmc.c
@@ -411,11 +412,12 @@ static int litex_mmc_probe(struct platform_device *pdev)
* However, using multi-block transfers results in DMA timeout errors,
* which is something that requires further investigation!
*/
+#if 0
mmc->max_blk_count = 1; /* only single-block transfers currently work */
/* recalculate max_[req,seg]_size given new max_blk_count */
mmc->max_req_size = mmc->max_blk_size * mmc->max_blk_count;
mmc->max_seg_size = mmc->max_req_size;
-
+#endif
host = mmc_priv(mmc);
host->mmc = mmc;
host->dev = pdev;
Crash
# [ 472.070000] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1H:28]
[ 472.070000] CPU: 0 PID: 28 Comm: kworker/0:1H Not tainted 5.11.0-rc3-00014-gb563c3e3c504-dirty #436
[ 472.070000] Workqueue: kblockd blk_mq_run_work_fn
[ 472.070000] CPU #: 0
[ 472.070000] PC: c0322f0c SR: 0000807f SP: c1905c68
[ 472.070000] GPR00: 00000000 GPR01: c1905c68 GPR02: c1905c70 GPR03: 00000000
[ 472.070000] GPR04: 00000001 GPR05: 00000000 GPR06: 00000000 GPR07: 00000002
[ 472.070000] GPR08: 00000001 GPR09: c0323024 GPR10: c1904000 GPR11: 000001f4
[ 472.070000] GPR12: 00f98300 GPR13: 00000000 GPR14: 04288000 GPR15: 00000000
[ 472.070000] GPR16: bc01801c GPR17: 00000018 GPR18: 000053e3 GPR19: fdefadb9
[ 472.070000] GPR20: 00000001 GPR21: 00077c40 GPR22: 00000001 GPR23: 000001e0
[ 472.070000] GPR24: c4284270 GPR25: 00000005 GPR26: 00000002 GPR27: 00000005
[ 472.070000] GPR28: 00002000 GPR29: 00000000 GPR30: 00000002 GPR31: 00000008
[ 472.070000] RES: 000001f4 oGPR11: ffffffff
[ 472.070000] Process kworker/0:1H (pid: 28, stackpage=c1900000)
[ 472.070000]
[ 472.070000] Stack:
[ 472.070000] Call trace:
[ 472.070000] [<(ptrval)>] __const_udelay+0xb4/0xc4
[ 472.070000] [<(ptrval)>] sdcard_wait_done+0x34/0x94
[ 472.070000] [<(ptrval)>] send_cmd+0x8c/0x228
[ 472.070000] [<(ptrval)>] litex_request+0x1bc/0x400
[ 472.070000] [<(ptrval)>] __mmc_start_request+0x10c/0x170
[ 472.070000] [<(ptrval)>] mmc_start_request+0xc8/0xe8
[ 472.070000] [<(ptrval)>] mmc_blk_mq_issue_rq+0x7e8/0x8a0
[ 472.070000] [<(ptrval)>] ? sbitmap_get+0xe0/0x13c
[ 472.070000] [<(ptrval)>] mmc_mq_queue_rq+0x250/0x2ec
[ 472.070000] [<(ptrval)>] blk_mq_dispatch_rq_list+0x49c/0x67c
[ 472.070000] [<(ptrval)>] ? elv_rb_del+0x74/0x90
[ 472.070000] [<(ptrval)>] ? elv_rqhash_del+0x30/0x48
[ 472.070000] [<(ptrval)>] blk_mq_do_dispatch_sched+0x2f4/0x30c
[ 472.070000] [<(ptrval)>] ? sched_rq_cmp+0x0/0x38
[ 472.070000] [<(ptrval)>] ? update_cfs_rq_load_avg+0x1a0/0x1d4
[ 472.070000] [<(ptrval)>] __blk_mq_sched_dispatch_requests+0x110/0x198
[ 472.070000] [<(ptrval)>] blk_mq_sched_dispatch_requests+0x60/0xac
[ 472.070000] [<(ptrval)>] __blk_mq_run_hw_queue+0x98/0xb8
[ 472.070000] [<(ptrval)>] blk_mq_run_work_fn+0x2c/0x44
[ 472.070000] [<(ptrval)>] process_one_work+0x208/0x2ac
[ 472.070000] [<(ptrval)>] worker_thread+0x32c/0x3dc
[ 472.070000] [<(ptrval)>] ? worker_thread+0x0/0x3dc
[ 472.070000] [<(ptrval)>] kthread+0x130/0x138
[ 472.070000] [<(ptrval)>] ? kthread+0x0/0x138
[ 472.070000] [<(ptrval)>] ret_from_fork+0x1c/0x9c
@gsomlo: The linux driver is also probably re-configuring the SDCard, so I'm wondering if the different behaviour could also be related to a different configuration of the SDCard. Doing capture of the SDCard signals would help on this, I'll try to do some soon.
So during the first "bad" cmd18, this is never true, not even once.
source.ready
is asserted by SDCORE
, so it should be true throughout the entire CMD
state of SDPHYCMDR
.
Which means that source.valid
(provided by SDPHYR
drops back to 0 after having been asserted long enough to get us out of the WAIT
state of SDPHYCMDR
.
The question is, why would it do that... :)
The linux driver is also probably re-configuring the SDCard
Here's the list of command(argument)
values being sent to the SDCard by Linux, all the way through to the first failing cmd18
(it's actually the second overall cmd18
that fails):
[ 3.074876] 0: cmd(arg) = 0(0x00000000)
[ 3.076771] 1: cmd(arg) = 8(0x000001aa)
[ 3.077807] 2: cmd(arg) = 55(0x00000000)
[ 3.079627] 3: cmd(arg) = 41(0x00000000)
[ 3.080656] 4: cmd(arg) = 0(0x00000000)
[ 3.082441] 5: cmd(arg) = 8(0x000001aa)
[ 3.084448] 6: cmd(arg) = 55(0x00000000)
[ 3.085649] 7: cmd(arg) = 41(0x40300000)
[ 3.087543] 8: cmd(arg) = 55(0x00000000)
[ 3.088605] 9: cmd(arg) = 41(0x40300000)
[ 3.090520] 10: cmd(arg) = 55(0x00000000)
[ 3.091632] 11: cmd(arg) = 41(0x40300000)
[ 3.093553] 12: cmd(arg) = 2(0x00000000)
[ 3.094624] 13: cmd(arg) = 3(0x00000000)
[ 3.096440] 14: cmd(arg) = 9(0xd67d0000)
[ 3.097463] 15: cmd(arg) = 7(0xd67d0000)
[ 3.099271] 16: cmd(arg) = 55(0xd67d0000)
[ 3.100245] 17: cmd(arg) = 6(0x00000002)
[ 3.102291] 18: cmd(arg) = 55(0xd67d0000)
[ 3.103495] 19: cmd(arg) = 51(0x00000000)
[ 3.104562] 20: cmd(arg) = 55(0xd67d0000)
[ 3.106389] 21: cmd(arg) = 13(0x00000000)
[ 3.107511] 22: cmd(arg) = 6(0x00fffff0)
[ 3.109358] 23: cmd(arg) = 55(0xd67d0000)
[ 3.110555] 24: cmd(arg) = 6(0x00000002)
[ 3.112438] 25: cmd(arg) = 18(0x00000000)
[ 3.113659] 26: cmd(arg) = 18(0x00000002) <--- This one is the first to fail with a timeout!
@enjoy-digital: any evidence in that sequence of Linux doing something drastically different from the LiteX bios, in terms of configuring the SDCard?
and, in contrast with the timeout errors encountered with multi-block reads (cmd18) in linux, here's a log of the command sequence issued by the LiteX bios while successfully loading bios.bin
from the sdcard with the sdcardboot
command:
litex> sdcardboot
Booting from SDCard in SD-Mode...
Booting from boot.json...
#=- 0 cmd(arg) = 0(0x00000000)
#=- 1 cmd(arg) = 8(0x000001aa)
#=- 2 cmd(arg) = 55(0x00000000)
#=- 3 cmd(arg) = 41(0x70ff8000)
#=- 4 cmd(arg) = 55(0x00000000)
#=- 5 cmd(arg) = 41(0x70ff8000)
#=- 6 cmd(arg) = 55(0x00000000)
#=- 7 cmd(arg) = 41(0x70ff8000)
#=- 8 cmd(arg) = 55(0x00000000)
#=- 9 cmd(arg) = 41(0x70ff8000)
#=- 10 cmd(arg) = 55(0x00000000)
#=- 11 cmd(arg) = 41(0x70ff8000)
#=- 12 cmd(arg) = 55(0x00000000)
#=- 13 cmd(arg) = 41(0x70ff8000)
#=- 14 cmd(arg) = 55(0x00000000)
#=- 15 cmd(arg) = 41(0x70ff8000)
#=- 16 cmd(arg) = 55(0x00000000)
#=- 17 cmd(arg) = 41(0x70ff8000)
#=- 18 cmd(arg) = 55(0x00000000)
#=- 19 cmd(arg) = 41(0x70ff8000)
#=- 20 cmd(arg) = 2(0x00000000)
#=- 21 cmd(arg) = 3(0x00000000)
#=- 22 cmd(arg) = 10(0xd67d0000)
#=- 23 cmd(arg) = 9(0xd67d0000)
#=- 24 cmd(arg) = 7(0xd67d0000)
#=- 25 cmd(arg) = 55(0xd67d0000)
#=- 26 cmd(arg) = 6(0x00000002)
#=- 27 cmd(arg) = 6(0x80fffff1)
#=- 28 cmd(arg) = 55(0xd67d0000)
#=- 29 cmd(arg) = 51(0x00000000)
#=- 30 cmd(arg) = 16(0x00000200)
Booting from boot.bin...
#=- 31 cmd(arg) = 0(0x00000000)
#=- 32 cmd(arg) = 8(0x000001aa)
#=- 33 cmd(arg) = 55(0x00000000)
#=- 34 cmd(arg) = 41(0x70ff8000)
#=- 35 cmd(arg) = 55(0x00000000)
#=- 36 cmd(arg) = 41(0x70ff8000)
#=- 37 cmd(arg) = 55(0x00000000)
#=- 38 cmd(arg) = 41(0x70ff8000)
#=- 39 cmd(arg) = 55(0x00000000)
#=- 40 cmd(arg) = 41(0x70ff8000)
#=- 41 cmd(arg) = 55(0x00000000)
#=- 42 cmd(arg) = 41(0x70ff8000)
#=- 43 cmd(arg) = 55(0x00000000)
#=- 44 cmd(arg) = 41(0x70ff8000)
#=- 45 cmd(arg) = 55(0x00000000)
#=- 46 cmd(arg) = 41(0x70ff8000)
#=- 47 cmd(arg) = 55(0x00000000)
#=- 48 cmd(arg) = 41(0x70ff8000)
#=- 49 cmd(arg) = 2(0x00000000)
#=- 50 cmd(arg) = 3(0x00000000)
#=- 51 cmd(arg) = 10(0xd67d0000)
#=- 52 cmd(arg) = 9(0xd67d0000)
#=- 53 cmd(arg) = 7(0xd67d0000)
#=- 54 cmd(arg) = 55(0xd67d0000)
#=- 55 cmd(arg) = 6(0x00000002)
#=- 56 cmd(arg) = 6(0x80fffff1)
#=- 57 cmd(arg) = 55(0xd67d0000)
#=- 58 cmd(arg) = 51(0x00000000)
#=- 59 cmd(arg) = 16(0x00000200)
#=- 60 cmd(arg) = 18(0x00000000)
#=- 61 cmd(arg) = 12(0x00000000)
#=- 62 cmd(arg) = 18(0x00000800)
#=- 63 cmd(arg) = 12(0x00000000)
#=- 64 cmd(arg) = 18(0x00001820)
#=- 65 cmd(arg) = 12(0x00000000)
Copying boot.bin to 0x80000000 (15701872 bytes)...
#=- 66 cmd(arg) = 18(0x00001828)
#=- 67 cmd(arg) = 12(0x00000000)
#=- 68 cmd(arg) = 18(0x00000820)
#=- 69 cmd(arg) = 12(0x00000000)
#=- 70 cmd(arg) = 18(0x00001830)
#=- 71 cmd(arg) = 12(0x00000000)
#=- 72 cmd(arg) = 18(0x00001838)
#=- 73 cmd(arg) = 12(0x00000000)
...
Two things stick out for me (so far) between successful bios and unsuccessful Linux command(argument) sequences:
ACMD41
(0x70ff8000
is hardcoded in liblitesdcard/sdcard.c
in LiteX; no idea so far where/how Linux determines its own argument value):
acmd41 arg:
bios: 0x70ff8000
linux: 0x40300000
^ ^----^
| |
| + Vdd voltage window
|
+-- bios: HCS/OCR, eSD, XPC
|
+--linux: HCS/OCR, n/a, n/a
sdcard_init()
routine sends CMD16
(as sdcard_app_set_blocklen(512)
); There is no corresponding CMD16
in the Linux trace. Not super clear what the default value would be -- looking into how I'd convince Linux to issue that command explicitly, to test whether it would make a difference.EDIT1: forcing an additional CMD16
here does NOT seem to help, getting the same timeout.
EDIT2: brute-forcing 0x70ff8000
as the argument to CMD41
also doesn't appear to help.
So, there must be something else that causes the card to "hang" under Linux...
I have a few days to look at this. One thing I also notice is that the first CMD18
in linux 18(0x00000000)
may not be transferring data at all as there is no corresponding stop 12(0x00000000)
command.
So, it may be that with the first CMD18
, we are already going wrong.
I am not making much progress, as I seem to be seeing similar issues in the BIOS when running mor1kx
. Transfers are hanging after the first few blocks.
My setup is here: https://github.com/stffrdhrn/or1k-utils/blob/master/litex/litex.build
Instructions, download my setup script and change variables to point to your litex path's if needed. Prereq:
JD
port Then run the script:
./litex.build --build
./litex.build --load
I have added some debug statements and am seeing.
m0, b7: |00000000000000000000000000000000| delays: -
best: m0, b01 delays: 20+-07
m1, b0: |00000000000000000000000000000000| delays: -
m1, b1: |00000000000001111111111111110000| delays: 20+-07
m1, b2: |00000000000000000000000000000111| delays: 30+-01
m1, b3: |00000000000000000000000000000000| delays: -
m1, b4: |00000000000000000000000000000000| delays: -
m1, b5: |00000000000000000000000000000000| delays: -
m1, b6: |00000000000000000000000000000000| delays: -
m1, b7: |00000000000000000000000000000000| delays: -
best: m1, b01 delays: 20+-07
Switching SDRAM to hardware control.
Memtest at 0x000000 (2MiB)...
Write: 0x0-0x200000 2MiB
Read: 0x0-0x200000 2MiB
Memtest OK
Memspeed at 0x000000 (2MiB)...
Write speed: 37MiB/s
Read speed: 25MiB/s
--============== Boot ==================--
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
Timeout
Booting from SDCard in SD-Mode...
Booting from boot.json...
Setting SDCard clk freq to 390 KHz
Setting SDCard clk freq to 12 MHz
sdcard init done!
disk_initialize: sdcardstatus = 0 (0 means ready)
disk_read: reading block = 0, count = 1
check_fs: Bad sig got: ffff not aa55 # fatFS - reading block 0 first time failed
find_volume: check_fs result 3
mount_volume: volume format 3
sdcardboot_from_json: mount failed! (err: 13)
Booting from boot.bin...
disk_initialize: sdcardstatus = 0 (0 means ready)
disk_read: reading block = 0, count = 1
check_fs: Good sig: aa55 # fatFS - reading block 0 second time OK
check_fs: Jumpboot: fa
find_volume: check_fs result 2
disk_read: reading block = 63, count = 1 # fatFS - reading block 63 OK
check_fs: Good sig: aa55
mount_volume: volume format 0
mount_volume: complete
disk_read: reading block = 591, count = 1
Copying boot.bin to 0x00000000 (6020716 bytes)...
[
disk_read: reading block = 3119, count = 16 # sdcard - reading stops here
Another note, If I add some printfs into sdcard_read
the multiblock reads are successful and it cat boot, however the first read of boot.json
still fails.
So this seems to show to issues
0xffff
, is read when we access block 0 the first timeAttached the bitstream as mentioned above. It contains the bios with my debug statements as well. I can try to remove jtagbone and the bios patches to make it similar to what you can produce.
I am wondering if my 0xffff
issue could be something to do with the sram, maybe software stack corruption or actually an issue with my arty's sram? I doubt there is actually a hardware issue though as even with different bitstreams the same issue occurs, I expect with different bitstreams the sram cells wired in would be different.
For the json read, the FatFS fs->win
buffer is at sram address 0x50007940
. Writing to that always ends up with in 0xffff
in the final sector magic number bytes.
For the .bin read, the FatFS fs->win
buffer is at sram address 0x50007d54
. The writes to that address always are successful.
I am wondering if my
0xffff
issue could be something to do with the sram, maybe software stack corruption or actually an issue with my arty's sram? I doubt there is actually a hardware issue though as even with different bitstreams the same issue occurs, I expect with different bitstreams the sram cells wired in would be different.
This seems wrong. I can see the data being read from the sdcard is showing 0xffff
.
From the litescope capture capturing when the DMA is writing that last few words of the block to memory: litescope_cli -v sdblock2mem_wishbonedmawriter_offset_status 125
sdcore is outputting FF
and dma is writing FFFFFFFF
to sram.
Compare this to the first few words of the DMA transfer: litescope_cli -v sdblock2mem_wishbonedmawriter_offset_status 1
sdcore outputs some data, it looks similar to block 0 of the sdcard but doesn't match.
This is the content of block one on my sdcard. The below is from the bios command line which works fine. During boot the dma copy seems to be copying incorrect data.
litex> sdcard_read 0
sdcard_read: read multiple b: 0 c: 1 to buf: 50007d54
Memory dump:
0x50007d54 fa b8 00 10 8e d0 bc 00 b0 b8 00 00 8e d8 8e c0 ................
0x50007d64 fb be 00 7c bf 00 06 b9 00 02 f3 a4 ea 21 06 00 ...|.........!..
0x50007d74 00 be be 07 38 04 75 0b 83 c6 10 81 fe fe 07 75 ....8.u........u
0x50007d84 f3 eb 16 b4 02 b0 01 bb 00 7c b2 80 8a 74 01 8b .........|...t..
0x50007d94 4c 02 cd 13 ea 00 7c 00 00 eb fe 00 00 00 00 00 L.....|.........
0x50007da4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007db4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007dc4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007dd4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007de4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007df4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007e04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007e14 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007e24 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007e34 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007e44 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007e54 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007e64 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007e74 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007e84 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007e94 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007ea4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007eb4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007ec4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007ed4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007ee4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007ef4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x50007f04 00 00 00 00 00 00 00 00 0b 84 7a 60 00 00 80 01 ..........z`....
0x50007f14 01 00 0b fe c2 ff 3f 00 00 00 02 42 0f 00 00 cc ......?....B....
0x50007f24 04 41 82 a3 05 83 41 42 0f 00 40 42 0f 00 00 b1 .A....AB..@B....
0x50007f34 21 83 83 ec fc ea 00 88 1e 00 00 f0 c9 00 00 00 !...............
0x50007f44 00 00 00 00 00 00 00 00 00 00 00 00 00 00 55 aa ..............U.
If I dump the memory in sram where dma copied block 0 we see the data that matches what litescope captured. So it seems my capture is correct. However, the data doesn't match anything in the sdcard, I'll try to capture the PHY, the data so different reminds me of a clock-domain crossing issue.
litex.run $PWD/dump-mem.py 0x50007940 512
Reading memory...
Connected with 127.0.0.1:32974
50007940: 00c88001 80018001 8001c001 80030000
50007950: 01000000 00000000 00000000 00000000
50007960: 00000000 00000000 00000000 00000000
50007970: 00000000 00000000 00000000 00000000
50007980: 4053dfb4 2cffc64c ffffffff ffffffff
50007990: ffffffff ffffffff ffffffff ffffffff
500079a0: ffffffff ffffffff ffffffff ffffffff
500079b0: ffffffff ffffffff ffffffff ffffffff
...
The multi-block transfer issue in the bios is fixed by putting a bit of sleep between the read command and the polling loop. Here is my patch with debug statements and all. I wonder if trying somethig similar in linux would work.
The 10us sleep is completely arbitrary.
diff --git a/litex/soc/software/liblitesdcard/sdcard.c b/litex/soc/software/liblitesdcard/sdcard.c
index 1cb74b4a..8cb7c683 100644
--- a/litex/soc/software/liblitesdcard/sdcard.c
+++ b/litex/soc/software/liblitesdcard/sdcard.c
@@ -482,11 +491,18 @@ void sdcard_read(uint32_t block, uint32_t count, uint8_t* buf)
#ifdef SDCARD_CMD23_SUPPORT
sdcard_set_block_count(count);
#endif
+ //printf ("sdcard_read: read multiple b: %d c: %d to buf: %lx\n", block, count, (unsigned long) buf);
sdcard_read_multiple_block(block, count);
+ // Uncommenting this line fixes multiple dma transfer
+ // printf ("sdcard_read: wait dma done\n");
+ // Trying similar with a sleep, this works too
+ busy_wait_us(10);
+
/* Wait for DMA Writer to complete */
while ((sdblock2mem_dma_done_read() & 0x1) == 0);
+ //printf ("sdcard_read: stop transition\n");
sdcard_stop_transmission();
#ifndef CONFIG_CPU_HAS_DMA_BUS
@stffrdhrn: AFAICT, a typical sdcard command happens in three stages:
cmdevt
register to go to 0x01
(any other bit set, and there's an error -- e.g., timeout)dataevt
register to go to 0x01
(same deal as with cmdevt
above in terms of additional error bits)sdblk2mem_done
or sdmem2blk_done
, depending on whether it's a read or write)Your fix comes in at step 3, but what I encountered on Linux was an error at step 1 (where cmdevt
contained a timeout bit, making anything that might happen afterwards irrelevant). So chances are we might be dealing with different bugs, or different manifestations of some common underlying problem...
Note, I can also workaround the boot from json issue with the following patch. So for me something seems to fail with the first block read. Adding sleep before the first block read doesn't seem to work.
diff --git a/litex/soc/software/bios/boot.c b/litex/soc/software/bios/boot.c
index 501b8a4d..1f7910b3 100644
--- a/litex/soc/software/bios/boot.c
+++ b/litex/soc/software/bios/boot.c
@@ -758,6 +760,9 @@ void sdcardboot(void)
/* Boot from boot.json */
printf("Booting from boot.json...\n");
sdcardboot_from_json("boot.json");
+ printf("Booting from boot.json...\n");
+ sdcardboot_from_json("boot.json");
+
/* Boot from boot.bin */
printf("Booting from boot.bin...\n");
I'm still trying to understand the LiteSDCard gateware FSMs, as directed by chasing down the bug I'm encountering on 64-bit rocket Linux (i.e., the command response timeout I get whenever I try using cmd18 multi-block reads).
The symptom, as far as I was able to trace things before getting stuck, is that during the command-read phase, the cmdr module's source's valid signal goes high for precisely a single clock cycle and then never again during the SDPHYCMDR FSM's CMD state until the timeout counter runs out. I tried enforcing that cmdr.source_valid
should be asserted for two (or three) cycles before we move to the CMD state, but ended up timing out in the WAIT state instead -- proof that the signal is only asserted once and then goes away.
I'm trying to wrap my head around the intricacies of the stream _UpConverter and PipeValid modules used by SDPHYCMDR.cmdr a.k.a. SDPHYR but, in the mean time, any idea as to why cmdr would assert source.valid for only a single cycle, and then turn it right back off (specifically during a cmd-18, and not some other sdcard command), would probably help a lot toward getting me unstuck...
BTW, I also tried removing the PipeValid
(a.k.a. buf portion of the SDPHYR pipeline), and while things sort-of limped along (boot from sdcard in the bios got stuck halfway through), the Linux driver timed out in exactly the same spot as always, so that didn't make any (positive) difference -- from a quick skim through the sources, the PipeValid
stage seemed like a timing optimization that I could in theory remove without completely breaking things. I might have gotten that wrong, in which case let's forget I even tried :)
Also, any pointers to whatever documentation was used as the basis for LiteSDCard's behavior would be super useful, as I feel the only useful thing I can do ATM is study them in hope of finding some new thread to proverbially pull at... :)
Hi @gsomlo, @stffrdhrn,
thanks a lot for looking at this and sorry for not having very reactive. I had a look at it this morning and the issue @stffrdhrn is seeing with boot.json failing and multi-block transfer with the BIOS seems to be caused by setting cmd_done
early in https://github.com/enjoy-digital/litesdcard/pull/23. I'm still not sure why this change has a functional effect, but it could be that sdcard_app_send_scr()
is returning data at the end of the initialization and we are not waiting it to be received before sending the next command (And I think the SCR data were the first corrupted data you were seeing @stffrdhrn with LiteScope).
I'm going to have another look at this, but for now I already reverted the change, also added some comments to core.py
in the code and CSR description with:
@gsomlo: I'll have another look later at your other questions, maybe we should try to work together directly on this to speed up things.
@enjoy-digital , Thank's I confirm with your fixes and all of my patches removed (current litex upstream) BIOS sdcard boots. I will continue to help with investigating @gsomlo's findings for the multiblock support in the kernel now,
...
Memspeed at 0x000000 (2MiB)...
Write speed: 37MiB/s
Read speed: 25MiB/s
--============== Boot ==================--
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
Timeout
Booting from SDCard in SD-Mode...
Booting from boot.json...
Copying boot.bin to 0x00000000 (6020716 bytes)...
[########################################]
Copying litex-mor1kx.dtb to 0x007f0000 (1495 bytes)...
[########################################]
Copying rootfs.cpio.gz to 0x00800000 (12396156 bytes)...
[########################################]
Executing booted program at 0x00000000
--============= Liftoff! ===============--
[ 0.000000] FDT at (ptrval)
[ 0.000000] Linux version 5.12.0-rc2-00003-g316b0cfb6805 (shorne@lianli.shorne-pla.net) (or1k-elf-gcc (GCC) 9.0.1 20190409 (experimental), GNU ld (GNU Binutils) 2.32.51.20190425) #445 Wed Mar 10 13:37:37 JST 2021
[ 0.000000] CPU: OpenRISC-10 (revision 0) @100 MHz
...
Thanks for the test @stffrdhrn, I'll also try to help for the multiblock support. Feel free to continue sharing your findings here, it's very useful to understand the issue and work on it together.
@gsomlo: I think this has been solved with all the recent improvements? Do you still see limitations on this or should we close?
@enjoy-digital @stffrdhrn @paulusmack : as of litesdcard commit 194a58c7fbeb5437ce54e363427ed873c1bc257a, multi-block reads and writes work fine at the full sdclock frequency requested by the linux driver (and I have also removed the additional divider-shift workaround from the driver in the litex-rebase branch). Thanks again to everyone involved -- closing.
Thanks very much!
Reproducer:
nexys4ddr
board (would be curious if a similar issue occurs on LiteX SoCs built with other CPUs, e.g. vexriscv).max_blk_count = 2
here.Actual Result:
command timeout
during multi-block (cmd-18
) reads from the sdcard, which eventually result in the filesystem layer giving up with a read error.Expected result: similar to the bare-metal bios (and regardless of potentially different timing of actual requests, and less efficient collection and/or setup of DMA buffers), SDCard reads and writes should complete without error :)
NOTE: I'll edit this as I dig deeper into the PHY FSMs, unless I get pre-empted by someone else figuring out what the problem is... @enjoy-digital @stffrdhrn -- PTAL.