enjoy-digital / litesdcard

Small footprint and configurable SDCard core
Other
107 stars 35 forks source link

Linux timeouts when enabling multi-block transfers (e.g., cmd18) #22

Closed gsomlo closed 3 years ago

gsomlo commented 3 years ago

Reproducer:

Actual Result:

Expected result: similar to the bare-metal bios (and regardless of potentially different timing of actual requests, and less efficient collection and/or setup of DMA buffers), SDCard reads and writes should complete without error :)

NOTE: I'll edit this as I dig deeper into the PHY FSMs, unless I get pre-empted by someone else figuring out what the problem is... @enjoy-digital @stffrdhrn -- PTAL.

stffrdhrn commented 3 years ago

Stack trace for me after this patch on mor1kx:

Branch: https://github.com/stffrdhrn/linux/commits/or1k-5.12-litex-net

diff --git a/drivers/mmc/host/litex_mmc.c b/drivers/mmc/host/litex_mmc.c
index 9bdc088dd100..1ef1fc9d7560 100644
--- a/drivers/mmc/host/litex_mmc.c
+++ b/drivers/mmc/host/litex_mmc.c
@@ -411,11 +412,12 @@ static int litex_mmc_probe(struct platform_device *pdev)
         * However, using multi-block transfers results in DMA timeout errors,
         * which is something that requires further investigation!
         */
+#if 0
        mmc->max_blk_count = 1; /* only single-block transfers currently work */
        /* recalculate max_[req,seg]_size given new max_blk_count */
        mmc->max_req_size = mmc->max_blk_size * mmc->max_blk_count;
        mmc->max_seg_size = mmc->max_req_size;
-
+#endif
        host = mmc_priv(mmc);
        host->mmc = mmc;
        host->dev = pdev;

Crash

# [  472.070000] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1H:28]
[  472.070000] CPU: 0 PID: 28 Comm: kworker/0:1H Not tainted 5.11.0-rc3-00014-gb563c3e3c504-dirty #436
[  472.070000] Workqueue: kblockd blk_mq_run_work_fn
[  472.070000] CPU #: 0
[  472.070000]    PC: c0322f0c    SR: 0000807f    SP: c1905c68
[  472.070000] GPR00: 00000000 GPR01: c1905c68 GPR02: c1905c70 GPR03: 00000000
[  472.070000] GPR04: 00000001 GPR05: 00000000 GPR06: 00000000 GPR07: 00000002
[  472.070000] GPR08: 00000001 GPR09: c0323024 GPR10: c1904000 GPR11: 000001f4
[  472.070000] GPR12: 00f98300 GPR13: 00000000 GPR14: 04288000 GPR15: 00000000
[  472.070000] GPR16: bc01801c GPR17: 00000018 GPR18: 000053e3 GPR19: fdefadb9
[  472.070000] GPR20: 00000001 GPR21: 00077c40 GPR22: 00000001 GPR23: 000001e0
[  472.070000] GPR24: c4284270 GPR25: 00000005 GPR26: 00000002 GPR27: 00000005
[  472.070000] GPR28: 00002000 GPR29: 00000000 GPR30: 00000002 GPR31: 00000008
[  472.070000]   RES: 000001f4 oGPR11: ffffffff
[  472.070000] Process kworker/0:1H (pid: 28, stackpage=c1900000)
[  472.070000]
[  472.070000] Stack:
[  472.070000] Call trace:
[  472.070000] [<(ptrval)>] __const_udelay+0xb4/0xc4
[  472.070000] [<(ptrval)>] sdcard_wait_done+0x34/0x94
[  472.070000] [<(ptrval)>] send_cmd+0x8c/0x228
[  472.070000] [<(ptrval)>] litex_request+0x1bc/0x400
[  472.070000] [<(ptrval)>] __mmc_start_request+0x10c/0x170
[  472.070000] [<(ptrval)>] mmc_start_request+0xc8/0xe8
[  472.070000] [<(ptrval)>] mmc_blk_mq_issue_rq+0x7e8/0x8a0
[  472.070000] [<(ptrval)>] ? sbitmap_get+0xe0/0x13c
[  472.070000] [<(ptrval)>] mmc_mq_queue_rq+0x250/0x2ec
[  472.070000] [<(ptrval)>] blk_mq_dispatch_rq_list+0x49c/0x67c
[  472.070000] [<(ptrval)>] ? elv_rb_del+0x74/0x90
[  472.070000] [<(ptrval)>] ? elv_rqhash_del+0x30/0x48
[  472.070000] [<(ptrval)>] blk_mq_do_dispatch_sched+0x2f4/0x30c
[  472.070000] [<(ptrval)>] ? sched_rq_cmp+0x0/0x38
[  472.070000] [<(ptrval)>] ? update_cfs_rq_load_avg+0x1a0/0x1d4
[  472.070000] [<(ptrval)>] __blk_mq_sched_dispatch_requests+0x110/0x198
[  472.070000] [<(ptrval)>] blk_mq_sched_dispatch_requests+0x60/0xac
[  472.070000] [<(ptrval)>] __blk_mq_run_hw_queue+0x98/0xb8
[  472.070000] [<(ptrval)>] blk_mq_run_work_fn+0x2c/0x44
[  472.070000] [<(ptrval)>] process_one_work+0x208/0x2ac
[  472.070000] [<(ptrval)>] worker_thread+0x32c/0x3dc
[  472.070000] [<(ptrval)>] ? worker_thread+0x0/0x3dc
[  472.070000] [<(ptrval)>] kthread+0x130/0x138
[  472.070000] [<(ptrval)>] ? kthread+0x0/0x138
[  472.070000] [<(ptrval)>] ret_from_fork+0x1c/0x9c
enjoy-digital commented 3 years ago

@gsomlo: The linux driver is also probably re-configuring the SDCard, so I'm wondering if the different behaviour could also be related to a different configuration of the SDCard. Doing capture of the SDCard signals would help on this, I'll try to do some soon.

gsomlo commented 3 years ago

So during the first "bad" cmd18, this is never true, not even once.

source.ready is asserted by SDCORE, so it should be true throughout the entire CMD state of SDPHYCMDR.

Which means that source.valid (provided by SDPHYR drops back to 0 after having been asserted long enough to get us out of the WAIT state of SDPHYCMDR.

The question is, why would it do that... :)

gsomlo commented 3 years ago

The linux driver is also probably re-configuring the SDCard

Here's the list of command(argument) values being sent to the SDCard by Linux, all the way through to the first failing cmd18 (it's actually the second overall cmd18 that fails):

[    3.074876]  0: cmd(arg) =   0(0x00000000)
[    3.076771]  1: cmd(arg) =   8(0x000001aa)
[    3.077807]  2: cmd(arg) =  55(0x00000000)
[    3.079627]  3: cmd(arg) =  41(0x00000000)
[    3.080656]  4: cmd(arg) =   0(0x00000000)
[    3.082441]  5: cmd(arg) =   8(0x000001aa)
[    3.084448]  6: cmd(arg) =  55(0x00000000)
[    3.085649]  7: cmd(arg) =  41(0x40300000)
[    3.087543]  8: cmd(arg) =  55(0x00000000)
[    3.088605]  9: cmd(arg) =  41(0x40300000)
[    3.090520] 10: cmd(arg) =  55(0x00000000)
[    3.091632] 11: cmd(arg) =  41(0x40300000)
[    3.093553] 12: cmd(arg) =   2(0x00000000)
[    3.094624] 13: cmd(arg) =   3(0x00000000)
[    3.096440] 14: cmd(arg) =   9(0xd67d0000)
[    3.097463] 15: cmd(arg) =   7(0xd67d0000)
[    3.099271] 16: cmd(arg) =  55(0xd67d0000)
[    3.100245] 17: cmd(arg) =   6(0x00000002)
[    3.102291] 18: cmd(arg) =  55(0xd67d0000)
[    3.103495] 19: cmd(arg) =  51(0x00000000)
[    3.104562] 20: cmd(arg) =  55(0xd67d0000)
[    3.106389] 21: cmd(arg) =  13(0x00000000)
[    3.107511] 22: cmd(arg) =   6(0x00fffff0)
[    3.109358] 23: cmd(arg) =  55(0xd67d0000)
[    3.110555] 24: cmd(arg) =   6(0x00000002)
[    3.112438] 25: cmd(arg) =  18(0x00000000)
[    3.113659] 26: cmd(arg) =  18(0x00000002)   <--- This one is the first to fail with a timeout!

@enjoy-digital: any evidence in that sequence of Linux doing something drastically different from the LiteX bios, in terms of configuring the SDCard?

gsomlo commented 3 years ago

and, in contrast with the timeout errors encountered with multi-block reads (cmd18) in linux, here's a log of the command sequence issued by the LiteX bios while successfully loading bios.bin from the sdcard with the sdcardboot command:

litex> sdcardboot
Booting from SDCard in SD-Mode...
Booting from boot.json...
#=-  0 cmd(arg) =  0(0x00000000)
#=-  1 cmd(arg) =  8(0x000001aa)
#=-  2 cmd(arg) = 55(0x00000000)
#=-  3 cmd(arg) = 41(0x70ff8000)
#=-  4 cmd(arg) = 55(0x00000000)
#=-  5 cmd(arg) = 41(0x70ff8000)
#=-  6 cmd(arg) = 55(0x00000000)
#=-  7 cmd(arg) = 41(0x70ff8000)
#=-  8 cmd(arg) = 55(0x00000000)
#=-  9 cmd(arg) = 41(0x70ff8000)
#=- 10 cmd(arg) = 55(0x00000000)
#=- 11 cmd(arg) = 41(0x70ff8000)
#=- 12 cmd(arg) = 55(0x00000000)
#=- 13 cmd(arg) = 41(0x70ff8000)
#=- 14 cmd(arg) = 55(0x00000000)
#=- 15 cmd(arg) = 41(0x70ff8000)
#=- 16 cmd(arg) = 55(0x00000000)
#=- 17 cmd(arg) = 41(0x70ff8000)
#=- 18 cmd(arg) = 55(0x00000000)
#=- 19 cmd(arg) = 41(0x70ff8000)
#=- 20 cmd(arg) =  2(0x00000000)
#=- 21 cmd(arg) =  3(0x00000000)
#=- 22 cmd(arg) = 10(0xd67d0000)
#=- 23 cmd(arg) =  9(0xd67d0000)
#=- 24 cmd(arg) =  7(0xd67d0000)
#=- 25 cmd(arg) = 55(0xd67d0000)
#=- 26 cmd(arg) =  6(0x00000002)
#=- 27 cmd(arg) =  6(0x80fffff1)
#=- 28 cmd(arg) = 55(0xd67d0000)
#=- 29 cmd(arg) = 51(0x00000000)
#=- 30 cmd(arg) = 16(0x00000200)
Booting from boot.bin...
#=- 31 cmd(arg) =  0(0x00000000)
#=- 32 cmd(arg) =  8(0x000001aa)
#=- 33 cmd(arg) = 55(0x00000000)
#=- 34 cmd(arg) = 41(0x70ff8000)
#=- 35 cmd(arg) = 55(0x00000000)
#=- 36 cmd(arg) = 41(0x70ff8000)
#=- 37 cmd(arg) = 55(0x00000000)
#=- 38 cmd(arg) = 41(0x70ff8000)
#=- 39 cmd(arg) = 55(0x00000000)
#=- 40 cmd(arg) = 41(0x70ff8000)
#=- 41 cmd(arg) = 55(0x00000000)
#=- 42 cmd(arg) = 41(0x70ff8000)
#=- 43 cmd(arg) = 55(0x00000000)
#=- 44 cmd(arg) = 41(0x70ff8000)
#=- 45 cmd(arg) = 55(0x00000000)
#=- 46 cmd(arg) = 41(0x70ff8000)
#=- 47 cmd(arg) = 55(0x00000000)
#=- 48 cmd(arg) = 41(0x70ff8000)
#=- 49 cmd(arg) =  2(0x00000000)
#=- 50 cmd(arg) =  3(0x00000000)
#=- 51 cmd(arg) = 10(0xd67d0000)
#=- 52 cmd(arg) =  9(0xd67d0000)
#=- 53 cmd(arg) =  7(0xd67d0000)
#=- 54 cmd(arg) = 55(0xd67d0000)
#=- 55 cmd(arg) =  6(0x00000002)
#=- 56 cmd(arg) =  6(0x80fffff1)
#=- 57 cmd(arg) = 55(0xd67d0000)
#=- 58 cmd(arg) = 51(0x00000000)
#=- 59 cmd(arg) = 16(0x00000200)
#=- 60 cmd(arg) = 18(0x00000000)
#=- 61 cmd(arg) = 12(0x00000000)
#=- 62 cmd(arg) = 18(0x00000800)
#=- 63 cmd(arg) = 12(0x00000000)
#=- 64 cmd(arg) = 18(0x00001820)
#=- 65 cmd(arg) = 12(0x00000000)
Copying boot.bin to 0x80000000 (15701872 bytes)...
#=- 66 cmd(arg) = 18(0x00001828)
#=- 67 cmd(arg) = 12(0x00000000)
#=- 68 cmd(arg) = 18(0x00000820)
#=- 69 cmd(arg) = 12(0x00000000)
#=- 70 cmd(arg) = 18(0x00001830)
#=- 71 cmd(arg) = 12(0x00000000)
#=- 72 cmd(arg) = 18(0x00001838)
#=- 73 cmd(arg) = 12(0x00000000)
...
gsomlo commented 3 years ago

Two things stick out for me (so far) between successful bios and unsuccessful Linux command(argument) sequences:

  1. Differences in the argument value for ACMD41 (0x70ff8000 is hardcoded in liblitesdcard/sdcard.c in LiteX; no idea so far where/how Linux determines its own argument value):
    acmd41 arg:
    bios:  0x70ff8000
    linux: 0x40300000
            ^ ^----^
            | |
            | + Vdd voltage window
            | 
            +-- bios: HCS/OCR, eSD, XPC
            |
            +--linux: HCS/OCR, n/a, n/a
  2. The bios sdcard_init() routine sends CMD16 (as sdcard_app_set_blocklen(512)); There is no corresponding CMD16 in the Linux trace. Not super clear what the default value would be -- looking into how I'd convince Linux to issue that command explicitly, to test whether it would make a difference.

EDIT1: forcing an additional CMD16 here does NOT seem to help, getting the same timeout.

EDIT2: brute-forcing 0x70ff8000 as the argument to CMD41 also doesn't appear to help.

So, there must be something else that causes the card to "hang" under Linux...

stffrdhrn commented 3 years ago

I have a few days to look at this. One thing I also notice is that the first CMD18 in linux 18(0x00000000) may not be transferring data at all as there is no corresponding stop 12(0x00000000) command.

So, it may be that with the first CMD18 , we are already going wrong.

stffrdhrn commented 3 years ago

I am not making much progress, as I seem to be seeing similar issues in the BIOS when running mor1kx. Transfers are hanging after the first few blocks.

My setup is here: https://github.com/stffrdhrn/or1k-utils/blob/master/litex/litex.build

Instructions, download my setup script and change variables to point to your litex path's if needed. Prereq:

Then run the script:

I have added some debug statements and am seeing.

  m0, b7: |00000000000000000000000000000000| delays: -
  best: m0, b01 delays: 20+-07
  m1, b0: |00000000000000000000000000000000| delays: -
  m1, b1: |00000000000001111111111111110000| delays: 20+-07
  m1, b2: |00000000000000000000000000000111| delays: 30+-01
  m1, b3: |00000000000000000000000000000000| delays: -
  m1, b4: |00000000000000000000000000000000| delays: -
  m1, b5: |00000000000000000000000000000000| delays: -
  m1, b6: |00000000000000000000000000000000| delays: -
  m1, b7: |00000000000000000000000000000000| delays: -
  best: m1, b01 delays: 20+-07
Switching SDRAM to hardware control.
Memtest at 0x000000 (2MiB)...
  Write: 0x0-0x200000 2MiB    
   Read: 0x0-0x200000 2MiB    
Memtest OK
Memspeed at 0x000000 (2MiB)...
  Write speed: 37MiB/s
   Read speed: 25MiB/s

--============== Boot ==================--
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
Timeout
Booting from SDCard in SD-Mode...
Booting from boot.json...
  Setting SDCard clk freq to 390 KHz
  Setting SDCard clk freq to 12 MHz
  sdcard init done!
  disk_initialize: sdcardstatus = 0 (0 means ready)
  disk_read: reading block = 0, count = 1
    check_fs: Bad sig got: ffff not aa55                  # fatFS - reading block 0 first time failed 
    find_volume: check_fs result 3
    mount_volume: volume format 3
  sdcardboot_from_json: mount failed! (err: 13)
Booting from boot.bin...
  disk_initialize: sdcardstatus = 0 (0 means ready)
  disk_read: reading block = 0, count = 1
    check_fs: Good sig: aa55                             # fatFS - reading block 0 second time OK
    check_fs: Jumpboot: fa
    find_volume: check_fs result 2
  disk_read: reading block = 63, count = 1               # fatFS - reading block 63 OK
    check_fs: Good sig: aa55
    mount_volume: volume format 0
    mount_volume: complete
  disk_read: reading block = 591, count = 1
Copying boot.bin to 0x00000000 (6020716 bytes)...
[
  disk_read: reading block = 3119, count = 16            # sdcard - reading stops here

Another note, If I add some printfs into sdcard_read the multiblock reads are successful and it cat boot, however the first read of boot.json still fails.

So this seems to show to issues

  1. Multiblock timeout may also be happening in bios on or1k
  2. Bad data, 0xffff, is read when we access block 0 the first time
stffrdhrn commented 3 years ago

Attached the bitstream as mentioned above. It contains the bios with my debug statements as well. I can try to remove jtagbone and the bios patches to make it similar to what you can produce.

mor1kx-sdcard-jtagbone-arty.bit.gz

stffrdhrn commented 3 years ago

I am wondering if my 0xffff issue could be something to do with the sram, maybe software stack corruption or actually an issue with my arty's sram? I doubt there is actually a hardware issue though as even with different bitstreams the same issue occurs, I expect with different bitstreams the sram cells wired in would be different.

For the json read, the FatFS fs->win buffer is at sram address 0x50007940. Writing to that always ends up with in 0xffff in the final sector magic number bytes. For the .bin read, the FatFS fs->win buffer is at sram address 0x50007d54. The writes to that address always are successful.

stffrdhrn commented 3 years ago

I am wondering if my 0xffff issue could be something to do with the sram, maybe software stack corruption or actually an issue with my arty's sram? I doubt there is actually a hardware issue though as even with different bitstreams the same issue occurs, I expect with different bitstreams the sram cells wired in would be different.

This seems wrong. I can see the data being read from the sdcard is showing 0xffff.

From the litescope capture capturing when the DMA is writing that last few words of the block to memory: litescope_cli -v sdblock2mem_wishbonedmawriter_offset_status 125

sdcore is outputting FF and dma is writing FFFFFFFF to sram.

image

Compare this to the first few words of the DMA transfer: litescope_cli -v sdblock2mem_wishbonedmawriter_offset_status 1

sdcore outputs some data, it looks similar to block 0 of the sdcard but doesn't match.

image

This is the content of block one on my sdcard. The below is from the bios command line which works fine. During boot the dma copy seems to be copying incorrect data.

litex> sdcard_read 0
sdcard_read: read multiple b: 0 c: 1 to buf: 50007d54
Memory dump:
0x50007d54  fa b8 00 10 8e d0 bc 00 b0 b8 00 00 8e d8 8e c0  ................
0x50007d64  fb be 00 7c bf 00 06 b9 00 02 f3 a4 ea 21 06 00  ...|.........!..
0x50007d74  00 be be 07 38 04 75 0b 83 c6 10 81 fe fe 07 75  ....8.u........u
0x50007d84  f3 eb 16 b4 02 b0 01 bb 00 7c b2 80 8a 74 01 8b  .........|...t..
0x50007d94  4c 02 cd 13 ea 00 7c 00 00 eb fe 00 00 00 00 00  L.....|.........
0x50007da4  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007db4  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007dc4  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007dd4  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007de4  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007df4  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007e04  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007e14  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007e24  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007e34  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007e44  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007e54  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007e64  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007e74  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007e84  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007e94  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007ea4  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007eb4  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007ec4  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007ed4  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007ee4  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007ef4  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x50007f04  00 00 00 00 00 00 00 00 0b 84 7a 60 00 00 80 01  ..........z`....
0x50007f14  01 00 0b fe c2 ff 3f 00 00 00 02 42 0f 00 00 cc  ......?....B....
0x50007f24  04 41 82 a3 05 83 41 42 0f 00 40 42 0f 00 00 b1  .A....AB..@B....
0x50007f34  21 83 83 ec fc ea 00 88 1e 00 00 f0 c9 00 00 00  !...............
0x50007f44  00 00 00 00 00 00 00 00 00 00 00 00 00 00 55 aa  ..............U.

If I dump the memory in sram where dma copied block 0 we see the data that matches what litescope captured. So it seems my capture is correct. However, the data doesn't match anything in the sdcard, I'll try to capture the PHY, the data so different reminds me of a clock-domain crossing issue.

litex.run $PWD/dump-mem.py 0x50007940 512
Reading memory...
Connected with 127.0.0.1:32974

50007940: 00c88001 80018001 8001c001 80030000
50007950: 01000000 00000000 00000000 00000000
50007960: 00000000 00000000 00000000 00000000
50007970: 00000000 00000000 00000000 00000000
50007980: 4053dfb4 2cffc64c ffffffff ffffffff
50007990: ffffffff ffffffff ffffffff ffffffff
500079a0: ffffffff ffffffff ffffffff ffffffff
500079b0: ffffffff ffffffff ffffffff ffffffff
...
stffrdhrn commented 3 years ago

The multi-block transfer issue in the bios is fixed by putting a bit of sleep between the read command and the polling loop. Here is my patch with debug statements and all. I wonder if trying somethig similar in linux would work.

The 10us sleep is completely arbitrary.

diff --git a/litex/soc/software/liblitesdcard/sdcard.c b/litex/soc/software/liblitesdcard/sdcard.c
index 1cb74b4a..8cb7c683 100644
--- a/litex/soc/software/liblitesdcard/sdcard.c
+++ b/litex/soc/software/liblitesdcard/sdcard.c
@@ -482,11 +491,18 @@ void sdcard_read(uint32_t block, uint32_t count, uint8_t* buf)
 #ifdef SDCARD_CMD23_SUPPORT
        sdcard_set_block_count(count);
 #endif
+       //printf ("sdcard_read: read multiple b: %d c: %d to buf: %lx\n", block, count, (unsigned long) buf);
        sdcard_read_multiple_block(block, count);

+       // Uncommenting this line fixes multiple dma transfer
+       // printf ("sdcard_read: wait dma done\n");
+       // Trying similar with a sleep, this works too
+       busy_wait_us(10);
+
        /* Wait for DMA Writer to complete */
        while ((sdblock2mem_dma_done_read() & 0x1) == 0);

+       //printf ("sdcard_read: stop transition\n");
        sdcard_stop_transmission();

 #ifndef CONFIG_CPU_HAS_DMA_BUS
gsomlo commented 3 years ago

@stffrdhrn: AFAICT, a typical sdcard command happens in three stages:

  1. send comand+argument; wait for cmdevt register to go to 0x01 (any other bit set, and there's an error -- e.g., timeout)
  2. if command expects data to be transferred, also wait for dataevt register to go to 0x01 (same deal as with cmdevt above in terms of additional error bits)
  3. finally, busy-wait for the DMA transfer to complete (either sdblk2mem_done or sdmem2blk_done, depending on whether it's a read or write)

Your fix comes in at step 3, but what I encountered on Linux was an error at step 1 (where cmdevt contained a timeout bit, making anything that might happen afterwards irrelevant). So chances are we might be dealing with different bugs, or different manifestations of some common underlying problem...

stffrdhrn commented 3 years ago

Note, I can also workaround the boot from json issue with the following patch. So for me something seems to fail with the first block read. Adding sleep before the first block read doesn't seem to work.

diff --git a/litex/soc/software/bios/boot.c b/litex/soc/software/bios/boot.c
index 501b8a4d..1f7910b3 100644
--- a/litex/soc/software/bios/boot.c
+++ b/litex/soc/software/bios/boot.c
@@ -758,6 +760,9 @@ void sdcardboot(void)
        /* Boot from boot.json */
        printf("Booting from boot.json...\n");
        sdcardboot_from_json("boot.json");
+       printf("Booting from boot.json...\n");
+       sdcardboot_from_json("boot.json");
+

        /* Boot from boot.bin */
        printf("Booting from boot.bin...\n");
gsomlo commented 3 years ago

I'm still trying to understand the LiteSDCard gateware FSMs, as directed by chasing down the bug I'm encountering on 64-bit rocket Linux (i.e., the command response timeout I get whenever I try using cmd18 multi-block reads).

The symptom, as far as I was able to trace things before getting stuck, is that during the command-read phase, the cmdr module's source's valid signal goes high for precisely a single clock cycle and then never again during the SDPHYCMDR FSM's CMD state until the timeout counter runs out. I tried enforcing that cmdr.source_valid should be asserted for two (or three) cycles before we move to the CMD state, but ended up timing out in the WAIT state instead -- proof that the signal is only asserted once and then goes away.

I'm trying to wrap my head around the intricacies of the stream _UpConverter and PipeValid modules used by SDPHYCMDR.cmdr a.k.a. SDPHYR but, in the mean time, any idea as to why cmdr would assert source.valid for only a single cycle, and then turn it right back off (specifically during a cmd-18, and not some other sdcard command), would probably help a lot toward getting me unstuck...

BTW, I also tried removing the PipeValid (a.k.a. buf portion of the SDPHYR pipeline), and while things sort-of limped along (boot from sdcard in the bios got stuck halfway through), the Linux driver timed out in exactly the same spot as always, so that didn't make any (positive) difference -- from a quick skim through the sources, the PipeValid stage seemed like a timing optimization that I could in theory remove without completely breaking things. I might have gotten that wrong, in which case let's forget I even tried :)

Also, any pointers to whatever documentation was used as the basis for LiteSDCard's behavior would be super useful, as I feel the only useful thing I can do ATM is study them in hope of finding some new thread to proverbially pull at... :)

enjoy-digital commented 3 years ago

Hi @gsomlo, @stffrdhrn,

thanks a lot for looking at this and sorry for not having very reactive. I had a look at it this morning and the issue @stffrdhrn is seeing with boot.json failing and multi-block transfer with the BIOS seems to be caused by setting cmd_done early in https://github.com/enjoy-digital/litesdcard/pull/23. I'm still not sure why this change has a functional effect, but it could be that sdcard_app_send_scr() is returning data at the end of the initialization and we are not waiting it to be received before sending the next command (And I think the SCR data were the first corrupted data you were seeing @stffrdhrn with LiteScope).

I'm going to have another look at this, but for now I already reverted the change, also added some comments to core.py in the code and CSR description with:

@gsomlo: I'll have another look later at your other questions, maybe we should try to work together directly on this to speed up things.

stffrdhrn commented 3 years ago

@enjoy-digital , Thank's I confirm with your fixes and all of my patches removed (current litex upstream) BIOS sdcard boots. I will continue to help with investigating @gsomlo's findings for the multiblock support in the kernel now,

...
Memspeed at 0x000000 (2MiB)...
  Write speed: 37MiB/s
   Read speed: 25MiB/s

--============== Boot ==================--
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
             Timeout
Booting from SDCard in SD-Mode...
Booting from boot.json...
Copying boot.bin to 0x00000000 (6020716 bytes)...
[########################################]
Copying litex-mor1kx.dtb to 0x007f0000 (1495 bytes)...
[########################################]
Copying rootfs.cpio.gz to 0x00800000 (12396156 bytes)...
[########################################]
Executing booted program at 0x00000000

--============= Liftoff! ===============--
[    0.000000] FDT at (ptrval)
[    0.000000] Linux version 5.12.0-rc2-00003-g316b0cfb6805 (shorne@lianli.shorne-pla.net) (or1k-elf-gcc (GCC) 9.0.1 20190409 (experimental), GNU ld (GNU Binutils) 2.32.51.20190425) #445 Wed Mar 10 13:37:37 JST 2021
[    0.000000] CPU: OpenRISC-10 (revision 0) @100 MHz
...
enjoy-digital commented 3 years ago

Thanks for the test @stffrdhrn, I'll also try to help for the multiblock support. Feel free to continue sharing your findings here, it's very useful to understand the issue and work on it together.

enjoy-digital commented 3 years ago

@gsomlo: I think this has been solved with all the recent improvements? Do you still see limitations on this or should we close?

gsomlo commented 3 years ago

@enjoy-digital @stffrdhrn @paulusmack : as of litesdcard commit 194a58c7fbeb5437ce54e363427ed873c1bc257a, multi-block reads and writes work fine at the full sdclock frequency requested by the linux driver (and I have also removed the additional divider-shift workaround from the driver in the litex-rebase branch). Thanks again to everyone involved -- closing.

stffrdhrn commented 3 years ago

Thanks very much!