atarijookie / ce-atari

Official repository for Cosmos Ex software
http://joo.kie.sk/?page_id=384
34 stars 9 forks source link

SD card data corruption #100

Open atarijookie opened 7 years ago

atarijookie commented 7 years ago

Reported by two users, that copying of 1 MB files results in the files being damaged. One mentioned that he was copying from CE shared drive, so in the end it might also be bad read from CE translated drive or something...

Steps to reproduce:

Todo:

If the issue would be in 1) or 2), it would be the easiest thing to fix. If it would be 3) or 4), now that would be harder....

It happened also on the 15 MB partition, although it was fine for first 3 files and happened on 4th file (2.9 MB file size).

Happened on the 1st copied file on 250 MB BGM partition - might be related to sector # being higher, count of sectors written at one time being higher (with larger cluster size on larger partition), or write to SD card taking longer on these higher sectors and not waiting enough...

atarijookie commented 7 years ago

The simple write / verify test has been added to CE_TSTHD.PRG, will be tested on CE later.

If the issue can be reproduced with this test, it would be nice to run it without CE_DD present - to check if the issue isn't caused by CE_DD itself... (e.g. screwing with the Rwabs() routine due to Pexec() code).

atarijookie commented 7 years ago

Using the new test thing I now can see that:

atarijookie commented 7 years ago

TODO:

atarijookie commented 7 years ago

CE knows that error happened, ST then does REQUEST SENSE but don't show any warning: 4a 00 7b 02 10 00 - 02 (CMD: write, status byte: check condition) 43 00 00 00 12 00 - 00 (CMD: request sense)

Funny thing - if you write the same file over and over, the issue disappears, so you have to write the empty partition image on SD card when trying to reproduce error.

atarijookie commented 7 years ago

In one 2MB write these failed: 4a 00 7a f2 10 00 - sector 0x7af2, len 0x10 - status byte 02, bridge status E_OK, sense code 03 4a 00 80 02 10 00 - sector 0x8002, len 0x10 - status byte 02, bridge status E_OK, sense code 03 4a 00 79 02 10 00 - sector 0x7902, len 0x10 - status byte ff, bridge status E_OK

This is very repeatable - with same state of SD card (after image writing) even the sector numbers, where this happen, match exactly...

not failing on mmc_datatransfer: 214, 229, 245 but failing on mmc_datatransfer: 273 ! -- timeout on WAIT_FOR_CARD_NOT_BUSY ?

It's a timeout after WAIT_FOR_CARD_NOT_BUSY, and it's not caused by the tim3 setting / reseting / weird behavior.

This happens during sector 0x04 out of 0x10, so it's probably not caused by previous SCSI write operation, but due to this one.

Making the timeout 1 second instead of 0.5 s doesn't help - it will still get stuck in timeout.

Is there some data byte missing when doing spi2Dma_txRx()? (on start of pSend, on end of pSend buffer?)

Seems like it's missing some bytes due to bad DMA/SPI start/stop, but just adding 4 bytes before MMC_STARTBLOCK_MWRITE doesn't help.

atarijookie commented 7 years ago

Hypotheses:

atarijookie commented 7 years ago
atarijookie commented 7 years ago

note: in solo mode the card insertion causes solo mode coutdown - did this insertion restart the chip, or it just triggers the code which does the solo mode detection?

atarijookie commented 7 years ago

https://members.sdcard.org/downloads/pls/simplified_specs/part1_410.pdf

Some cards may require long and unpredictable times to write a block of data

Setting a number of write blocks to be pre-erased (ACMD23) will make a following Multiple Block Write operation faster compared to the same operation without preceding ACMD23. It is recommended using this command preceding CMD25, some of the cards will be faster for Multiple Write Blocks operation.

... the card can indicate write busy up to 500ms including single and multiple block write...

The host should use a fixed timeout for write operations rather than using a timeout calculated from the R2W_FACTOR parameter. It is strongly recommended for hosts to implement more than 500ms timeout value even if the card indicates the 250ms maximum busy length. Even if the card supports Speed Class, any multiple block write operation may indicate a busy period of up to a maximum of 250ms. The sum of the busy periods over an AU is limited by Speed Class.

atarijookie commented 7 years ago

Timeout stretched to 3 seconds, sending ACMD23 before multiple block write operation.

ACSI test altered, showing letters A to Z for how long the write of 128 kB took:

The SCSI log in Hans FW shows one error on write...

Turned off ACMD23, no write performance difference - am I doing something wrong when sending this cmd?

atarijookie commented 7 years ago

New firmware with extended timeout released, waiting for user response... Info page about the issue and test written: http://joo.kie.sk/?page_id=778

atarijookie commented 7 years ago

Add new type of test - do alternating read / write from different SCSI IDs - CE RPi vs CE SD, and alternating read write from CE RPi vs GEMDOS partition (e.g. using HDDRIVER).

atarijookie commented 7 years ago

So user reports that it's better for smaller files, but on even larger files (9 MB) it begins to fail once again. The only good thing there is that the read/write error count from Hans keeps getting incremented, so at least we know that Hans knows that the operation failed.

Currently recommended to try other SD cards which might behave better, or use translated drive for large data.

atarijookie commented 7 years ago

There's and idea to do write buffering through RPi, although that would be a lot of work... http://www.atari-forum.com/viewtopic.php?f=103&t=30390&p=309113#p309113

atarijookie commented 7 years ago

On TT+ACSI+ICD Pro, tried:

To try:

atarijookie commented 7 years ago

TT+ACSI+HDDRIVER 9.00:

Same results with TT+SCSI+HDDRIVER 9.00 - everything fine.

Falcon+SCSI+HDDRIVER 9.00:

atarijookie commented 7 years ago

When copying single 21 MB file, got 5 read errors, all are: mmc_datatransfer.c:36 -- LOG_ERROR(10);

This is the problematic command:

cmd   : 1f 28 00 00 00 73 a8 00 6d f0 00 
cmdLen: 11

starting sector: 00 00 73 a8
sector count   : 6d f0 (28144)
transfer size  : 13.7 MB

SysInfo reports:

Total ST-RAM: 14336 kB
Free ST-RAM: 13988 kB
Total TT-RAM: 0 kB

So the question now is - on Falcon does HDDRIVER: A) alloc as much free RAM as needed, then does large reads and large writes (e.g. 14072 kB, but that wouldn't fit in 13988 kB of free RAM, or frees some RAM and then allocs just that large buffer, so it would read exactly that), or...

B) use some (currently not supported) SCSI mechanism for issuing large read, which it can interrupt, then do large write which it can interrupt, or does more small writes (uninterrupted) and then does switch between read and write(s) without issuing another read...

This sounds like a question to Uwe S. and also requires some looking into SCSI-2 spec.

After simple test of making the timeout a value of number_of_mb * 3_seconds_per_mb it still fails, so either HDDRIVER has some maximum timeout which this overflows, or it's the SCSI mechanism...

If it's timeout in HDDRIVER: Can you set this timeout in HDDRUTIL? Is it fixed? At what value? If it's SCSI mechanism: how does it work? :-D

It still could be the limit of timer of Hans, where 72 MHz prescaled by 65k (to 1098 Hz), with 3 s per MB, results in maximum 19 MB transfer - that's 62586 timer ticks; the 20 MB transfer would result in 65880 timer ticks, and that's over 65535 maximum. So either prescale this timer to something less than 1098 Hz, or use different clock source, or redo the simple timeout mechanism... There are no 32 bit times on Hans?

miniupnp commented 7 years ago

A) I guess sysinfo reports free ST-RAM once it is loaded. So Free ST-RAM before it is loaded may be 14072kB. And then TOS (or HDDRIVER) use all free ST-RAM when copying files.. To be confirmed by Uwe Seimet

atarijookie commented 7 years ago

So HDDRIVER / TOS tries to do 13.77 MB data transfer, and each time it fails (time-outs) at around 11.3 seconds from the start. It looks like there's 10 second timeout in HDDRIVER as maximum for each transfer (+1 second of timeout in Hans, that's why it's 11, not 10 total). Will contact Uwe.

atarijookie commented 7 years ago

Mail to Uwe:

Hello Uwe,

I'm having this issue with my device (CosmosEx) and HDDRIVER on Falcon, and I would like your opinion on this :)

TT has 4 MB ST RAM, and 16 MB TT RAM. Falcon has 14 MB ST RAM, and no TT RAM.

The issue appears when copying large file from partition on SD card - 21 MB file. This issue doesn't appear on ST and TT through ACSI, and it didn't appear on my TT through SCSI either, as the transfers are not that big.

The issue appears when copying that large file on Falcon, when it tries to do SCSI Read(10) command, with transfer length of 0x6e38 (28216 decimal) sectors, which is roughly 13.77 MB of data. The operation on SD card takes probably too long (as the read speed is around 1.0 MB/s, in debug mode around 0.6 MB/s), and the whole read operation times-out at around 11 seconds from its start (Falcon stops receiving data from the device).

It seems that newer TOS tries to utilize as much ST RAM as there is free for the transfers, this results in that 13.77 MB transfer, which takes long and times out. This for sure isn't the case for ST where there's not that much ST RAM, but it seems it's happening on Falcon, and maybe it would happen on TT if it had more ST RAM.

So my questions are now:

  • do you think this is what's happening?
  • does HDDRIVER 9.0 have a 10 second timeout on READ (and / or WRITE) operations?
  • can user set longer timeout in HDDRUTIL?
  • can user set maximum sector count on these operations in HDDRUTIL?

Regards, Miro Nohaj (Jookie)

atarijookie commented 7 years ago

Adaptive timeout done in 4e3dad9

mikrosk commented 7 years ago

How come this doesn't happen on TOS with regular medias like an SD/CF card on Falcon?

miniupnp commented 7 years ago

@mikrosk IDE is different. Maybe no SCSI Hard drives is slow enough

miniupnp commented 7 years ago

and maybe it DOES happen

atarijookie commented 7 years ago

HDDRIVER 8.49

In    tpSCSICmd $5522
  Handle $3a22a
  CmdLen 10  Cmd $28:00:00:00:73:8A:00:6D:18:00
  TransferLen 14299136  Buffer $54390
  SenseBuffer $39aae  Timeout 24000
  Flags $00
-> STATUSERROR
In    tpSCSICmd $5518
  Handle $3a22a
  CmdLen 6  Cmd $1B:00:00:00:01:00
  TransferLen 0  Buffer $0
  SenseBuffer $39aae  Timeout 24000
  Flags $00
-> CHECK CONDITION  SenseKey $05  ASC $20  ASCQ $00
In    tpSCSICmd $5522
  Handle $3a22a
  CmdLen 10  Cmd $28:00:00:00:73:8A:00:6D:18:00
  TransferLen 14299136  Buffer $54390
  SenseBuffer $39aae  Timeout 24000
  Flags $00
-> STATUSERROR

HDDRIVER 9.00:

In    tpSCSICmd $5522
  Handle $38782
  CmdLen 10  Cmd $28:00:00:00:73:8A:00:6D:68:00
  TransferLen 14340096  Buffer $4b0b4
  SenseBuffer $38102  Timeout 24000
  Flags $00
-> STATUSERROR
In    tpSCSICmd $5518
  Handle $38782
  CmdLen 6  Cmd $1B:00:00:00:01:00
  TransferLen 0  Buffer $0
  SenseBuffer $38102  Timeout 24000
  Flags $00
-> CHECK CONDITION  SenseKey $05  ASC $20  ASCQ $00
In    tpSCSICmd $5522
  Handle $38782
  CmdLen 10  Cmd $28:00:00:00:73:8A:00:6D:68:00
  TransferLen 14340096  Buffer $4b0b4
  SenseBuffer $38102  Timeout 24000
  Flags $00
-> STATUSERROR

HDDRIVER 10.01:

In    tpSCSICmd $5522
  Handle $3297e
  CmdLen 10  Cmd $28:00:00:00:73:8A:00:6D:70:00
  TransferLen 14344192  Buffer $4949c
  SenseBuffer $32372  Timeout 24000
  Flags $00
-> STATUSERROR
In    tpSCSICmd $5518
  Handle $3297e
  CmdLen 6  Cmd $1B:00:00:00:01:00
  TransferLen 0  Buffer $0
  SenseBuffer $32372  Timeout 24000
  Flags $00
-> CHECK CONDITION  SenseKey $05  ASC $20  ASCQ $00

HDDRIVER 10.02b:

In    tpSCSICmd $5522
  Handle $13bbe
  CmdLen 10  Cmd $28:00:00:00:73:8A:00:6C:F8:00
  TransferLen 14282752  Buffer $58e1c
  SenseBuffer $135d0  Timeout 24000
  Flags $00
-> STATUSERROR
In    tpSCSICmd $5518
  Handle $13bbe
  CmdLen 6  Cmd $1B:00:00:00:01:00
  TransferLen 0  Buffer $0
  SenseBuffer $135d0  Timeout 24000
  Flags $00
-> CHECK CONDITION  SenseKey $05  ASC $20  ASCQ $00
atarijookie commented 7 years ago

As expected, RAW media on USB suffer from the same issue (or one of them)...

More info on the RAW issue in #152