ARMmbed / DAPLink

https://daplink.io
Apache License 2.0

Flashing errors with recent Windows update #1025

Open microbit-carlos opened 1 year ago

microbit-carlos commented 1 year ago

A recent Windows 10 and Windows 11 update has started triggering checksum and timeout errors on DAPLink.

This has been reported by micro:bit & Calliope users, and we have been able to replicate it in Windows 10 and 11 when the OS is kept up-to-date. We haven't tried Windows 8.1, but that reached end of life last January.

Triggering Windows Update

The cumulative updates have been found and installed using this Microsoft catalogue: https://www.catalog.update.microsoft.com/Search.aspx?q=Cumulative%20Update%20Windows%2011%2022H2%20x64

Windows 11 22H2

I went through installing and uninstalling cumulative updates, and in my findings the problem is triggered when installing 2023-02 Cumulative Update Preview for Windows 11 Version 22H2 for x64-based Systems (KB5022913) from the 28th of February, which updates Windows 11 22H2 to OS Build 22621.1344.

The previous cumulative update KB5022845 from the 14th of Feb (OS Build 22621.1265) doesn't trigger this issue.

Windows 11 21H2

The Microsoft update catalog doesn't show any updates for Win 11 21H2 since November 2022, so I won't bother to test this Windows version.

Windows 10 22H2

The issue was triggered for me using 2023-03 Cumulative Update Preview for Windows 10 Version 22H2 for x64-based Systems (KB5023773) from the 21st of March, which updates the OS to build 1904x.2788.

The previous cumulative update KB5023696 from the 14th of March (OS Build 1904x.2728) doesn't trigger this issue.

Windows 10 21H2

We've also been able to replicate this issue in Win 10 21H2. Microsoft is still releasing updates for this OS version, so it should be possible to identify the specific cumulative update that introduced the issue. I probably won't be looking into this one, nor Win 10 20H2, as it's unlikely to provide any additional useful information.

Failure modes

Note: It's worth mentioning that the micro:bit V2 port contains an additional feature: if DAPLink encounters an error, it reflashes the target with a small custom programme that scrolls the error code on the micro:bit LED matrix display. This is relevant because on some occasions this error programme is not flashed.

We've encountered a few different ways in which errors emerge:

The errors are not triggered on every flash, and different users have reported different error frequencies. In our internal testing some teammates measured a 20% failure rate and others up to 60%. Some users have reported errors happening on "almost every flash".

We've used micro:bit Universal Hex files for the majority of these tests, which are a bit more resilient to this issue (more info in the "Identifying the Cause" section), so other DAPLink users flashing Intel Hex files might encounter this problem more often. It's also likely that the micro:bit user who reported an error on "almost every flash" was using Intel Hex files as well.

Assert
File: ../../../source/daplink/drag-n-drop/vfs_manager.c
Line: 361
Source: Application
Hexdumps
fffffff1
20000fc0
20005e88
00000000
20005eac
00000000
00000000
1fffeb9c

Identifying the Cause

I’ve collected a couple of RTT logs from DAPLink with additional debug prints to track how the OS writes the file blocks to disk and to peek at the actual data. While it’s still a bit early (I need more time to capture more data and analyse it), initial findings point at the problem being caused by file blocks being sent out of order by the OS.

In previous Windows versions, the file blocks are sent in order, but after the listed Windows updates are installed it looks like some file blocks are first sent as zeros, and then later down the file transfer the blocks are sent again with the real file data.

For example:

And this can happen more than once on the same file transfer.

However, not every file transfer sends blocks out of order; some are sent in order and it all works fine.
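
As a rough illustration of the pattern described above, the sketch below scans a sequence of block writes and reports any block that was first written as all zeros and later rewritten with non-zero data. It is not taken from the DAPLink source or from the RTT logs; the `(sector, data)` tuple format and the function name `find_zero_then_rewritten` are assumptions made for this example.

```python
from typing import Iterable


def find_zero_then_rewritten(writes: Iterable[tuple[int, bytes]]) -> list[int]:
    """Return sector numbers first written as zeros, then rewritten with data.

    `writes` is a hypothetical log of (sector_number, block_bytes) pairs in
    the order the OS sent them.
    """
    seen: set[int] = set()        # every sector written at least once
    zero_first: set[int] = set()  # sectors whose first write was all zeros
    rewritten: list[int] = []
    for sector, data in writes:
        is_zero = not any(data)
        if sector not in seen:
            seen.add(sector)
            if is_zero:
                zero_first.add(sector)
        elif sector in zero_first and not is_zero:
            # Placeholder zeros replaced by the real file contents.
            rewritten.append(sector)
            zero_first.discard(sector)
    return rewritten


# Sector 1 arrives as zeros first and with real data only after sector 2.
example_writes = [
    (0, b"A" * 512),
    (1, bytes(512)),
    (2, b"C" * 512),
    (1, b"B" * 512),
]
suspect_sectors = find_zero_then_rewritten(example_writes)
```

A transfer where every block arrives once and in order yields an empty list, which matches the transfers that flash successfully.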

The checksum errors are encountered when the OS sends a block filled with zeros and DAPLink tries to calculate the checksum of an Intel Hex record. I still need to capture a better log for timeout errors, but I believe those are usually triggered when out-of-order blocks are ignored by DAPLink: once the OS has finished sending the file, DAPLink keeps waiting for more data to arrive (as the ignored blocks are not counted when measuring how much file data was transferred) until it eventually times out.

For the micro:bit specifically we use Universal Hex files, a superset of the Intel Hex format, which contains data for micro:bit V1 and micro:bit V2 in the same file. In file transfers where the out-of-order blocks correspond only to a section of the Universal Hex file that is not relevant to the target MCU being flashed, the flash can still be successful. So while I haven't yet compared failure rates of Intel vs Universal Hex, it's very likely Intel Hex (and bin) files fail more frequently.
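
To make the checksum failure concrete, here is a minimal sketch of Intel Hex record validation. It is not DAPLink's parser, and `verify_record` is a name invented for this example. It relies only on the Intel Hex rule that all bytes of a record, including the trailing checksum byte, sum to zero modulo 256. A record whose tail has been replaced with NUL bytes (as would happen if part of it fell into a zero-filled block) fails the check.

```python
def verify_record(line: str) -> bool:
    """Return True if `line` is a well-formed Intel Hex record.

    Illustrative only; this is not how DAPLink implements its parser.
    """
    if not line.startswith(":"):
        return False
    try:
        raw = bytes.fromhex(line[1:].strip())
    except ValueError:
        # Non-hex characters, e.g. NUL bytes from a zero-filled block.
        return False
    # Layout: length (1) + address (2) + type (1) + data (length) + checksum (1)
    if len(raw) < 5 or raw[0] != len(raw) - 5:
        return False
    # All bytes, checksum included, must sum to 0 modulo 256.
    return sum(raw) & 0xFF == 0


good_eof = verify_record(":00000001FF")            # standard end-of-file record
good_ela = verify_record(":020000040000FA")        # extended linear address 0x0000
zeroed = verify_record(":0200000400" + "\x00" * 4)  # tail overwritten with zeros
```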

A checksum error log and Universal Hex file can be found here:

(Also note that because there is a lot of log data captured, data is sometimes dropped, so it might look like some blocks are not being sent. We can look at the variables tracking the file size transferred to confirm that the data has been processed; the RTT buffer was likely full.)

Workarounds

Using robocopy with the /z flag, for restartable mode, seems to work so far.

For example, with the terminal at the path where your file.hex is located, and assuming DAPLink is mounted as drive E:\:

robocopy /z . E:\ file.hex
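
For scripted flashing, the same command can be wrapped in a few lines of Python. This is only a sketch: the helper names are made up for this example, and the drive letter is an assumption as above. It relies on robocopy's documented exit codes, where values below 8 mean no copy failures and a value of 1 means files were copied.

```python
import subprocess


def build_robocopy_cmd(src_dir: str, dest: str, filename: str) -> list[str]:
    """Build the argument list for `robocopy /z <src_dir> <dest> <filename>`."""
    return ["robocopy", "/z", src_dir, dest, filename]


def robocopy_ok(returncode: int) -> bool:
    """Robocopy exit codes of 8 or above indicate at least one failure."""
    return 0 <= returncode < 8


def flash_with_robocopy(src_dir: str, dest: str, filename: str) -> bool:
    """Copy the file to the DAPLink drive in restartable mode (Windows only)."""
    result = subprocess.run(build_robocopy_cmd(src_dir, dest, filename), check=False)
    return robocopy_ok(result.returncode)


# Same invocation as the command line above; E:\ is an assumed drive letter.
example_cmd = build_robocopy_cmd(".", "E:\\", "file.hex")
```

Note that `check=True` would be wrong here, since robocopy returns a non-zero code on a successful copy.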

Also, WebUSB flashing works, so for Intel Hex and bin files this demo from DAPJs can still flash the boards: https://armmbed.github.io/dapjs/examples/daplink-flash/web.html

For micro:bit Universal Hex files, this online WebUSB tool will work too: https://microbit.org/tools/webusb-hex-flashing/

mathias-arm commented 1 year ago

@polat-ahmet reported in #1032:

While flashing the max32666fthr board with Drag and Drop using the max32625 debugger, it often fails (nearly 70% fail, 30% success). It's OK when I try with a small (~40kb) binary, but I'm having trouble with a bigger (~500kb) binary.

It was working properly before. I observed the problem with Windows updates KB5026361 and KB5025221. When I uninstalled these updates and tried it on KB5023696, I did not encounter any problems and flashing succeeded.

free2create commented 1 year ago

@microbit-carlos In case this is related: I am seeing this error when using the microbit on Ubuntu 23.04. It initially works a few times, then the timeouts (503) start happening. I haven't tried stopping/starting the USB bus yet, since one bus impacts the keyboard and the other my wireless. But I could dig in deeper.

@microbit-carlos FYI: On the same hardware I booted into Windows 10 and it worked every time. This sort of feels like the USB emulation is incomplete, so could this bug be on the microbit side? By that I mean that after a file is dropped onto the microbit, the USB connection seems to be reset so users can have another go at flashing. This USB reset process may be faulty, and some required DAPLink API calls may not be made when they should have been.

ozersa commented 1 year ago

@mathias-arm This is a critical issue for us. We would appreciate it if you could provide an estimate for this issue: when can it be fixed?

microbit-carlos commented 1 year ago

We had an update from Microsoft that they expect to release a fix in the September Windows update 🎉

fesc-q commented 1 year ago

Tested with the Windows 11 22H2 22621.2283 build from September 12th 2023. The issue is still reproduced.

microbit-carlos commented 1 year ago

Yes, it looks like the update has been pushed to October, hopefully it'll finally be out by then.

fesc-q commented 11 months ago

From Microsoft

checked internally the update is released to fix the issue already this week.

I tested the Windows 11 22H2 22621.2428 build from Oct 10th 2023. The issue is still reproduced on an old DAPLink release and a relatively new J-Link OB release.

top-5 commented 11 months ago

I can still consistently repro this bug on Win 11 22H2 23560.1000 insider preview. DAPLink Build ID: v0257-gc782a5ba. It doesn't seem like any recent updates have fixed much yet.

selimgullulu commented 11 months ago

The issue is still reproduced while flashing firmware to the MAX32625PICO.

microbit-carlos commented 11 months ago

This should be fixed with Windows 11 build 22621.2506, released on the 31st of October. https://support.microsoft.com/en-gb/topic/october-31-2023-kb5031455-os-builds-22621-2506-and-22631-2506-preview-6513c5ec-c5a2-4aaf-97f5-44c13d29e0d4

I've tested this build with a BBC micro:bit with DAPLink 0257 and could not replicate the issue anymore.

@felix-qorvo @top-5 @selimgullulu could you update to this version and try again? Thanks!

selimgullulu commented 11 months ago

Hi @microbit-carlos , is there an equivalent update for Windows 10? This seems to be for Windows 11. Thanks Selim

microbit-carlos commented 11 months ago

I don't know, sorry. Do you have the latest cumulative update installed (probably KB5031445)? And does the issue still occur there?

fesc-q commented 10 months ago

Verified to be fixed in Windows updates:

Win 10: https://support.microsoft.com/en-us/topic/october-26-2023-kb5031445-os-build-19045-3636-preview-03f350cb-57f9-45e6-bfd7-438895d3c7fa

Win 11: https://support.microsoft.com/en-us/topic/october-31-2023-kb5031455-os-builds-22621-2506-and-22631-2506-preview-6513c5ec-c5a2-4aaf-97f5-44c13d29e0d4

selimgullulu commented 10 months ago

Hi, I installed KB5031445 on two different Windows 10 laptops (Surface & Dell) and the drag & drop success rate was 100%. I'm waiting for confirmation from some colleagues about the resolution. In the meantime, can you please let me know if your PCs are also using encryption for storage? Has anyone experienced this problem on a PC withOUT encryption? Thanks.