dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License

Non-deterministic snapshots on Windows x64 #56884

Open athomas opened 1 month ago

athomas commented 1 month ago

https://ci.chromium.org/ui/p/dart-internal/builders/ci/dart-sdk-win-beta/165/overview exposed non-deterministic Windows x64 snapshots (strangely, ia32 snapshots seemed to be deterministic).

Non-deterministic snapshots:

https://chrome-infra-packages.appspot.com/p/flutter/dart-sdk/windows-amd64/+/git_revision:d916a5f69a486de98316900f19ef0ff46834b03d
https://storage.cloud.google.com/dart-archive/channels/beta/raw/hash/d916a5f69a486de98316900f19ef0ff46834b03d/sdk/dartsdk-windows-x64-release.zip

@rmacnak-google any ideas?

athomas commented 1 month ago

Also, it's not all snapshots in the SDK, just some.
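
One way to narrow down which snapshots vary is to hash every snapshot file in two builds of the same revision and list the ones whose bytes differ. The following is a minimal sketch, not part of the SDK tooling: the function names and the `*.snapshot` file pattern are assumptions made for illustration, and the two directories are expected to be extracted SDK builds.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large snapshots are not loaded whole.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def differing_snapshots(build_a, build_b, pattern="*.snapshot"):
    """List relative paths of matching files that differ between two builds.

    A file counts as differing if its bytes differ or if it exists in
    only one of the two directories. The default pattern is an assumption.
    """
    def index(root):
        root = Path(root)
        return {
            p.relative_to(root).as_posix(): file_digest(p)
            for p in root.rglob(pattern)
            if p.is_file()
        }

    a, b = index(build_a), index(build_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty result means the matching files are byte-for-byte identical across the two builds. A non-empty result names the snapshots to investigate.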

rmacnak-google commented 1 month ago

IA32 is special because it doesn't support AppJIT (for the same reason it doesn't support AOT: our IA32 code isn't relocatable). So this is probably non-determinism in the AppJIT training. It could be an issue in the VM, or it could be that the training programs are non-deterministic.

a-siva commented 1 month ago

@athomas how critical is it for these snapshots to be deterministic? Variation between training runs is the likely source of this non-determinism. Doing the training run with just `--help` might fix it, but that would not be ideal. We are also working towards switching all these snapshots to AOT snapshots, and maybe that is the right fix for this.

athomas commented 1 month ago

If we think we'll have the AOT snapshots in a reasonable timeframe, then I'd rather we go for that. I don't know how frequently this will still happen in the release process (I implemented some retries to mitigate this failure mode) and there is a workaround (bump the version, create a new release).

rmacnak-google commented 1 month ago

This reproduces on Linux.

rmacnak-google commented 1 month ago

Now I only observe non-determinism for the analysis server snapshot. I see that during its training run, it uses timers, which would cause non-determinism.
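
To illustrate why timers in a training run can lead to different output, here is a toy model, not VM or analysis server code. It assumes that whatever is recorded during training depends on how much work completes before a deadline. `training_run`, `fake_clock`, and the workload step names are all hypothetical. The clock is injected so the effect can be reproduced without relying on real wall-clock time.

```python
from collections import Counter


def training_run(clock, budget, workload=("parse", "resolve", "lint")):
    """Toy training run: cycle through workload steps until the time
    budget is spent, then return how often each step ran."""
    counts = Counter()
    deadline = clock() + budget
    i = 0
    while clock() < deadline:
        counts[workload[i % len(workload)]] += 1
        i += 1
    return dict(counts)


def fake_clock(step):
    """Return a clock that advances by `step` on every call, standing in
    for a machine where each unit of work takes `step` time units."""
    now = 0

    def clock():
        nonlocal now
        now += step
        return now

    return clock


fast = training_run(fake_clock(1), budget=10)
slow = training_run(fake_clock(2), budget=10)
print(fast)  # {'parse': 3, 'resolve': 3, 'lint': 3}
print(slow)  # {'parse': 2, 'resolve': 1, 'lint': 1}
```

The same program with the same time budget records different counts depending only on how fast the clock advances relative to the work. Bounding the run by a fixed number of iterations instead of a deadline would make the recorded counts independent of timing in this model.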