Open samuel-williams-da opened 1 year ago
Please add any flaky or failing tests here :)
Check whether these tests are using Sandbox-on-X, as they will need to be migrated.
//daml-lf/interpreter:tests_test_suite_src_test_scala_com_digitalasset_daml_lf_speedy_SBuiltinBigNumericTest.scala Timed out on me https://dev.azure.com/digitalasset/daml/_build/results?buildId=137293&view=logs&j=a5e52b91-c83f-5429-4a68-c246fc63a4f7&t=d4864165-4be3-5e34-b483-a6b05303aa68&l=6187
While googling the "Stream 9 sent too many headers!" flake, I found the following related-looking links(?):
In addition, the stack overflow link points to:
Google search terms: akka sendHeaders has already been called
I also found this one
I'm looking into the source now, as I've been hitting it a lot more recently
So it doesn't get lost, here's a cleaned-up stack trace from one of the "too many headers" flakes
I believe the issue is here, as both the Future and onComplete for the result can call onNext. I've confirmed that the CI tests do run in live mode, so an unfortunately placed random status message could clash with the result and trigger the error.
I'd assume this is most common on slightly longer-running tests, given we only get a status message every 300-600ms. I'm going to make a quick fix and run it a bunch of times to see if I'm right.
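To make the suspected race concrete, here is a minimal sketch in Python (the real code is Scala, and this is not necessarily what the eventual fix does). `OnceGuard`, `StrictObserver`, and `send_once` are hypothetical names invented for illustration; the observer simply rejects a second message, mimicking the "too many headers" failure. Which message should win in the real code is a separate design decision.

```python
import threading


class OnceGuard:
    """Lets exactly one caller proceed, however many threads race for it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = False

    def try_claim(self):
        # Check-and-set under a lock so two racing callers cannot both win.
        with self._lock:
            if self._claimed:
                return False
            self._claimed = True
            return True


class StrictObserver:
    """Hypothetical observer that, like the failing stream, rejects a second message."""

    def __init__(self):
        self.messages = []

    def on_next(self, message):
        if self.messages:
            raise RuntimeError("sent too many headers")
        self.messages.append(message)


def send_once(guard, observer, message):
    """Forward the message only if no other path has already sent one."""
    if guard.try_claim():
        observer.on_next(message)
        return True
    return False


observer = StrictObserver()
guard = OnceGuard()
# Two paths race to answer: a periodic status update and the final result.
senders = [
    threading.Thread(target=send_once, args=(guard, observer, message))
    for message in ("status", "result")
]
for sender in senders:
    sender.start()
for sender in senders:
    sender.join()
```

Without the guard, both paths would call `on_next` and the second call would raise; with it, exactly one message reaches the observer regardless of thread timing.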
Stream 9 sent too many headers!
Fix confirmed with PR #16659; removed from the table.
We have recently had some reports of failures related to //triggers/service:test-oracle_test_suite_src_test_suite_scala_com_daml_lf_engine_trigger_TriggerServiceTestWithOracle.scala.
Modifying ci/build.yml to run the test 5 times allowed this failure to be observed on the main branch (see https://github.com/digital-asset/daml/pull/16877). This confirms that the test is flaky (as it otherwise often succeeds).
Remy has recently relayed a potential flaky test on Windows. Typical symptom is:
//triggers/service:test_6 FAILED in 117.3s
C:/users/u/_bazel_u/vvgx3zjt/execroot/com_github_digital_asset_daml/bazel-out/x64_windows-opt/testlogs/triggers/service/test_6/test.log
Found //compiler/damlc/tests:repl-functests being flaky once on https://github.com/digital-asset/daml/pull/16984
@nickchapman-da mentioned the following in weekly team meeting:
//daml-lf/interpreter:tests_test_suite_src_test_scala_com_digitalasset_daml_lf_speedy_SBuiltinBigNumericTest.scala TIMEOUT in 60.1s
Given the discussion, I would probably just increase the timeout for SBuiltinBigNumericTest.
Gary: ignore M1 tests for now given the completely unrelated flakes.
Sam: Windows test needs changes to a Shake rule so we can properly propagate.
FAIL: //compiler/damlc/tests:daml-doctest (see C:/users/u/_bazel_u/lwsr2dl2/execroot/com_github_digital_asset_daml/bazel-out/x64_windows-opt/testlogs/compiler/damlc/tests/daml-doctest/test.log)
INFO: From Testing //compiler/damlc/tests:daml-doctest:
==================== Test output for //compiler/damlc/tests:daml-doctest:
daml-doctest
generate doctest module
empty module: OK (0.13s)
example in doc comment: FAIL
Exception: user error (Pattern match failure in do expression at compiler\damlc\tests\src\DA\Test\DamlDocTest.hs:100:17-23)
Use -p '/example in doc comment/' to rerun this test only.
example in nondoc comment: OK (0.12s)
multiple examples in one comment: OK (0.18s)
example in code block: OK (0.16s)
multiline result: OK (0.10s)
I've increased the timeout for SBuiltinBigNumericTest (https://github.com/digital-asset/daml/pull/17107) and I've added diagnostics to the Windows daml-doctest failure (https://github.com/digital-asset/daml/pull/17111)
https://dev.azure.com/digitalasset/daml/_build/results?buildId=145708&view=logs&jobId=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=024164e1-ccd8-5d04-bea4-1ac3c8885917
Flaky on Linux (!): //libs-haskell/da-hs-base:da-hs-base-telemetry-tests
^ Minimal logs on circle CI; all we know is exit code 139, which is a segfault. I cannot replicate it on my machine. It's unclear whether the segfault would stop existing logs from being flushed. We get nothing from Tasty, so it could be the calls to:
withTempDir
setPermissions
withEnv
I'll explore those first, but otherwise - we're a little in the dark here.
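For reference, the reading of 139 as a segfault follows the common shell convention that a process killed by signal N is reported with exit code 128 + N. A small sketch of that arithmetic (the helper name `signal_from_exit_code` is made up for illustration):

```python
import signal


def signal_from_exit_code(exit_code):
    """Return (signal number, signal name) for a 128 + N style exit code, else None."""
    if exit_code <= 128:
        return None
    number = exit_code - 128
    try:
        name = signal.Signals(number).name
    except ValueError:
        # Not a signal number known on this platform.
        name = "unknown"
    return number, name


print(signal_from_exit_code(139))
```

On Linux, 139 - 128 = 11, which is SIGSEGV.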
Rémy:
//ledger-service/http-json:integration-tests-ce_test_suite_src_it_scala_http_WebsocketServiceIntegrationTest.scala
Sometimes the participant errors randomly with canton fixture with participant overloaded. We might be able to either increase this (but likely not in CE) or throttle our requests to Canton via the Canton fixture.
Or consider using Enterprise in some more tests.
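As a rough idea of what throttling could look like, here is a sketch in Python that caps the number of in-flight requests with a semaphore. Everything in it is an assumption for illustration: `Throttle`, `fake_request`, and the limit of 2 are invented, and the real fixture is Scala and may need a different mechanism.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Caps how many requests may run at the same time."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self._in_flight = 0
        self.peak_in_flight = 0  # recorded only so the cap can be checked

    def run(self, request):
        # Block until a slot is free, then run the request while holding it.
        with self._slots:
            with self._lock:
                self._in_flight += 1
                self.peak_in_flight = max(self.peak_in_flight, self._in_flight)
            try:
                return request()
            finally:
                with self._lock:
                    self._in_flight -= 1


def fake_request(i):
    """Hypothetical stand-in for a call to the participant."""
    time.sleep(0.01)
    return i * 2


throttle = Throttle(max_in_flight=2)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda i: throttle.run(lambda: fake_request(i)), range(10)))
```

Eight workers submit ten requests, but at most two are ever in flight at once, which is the property we would want in order to stay under the participant's overload limit.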
New flaky test in ActionTest.daml. Running tests for https://github.com/digital-asset/daml/pull/17609/commits/98a444962b889e9e52d7f9025cf7ebab343f6f67 in https://dev.azure.com/digitalasset/adadc18a-d7df-446a-aacb-86042c1619c6/_apis/build/builds/152277/logs/161 fails, but running bazel run //compiler/damlc/tests:integration-v2dev -- -p '/ActionTest/' locally succeeds.
I've downloaded the log, I'll DM it on request.
Relevant parts of the log:
2023-10-24T13:10:01.3200688Z ActionTest.daml
2023-10-24T13:10:01.3201026Z Build log: OK
2023-10-24T13:10:01.3201447Z Time: DLint = 0.03s
2023-10-24T13:10:01.3201700Z Time: LF convert = 0.38s
2023-10-24T13:10:01.3202054Z Time: LF pretty-printing = 0.04s
2023-10-24T13:10:01.3202369Z Time: GHC compile = 0.01s
2023-10-24T13:10:01.3202680Z Time: Core pretty-printing = 0.06s
2023-10-24T13:10:01.3203010Z Time: LF type check = 0.00s
2023-10-24T13:10:01.3203331Z Time: LF saving = 0.09s
2023-10-24T13:10:01.3203646Z Time: LF scripts execution = 3.59s
2023-10-24T13:10:01.3203984Z Time: JSON saving = 0.35s
2023-10-24T13:10:01.3204435Z Check diagnostics: FAIL
2023-10-24T13:10:01.3204906Z File:
2023-10-24T13:10:01.3206782Z /home/vsts/.cache/bazel/_bazel_vsts/9b01b58e95c1fb4c476c067a6a807d1d/sandbox/linux-sandbox/9444/execroot/com_github_digital_asset_daml/bazel-out/k8-opt/bin/compiler/damlc/tests/integration-v2dev.runfiles/com_github_digital_asset_daml/compiler/damlc/tests/daml-test-files/ActionTest.daml
2023-10-24T13:10:01.3208487Z Hidden: no
2023-10-24T13:10:01.3208831Z Range: 18:1-18:12
2023-10-24T13:10:01.3209116Z Source: Script
2023-10-24T13:10:01.3209389Z Severity: DsError
2023-10-24T13:10:01.3209649Z Message:
2023-10-24T13:10:01.3209899Z Script execution failed:
2023-10-24T13:10:01.3210229Z Evaluation timed out after 3 seconds
2023-10-24T13:10:01.3210575Z
2023-10-24T13:10:01.3210865Z Ledger time: 1970-01-01T00:00:00Z
2023-10-24T13:10:01.3211245Z Wrong number of diagnostics, expected 0, but got 1
2023-10-24T13:10:01.3211628Z
2023-10-24T13:10:01.3212091Z Use -p '/ActionTest.daml.Check diagnostics/' to rerun this test only.
2023-10-24T13:10:01.3212604Z AliasCompression.daml
//daml-assistant:test appeared to be flaky, but it might just be a network issue. It worked on Linux, but not on MacOS. Either way, I've fixed the test so it no longer connects to the internet. Noted here for future historians ;)
2023-11-13T14:16:38.4306830Z FAIL: //daml-assistant:test (see /private/var/tmp/_bazel_vsts/9969e26d01e3c239b6c915f90e435c1a/execroot/com_github_digital_asset_daml/bazel-out/darwin-opt/testlogs/daml-assistant/test/test.log)
2023-11-13T14:16:38.4321220Z INFO: From Testing //daml-assistant:test:
2023-11-13T14:16:38.4328340Z ==================== Test output for //daml-assistant:test:
2023-11-13T14:16:38.4333050Z DA.Daml.Assistant
2023-11-13T14:16:38.4333860Z DA.Daml.Project.Util.ascendants
2023-11-13T14:16:38.4335280Z unit tests: OK
2023-11-13T14:16:38.4336310Z ascendants is nonempty: OK
2023-11-13T14:16:38.4338030Z +++ OK, passed 100 tests.
2023-11-13T14:16:38.4339290Z head . ascendants == id: OK
2023-11-13T14:16:38.4340650Z +++ OK, passed 100 tests; 20 discarded.
2023-11-13T14:16:38.4341840Z head . ascendants == id (2): OK
2023-11-13T14:16:38.4345580Z +++ OK, passed 100 tests; 25 discarded.
2023-11-13T14:16:38.4347100Z tail . ascendants == ascendants . takeDirectory: OK
2023-11-13T14:16:38.4348860Z +++ OK, passed 100 tests; 34 discarded.
2023-11-13T14:16:38.4349840Z DA.Daml.Assistant.Env.getDamlPath
2023-11-13T14:16:38.4350970Z getDamlPath returns DAML_HOME: OK
2023-11-13T14:16:38.4352450Z getDamlPath returns DAML_HOME (made absolute): OK
2023-11-13T14:16:38.4353900Z posix-specific tests
2023-11-13T14:16:38.4354900Z getDamlPath gets app user data directory by default: OK
2023-11-13T14:16:38.4356160Z DA.Daml.Assistant.Env.getProjectPath
2023-11-13T14:16:38.4357310Z getProjectPath returns environment variable: OK
2023-11-13T14:16:38.4358880Z getProjectPath returns environment variable (made absolute): OK
2023-11-13T14:16:38.4360400Z getProjectPath returns nothing: OK
2023-11-13T14:16:38.4361880Z getProjectPath returns current directory: OK
2023-11-13T14:16:38.4363380Z getProjectPath returns parent directory: OK
2023-11-13T14:16:38.4364920Z getProjectPath returns grandparent directory: OK
2023-11-13T14:16:38.4366470Z getProjectPath prefers parent over grandparent: OK
2023-11-13T14:16:38.4367720Z DA.Daml.Assistant.Env.getSdk
2023-11-13T14:16:38.4368810Z getSdk returns DAML_SDK_VERSION and DAML_SDK: OK
2023-11-13T14:16:38.4370320Z getSdk determines DAML_SDK from DAML_SDK_VERSION: OK
2023-11-13T14:16:38.4371920Z getSdk determines DAML_SDK_VERSION from DAML_SDK: OK
2023-11-13T14:16:38.4373480Z getSdk determines DAML_SDK and DAML_SDK_VERSION from project config: OK
2023-11-13T14:16:38.4375120Z getSdk: DAML_SDK overrides project config version: OK
2023-11-13T14:16:38.4376660Z getSdk: DAML_SDK_VERSION overrides project config version: OK
2023-11-13T14:16:38.4378190Z getSdk: Returns Nothings if .daml/sdk is missing.: OK
2023-11-13T14:16:38.4379830Z DA.Daml.Assistant.Env.getDispatchEnv
2023-11-13T14:16:38.4380990Z getDispatchEnv should be idempotent: OK
2023-11-13T14:16:38.4382500Z getDispatchEnv should override getDamlEnv: FAIL
2023-11-13T14:16:38.4391670Z Exception: AssistantError {errContext = Nothing, errMessage = Just "HTTP connection to github.com failed", errInternal = Just "HttpExceptionRequest Request {\n host = \"api.github.com\"\n port = 443\n secure = True\n requestHeaders = [(\"Accept\",\"application/vnd.github+json\"),(\"User-Agent\",\"Daml-Assistant/0.0\")]\n path = \"/repos/digital-asset/daml/releases\"\n queryString = \"?per_page=100\"\n method = \"GET\"\n proxy = Nothing\n rawBody = False\n redirectCount = 10\n responseTimeout = ResponseTimeoutMicro 10000000\n requestVersion = HTTP/1.1\n proxySecureMode = ProxySecureWithConnect\n}\n (InternalException (HandshakeFailed (Error_Protocol (\"certificate rejected: security: createProcess: posix_spawnp: failed (Undefined error: 0)\",True,CertificateUnknown))))"}
2023-11-13T14:16:38.4401760Z Use -p '$0=="DA.Daml.Assistant.DA.Daml.Assistant.Env.getDispatchEnv.getDispatchEnv should override getDamlEnv"' to rerun this test only.
2023-11-13T14:16:38.4404210Z getDispatchEnv should override getDamlEnv (2): OK
2023-11-13T14:16:38.4405530Z DA.Daml.Assistant.Install
2023-11-13T14:16:38.4406510Z initial install a tarball: OK (0.01s)
2023-11-13T14:16:38.4407890Z unix-specific tests
2023-11-13T14:16:38.4408950Z initial install a tarball from symlink: OK
2023-11-13T14:16:38.4410530Z reject an absolute symlink in a tarball: OK
2023-11-13T14:16:38.4413370Z reject an escaping symlink in a tarball: OK
2023-11-13T14:16:38.4415210Z check that relative symlink is used in installation: OK
2023-11-13T14:16:38.4416240Z
2023-11-13T14:16:38.4416740Z 1 out of 30 tests failed (0.11s)
Caused by a difference in MacOS and Linux behaviour.
multi-package tests:
Changing ghc-options, or other `build-options` should invalidate the cache: FAIL
Exception: ./package-b/daml.yaml: openFile: resource busy (file is locked)
Has not occurred in recent rebuilds of the compiler. Will wait until this pops up again.
I've had trouble on Windows with //daml-script/runner:tests, https://dev.azure.com/digitalasset/adadc18a-d7df-446a-aacb-86042c1619c6/_apis/build/builds/161929/logs/160
Some flakiness for //daml-lf/validation:upgrade-tests - maybe one in 5 times, it ends with what seems to be CantonFixture exiting early.
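As a back-of-the-envelope check on how many reruns it takes to see a flake like this, here is a small sketch. It assumes runs are independent and takes the "one in 5" estimate at face value (failure rate 0.2); both function names are made up for illustration.

```python
import math


def chance_of_seeing_failure(failure_rate, runs):
    """Probability that at least one of `runs` independent runs fails."""
    return 1 - (1 - failure_rate) ** runs


def runs_needed(failure_rate, confidence):
    """Smallest number of runs that shows a failure with at least `confidence` probability."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


print(round(chance_of_seeing_failure(0.2, 5), 3))  # about 0.672
print(runs_needed(0.2, 0.95))                      # 14
```

So five runs would catch a one-in-five flake roughly two times out of three, and around 14 runs are needed to be 95% sure of seeing it at least once.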
Got a flake from //daml-script/runner:tests on Windows; restarting appears to have fixed it.
Following are the tests identified as flaky under the Daml language/damlc team.
Note - the "Scenario service backend error: BErrorFail StatusDeadlineExceeded" error is caused by previously unhandled failures. After #16503 is merged, we should always get more information in these cases.
Unexpected exception on request, please report!
Received RST_STREAM with error code 8 - some kind of timeout
DeadlineExceeded from Grpc