fetchai / ledger-subquery

This is the Fetch Ledger SubQuery project, an indexer for the Fetch network.
MIT License
2 stars 10 forks source link

Bug report: flickering tests #228

Open bryanchriswhite opened 1 year ago

bryanchriswhite commented 1 year ago

Prerequisites

Expected Behavior

CI should be more deterministic than they currently are. More often than not, I find myself having to re-run tests, sometimes more than once.

Current Behavior

Cumulative E2E test behavior

| Deterministic                      |                            Chaotic |
| <----------------------------------|------------*---------------------> |
|                                    |            ^                       |
                                        getting uncomfortably far
                                       from "Determinstic" and
                                      trending in the wrong direction

We've observed flickering in the end-to-end suite which covers this repo, as is evident by CI re-run going back some time. I've classified these issues given recent examples but this classification may not be exhaustive:

Schema related errors

These have been around the longest as far as I recall. The specific entity and relation vary from run to run. Our working hypothesis has been that it is caused by the graphql-engine service coming up before the subquery-node has finished setting up the DB schema. On boot, graphql-engine will scan the DB schema and use that to determine how to define it's GQL resolvers, which if done before all tables have finished generating, would likely result in this kind of error. It's clearly a failure of integration at runtime as the schema and resulting generated code is all present and correct. This hypothesis is corroborated by the fact that when this phenomenon is experienced locally, restarting the graphql-engine consistently fixes the issue (whereas restarting the docker composition does not necessarily).

======================================================================
ERROR: test_blocks_query (entities.test_authz_exec.TestAuthzExec)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ledger-subquery/ledger-subquery/tests/e2e/entities/test_authz_exec.py", line 179, in test_blocks_query
    result = self.gql_client.execute(query)
  File "/home/runner/.local/share/virtualenvs/ledger-subquery-lKQxoaNQ/lib/python3.9/site-packages/gql/client.py", line 388, in execute
    data = loop.run_until_complete(
  File "/opt/hostedtoolcache/Python/3.9.[16](https://github.com/fetchai/ledger-subquery/actions/runs/3749061679/jobs/6367112249#step:13:17)/x64/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/runner/.local/share/virtualenvs/ledger-subquery-lKQxoaNQ/lib/python3.9/site-packages/gql/client.py", line 285, in execute_async
    return await session.execute(
  File "/home/runner/.local/share/virtualenvs/ledger-subquery-lKQxoaNQ/lib/python3.9/site-packages/gql/client.py", line 1220, in execute
    result = await self._execute(
  File "/home/runner/.local/share/virtualenvs/ledger-subquery-lKQxoaNQ/lib/python3.9/site-packages/gql/client.py", line 1111, in _execute
    self.client.validate(document)
  File "/home/runner/.local/share/virtualenvs/ledger-subquery-lKQxoaNQ/lib/python3.9/site-packages/gql/client.py", line 149, in validate
    raise validation_errors[0]
graphql.error.graphql_error.GraphQLError: Cannot query field 'subMessages' on type 'AuthzExec'. Did you mean 'message' or 'messageId'?

GraphQL request:12:25
11 |                         grantee
12 |                         subMessages {
   |                         ^
13 |                             nodes {

======================================================================
ERROR: test_primitives_query (entities.test_native_primitives.TestNativePrimitives) [primitive id by regex]
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ledger-subquery/ledger-subquery/tests/e2e/entities/test_native_primitives.py", line 338, in test_primitives_query
    result = self.gql_client.execute(query)
  File "/home/runner/.local/share/virtualenvs/ledger-subquery-lKQxoaNQ/lib/python3.9/site-packages/gql/client.py", line 388, in execute
    data = loop.run_until_complete(
  File "/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/runner/.local/share/virtualenvs/ledger-subquery-lKQxoaNQ/lib/python3.9/site-packages/gql/client.py", line 285, in execute_async
    return await session.execute(
  File "/home/runner/.local/share/virtualenvs/ledger-subquery-lKQxoaNQ/lib/python3.9/site-packages/gql/client.py", line 12[20](https://github.com/fetchai/ledger-subquery/actions/runs/3749061679/jobs/6367112249#step:13:21), in execute
    result = await self._execute(
  File "/home/runner/.local/share/virtualenvs/ledger-subquery-lKQxoaNQ/lib/python3.9/site-packages/gql/client.py", line 1111, in _execute
    self.client.validate(document)
  File "/home/runner/.local/share/virtualenvs/ledger-subquery-lKQxoaNQ/lib/python3.9/site-packages/gql/client.py", line 149, in validate
    raise validation_errors[0]
graphql.error.graphql_error.GraphQLError: Cannot query field 'attributes' on type 'Event'. Did you mean 'eventAttributes'?

GraphQL request:12:29
11 |                             }
12 |                             attributes {
   |                             ^
13 |                                 nodes {

----------------------------------------------------------------------
Ran 44 tests in 339.[23](https://github.com/fetchai/ledger-subquery/actions/runs/3749061679/jobs/6367112249#step:13:24)0s

FAILED (errors=2)

Contract upload (and/or download) related errors

This one is pretty new as far as I can recall but is effecting old tests. I suspect it may have been introduced in or around #187. Perhaps, specifically, commit 6ad556bdfa0b3bdfca5145ec54e0b625d08cdba4.

======================================================================
ERROR: setUpClass (entities.test_legacy_bridge_swap.TestContractSwap)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ledger-subquery/ledger-subquery/tests/e2e/entities/test_legacy_bridge_swap.py", line 25, in setUpClass
    cls._contract._store()
  File "/home/runner/work/ledger-subquery/ledger-subquery/tests/helpers/contracts.py", line 161, in _store
    return self.store(self.admin, self.gas_limit)
  File "/home/runner/.local/share/virtualenvs/ledger-subquery-lKQxoaNQ/lib/python3.9/site-packages/cosmpy/aerial/contract/__init__.py", line 169, in store
    submitted_tx = prepare_and_broadcast_basic_transaction(
  File "/home/runner/.local/share/virtualenvs/ledger-subquery-lKQxoaNQ/lib/python3.9/site-packages/cosmpy/aerial/tx_helpers.py", line 178, in wait_to_complete
    self._response.ensure_successful()
  File "/home/runner/.local/share/virtualenvs/ledger-subquery-lKQxoaNQ/lib/python3.9/site-packages/cosmpy/aerial/tx_helpers.py", line [96](https://github.com/fetchai/ledger-subquery/actions/runs/3758060514/jobs/6385954111#step:15:97), in ensure_successful
    raise BroadcastError(self.hash, self.raw_log)
cosmpy.aerial.exceptions.BroadcastError: failed to execute message; message index: 0: Null/Nil argument: wasm: create wasm contract failed

----------------------------------------------------------------------
Ran 36 tests in 211.813s

FAILED (errors=5)

Order related errors

I recall these first appearing shortly after the order by features where already merged. As far as I can recall, this one has only ever looked exactly like this, no variation as the other classes (i.e. always '9' not less than or equal to '10' ... on the Cw20BalanceChange entity test). Annecdotally (see: #198, specifically b357a9), deleting the respective actions cache may have an effect in this case.

======================================================================
FAIL: test_retrieve_balance_change (entities.test_CW20_balance_change.TestCw20BalanceChange) [order by block height ascending]
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ledger-subquery/ledger-subquery/tests/e2e/entities/test_CW20_balance_change.py", line 246, in test_retrieve_balance_change
    orderAssert(
AssertionError: '10' not greater than or equal to '9' : OrderAssertError: order of objects is incorrect

======================================================================
FAIL: test_retrieve_balance_change (entities.test_CW[20](https://github.com/fetchai/ledger-subquery/actions/runs/3758060514/jobs/6387093094#step:15:21)_balance_change.TestCw20BalanceChange) [order by block height descending]
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ledger-subquery/ledger-subquery/tests/e2e/entities/test_CW20_balance_change.py", line 246, in test_retrieve_balance_change
    orderAssert(
AssertionError: '9' not less than or equal to '10' : OrderAssertError: order of objects is incorrect

----------------------------------------------------------------------
Ran 48 tests in 337.486s

FAILED (failures=2)

Empty table errors

This is another older one.

======================================================================
FAIL: test_claim_rewards (entities.test_delegation_reward_claim.TestDelegation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ledger-subquery/ledger-subquery/tests/e2e/entities/test_delegation_reward_claim.py", line 47, in test_claim_rewards
    self.assertIsNotNone(
AssertionError: unexpectedly None : 
DBError: table is empty - maybe indexer did not find an entry?

======================================================================
FAIL: test_contract_swap (entities.test_legacy_bridge_swap.TestContractSwap)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ledger-subquery/ledger-subquery/tests/e2e/entities/test_legacy_bridge_swap.py", line 43, in test_contract_swap
    self.assertIsNotNone(
AssertionError: unexpectedly None : 
DBError: table is empty - maybe indexer did not find an entry?

======================================================================
FAIL: test_native_transfer (entities.test_native_transfer.TestNativeTransfer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ledger-subquery/ledger-subquery/tests/e2e/entities/test_native_transfer.py", line 39, in test_native_transfer
    self.assertIsNotNone(
AssertionError: unexpectedly None : 
DBError: table is empty - maybe indexer did not find an entry?

----------------------------------------------------------------------
Ran 42 tests in [19](https://github.com/fetchai/ledger-subquery/actions/runs/3584098635/jobs/6030334598#step:14:20)0.070s

FAILED (failures=3)

Authz

I discovered this one when looking through old CI logs, it seems to have presented (multiple times) during work on the Authz executions handler (#147). I also spotted it in #173.

======================================================================
FAIL: test_exec_delegate (entities.test_authz_exec.TestAuthzExec)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ledger-subquery/ledger-subquery/tests/e2e/entities/test_authz_exec.py", line 112, in test_exec_delegate
    self.assertEqual(len(messages), self.expected_msgs_len)
AssertionError: 7 != 5

----------------------------------------------------------------------
Ran 35 tests in 178.[9](https://github.com/fetchai/ledger-subquery/actions/runs/3280709924/jobs/5401829244#step:11:10)53s

FAILED (failures=1)

Steps To Reproduce

Open a PR and wait for CI to fail. :sweat_smile:

If CI passes on any given run you can get it it to again by pushing an empty commit:

git commit --allow-empty -m 'poke CI'
git push

Context

This seems pretty consistent in CI in my experience but can also happen when running everything locally as well. I can't recall definitively if we've observed this happening in production (i.e. kubernetes as opposed to docker compose) but it is something we've been checking for in a sanity check which goes in between sync completion and switching the routing for the public DNS at the end of our deployment operation pipeline. In practice this entails running a GQL query which requests the total count of each entity. This at least ensures that the tables themselves were all present when the graphql-engine (labeled "api" in helm/kubernetes files) started. Because graphql type/entity fields map to DB table columns and column and table creation happen are in the same atomic operation, we don't have to check that all fields are included in the resulting graphql schema. However, this would be fairly trivial via the protocol introspection integration included in the playground UI.

Failure Logs

Inlined with "Current Behavior"