feat(testing): decide infra started for testing to prevent IBC/XCM regressions and improve CVM delivery speed

dzmitry-lahoda commented 10 months ago

Motivation

With current approach can run and debug automatically just couple of tests on CVM/MANTIS tests, until some infra setuped, expanding and debugging will be pain. So it is pain already, requires manual layer tracing. Cannot SSH into CI runner processes, runners has no good way to handle dependencies and aborts.

Need more structured approach.

Suggested Solution

Need next reproducible setup:

can test local(linux at least) and in cloud
network - container running networks
test-layer0 - https://github.com/ComposableFi/composable/issues/4404
test -layet1 - CVM test wich uses network run after layer-0
controller - in case of network failures, abort all, start test only after chains produce blocks

good API to report error reason to CI (ci will call these and wait)

need ability to connect to outputs of running test per process for ease of debug, not big ball of text in zip wirh everything intermingled like it now.

after run can access ALL artefacts from run (logs, chain states, configs)

possible tecks:

testcontainers
microvm.nix(local) or podman nix + pulumi(dynamic resource rollup and down)
some k8s things????

Alternatives

Continue manual QA insurance, which is time consuming and easy breakable by people who do not test all pieces

Additional Information

https://composableprotocol.slack.com/archives/C05EMF8DYKU/p1704646575980879

Given my experience testing, there is no other option to have sane tests

dzmitry-lahoda commented 10 months ago

cc @rjonczy

blasrodri commented 10 months ago

Why will be it painful? We should be good to go if we had a way to ensure:

there's connectivity between the components
proper logging
component metrics (cpu consumption, ram, etc.)

dzmitry-lahoda commented 10 months ago

because cannot and/or hard debug failures until runtime is structured and can access what is going on in realtime and after end

dzmitry-lahoda commented 10 months ago

@blasrodri here is current script for test

https://github.com/ComposableFi/composable/blob/c2fc00ac1a4ab89374d94fd08cf2440bcba1fbfd/tests/flake-module.nix#L35

During 1 year it existing, only I was able to write non brittle and non flaky tests again Picasso and ComposableCosmos runtime. 2 QA engineers failed.

Also senior developers were unable to analyze reasons of failures.

From here my conclusion that running test inside bash script on CI is not enough for many reason nor I can bear any decision about my own on infra prices.

Just for my thing I tell it is hard too, just slow, but yet need for me and as per reports by others (devnet to work need to be automated - or Kostya follow more elaborated release testing procedure which may take a day).

The list is you provied, is good one but not enough, neither context is clear about accessibility of instruments, nor level of connectivity tested (which I consider to be https://github.com/ComposableFi/composable/issues/4404 already here by Yasin/Gloria at some fork easy to package for reuse and Yasin will be able CI these easy - so if Yasin will not be able to do so - it must be hard)

rjonczy commented 10 months ago

@dzmitry-lahoda As I understand, this e2e tests will run on ci.

How about we:

run on temporary vms

create a dedicated cloud project, where we provision infrastructure temporarily
deploy stuff
run e2e tests
generate report and upload somewhere
destroy infrastructure after running tests

or

run tests on k8s cluster
- create separate cloud project
- have k8s cluster scale to 0 instances with autoscalling on
- deploy containers on cluster, which triggers automatic scale up
- run tests
- generate report and upload somewhere
- uninstall containers which triggers scale down of k8s nodes

dzmitry-lahoda commented 10 months ago

this e2e tests will run on ci.

Triggered by CI, but the problem of run on CI is problemtic (need external runners)

dzmitry-lahoda commented 10 months ago

1 and 2 - yes.

also:

important ability to access to per process/container logs as things run. access to CPU/MEM (failures of CPU and mem).
ability to express complicated dependencies of one container on other (docker-compose sucks, it is more like k8s controller like thing; in GH actions expression it is impossible). basically need to run custom code to detect failures or readyness, and abort.
usable by Yasin (so not so complicated on surface)
not heavy to maintain (because it may be so complicated so that better to buy Yasin and me own personal servers and rung things locally)
in future easy to add GPU for Solana/Ehtereym provers (yet another reason not local nor CI)

dzmitry-lahoda commented 10 months ago

@rjonczy what do you think about testcontainers or alternatives https://stackshare.io/testcontainers/alternatives ?

me some tooling build on top of k8s specicially for test flow 1. runtime access . 2 .data export after run 3. active graph of dependenies (inliduing - if block stopedd producing, stop whole test, so that it is clear that CVM is not buggy)

so this is subject of issue - what is best way to do it?

dzmitry-lahoda commented 10 months ago

or https://stackshare.io/argo

dzmitry-lahoda commented 10 months ago

looks nice https://github.com/argoproj/argo-workflows?tab=readme-ov-file#features

dzmitry-lahoda commented 10 months ago

I think next:

things as is retained, only frustration resistant people can debug, slowly and hardly, CI tests. amount of tests are limited. Yasin tests are not running until he able to understand how to stabilize.
i start slowly writing pulumi/aws/nix/microvms like things, just to make things little bit better, so slowly, only when I really need it. or Yasing can write:)
naked k8s or vms. need to write orchestrator on top. I am good with that, but it will take time.
given argo cd feature list, seems it already built on top of 3. so me and yasin easily can use it.

so I will stick 1 and small parts of 2 if blocked.

if @blasrodri and @rjonczy would decide to help, 4 imho best one

cc @kollegian

dzmitry-lahoda commented 10 months ago

I almost happy with argocd https://argo-workflows.readthedocs.io/en/latest/running-nix/

Remaining part for me zero cost containers builds (basically container should be build on top of exisitng binaries in second)

dzmitry-lahoda commented 10 months ago

https://composableprotocol.slack.com/archives/C05KUR705DJ/p1704821010913439

dzmitry-lahoda commented 10 months ago

Will go with whatever faster to get.

ComposableFi / composable

feat(testing): decide infra started for testing to prevent IBC/XCM regressions and improve CVM delivery speed #4405