filecoin-project / venus

Filecoin Full Node Implementation in Go
https://venus.filecoin.io
Other
2.06k stars 463 forks source link

consider replacing daemon tests with iptb #916

Closed phritz closed 5 years ago

phritz commented 6 years ago

Placeholder for @frrist to brain dump into.

frrist commented 6 years ago

Some History

I'll start by taking a quick tour of how we got to where we are now: We have been dealing with transient test failures for some time now...and by dealing with I mean opening/adding to an issue and hitting rebuild, this is creating technical debt as many of these issues are not easy to reproduce and lack contextual information. In an ideal world when a test fails the logs and metrics for it are easily accessible, sadly this is not the case. I believe this is in part due to TestDaemon being poorly designed. The TestDaemon was oringally created to address issue https://github.com/filecoin-project/go-filecoin/issues/87, and resulted in PR https://github.com/filecoin-project/go-filecoin/pull/103 being merged -- at which point it was basically a glorified wrapper around exec.Command, and to be fair we didn't need much more than that at the time. Over the past months we have added a suite of helper methods -- exec.Command calls with an assertion wrapper -- these include methods like RunSuccess, ShutdownSuccess, RunFail, etc. We then built, what I will call Actions, on top of these helper methods. These actions are typically a sequence of go-filecoin commands performed by the TestDaemon. They include methods like CreateMinerAddr, WaitForMessageRequireSuccess, MineAndPropagate, MakeDeal etc.

Alright, enqueue that and let's look at some work that has been going on in parallel.

IPTB

As a reminder from the README.md:

iptb is a program used to create and manage a cluster of sandboxed nodes on a computer. Spin up 1000s of nodes! Using iptb makes testing libp2p networks easy!

Much has been done wrt IPTB over the past 2 months, it got a major upgrade that allows it to work with go plugins, had a total rewrite of the CLI, and received new interface definitions to make it more extensible (changes can don via https://github.com/ipfs/iptb/pull/61). These changes were used to create filecoin plugins for IPTB. Currently there exists a LocalFilecoin plugin that allows filecoin to be ran locally as a system process and a DockerFilecoin plugin that allows filecoin to be controlled via a docker daemon (Note: the docker daemon can be local or a remote, IPTB does not care). As more plugins are added we will have more ways of controlling filecoin (AWSFilecoin plugin has been an idea as well as a KubernetesFilecoin plugin, we may also consider a WindowsFilecoin plugin in the future). The take away from this: IPTB is gaining momentum as tool PLer's use, and serves as a powerful interface between calling code and the plugins that implement its interface.

Okay, enqueue this and let's look at one more thing

:hankey:y Hawk

We are very close to deploying a persistent test cluster in the cloud, affectionately referred to as Kitty Hawk. One of the main goals for creating this cluster is to test filecoin in a realistic setting, in order to achieve this goal a couple things need to be figured out:

  1. How will developers interact/control the nodes in this cluster?
  2. How can we run tests/perform random actions against the nodes in this cluster?

I believe the above points boil down to roughly the same question, How are filecoin nodes in the cluster controlled? Enqueue this and let's move on to the brain dump.

Brain Dump

dequeue As was stated at the beginning of this comment, TestDaemon is poorly designed, it lacks any sort of interface, doesn't allow for observability, and isn't extensible -- all the helper methods and actions we have defined only work with it.

dequeue Much work has been done on IPTB to allow it to easily control groups of filecoin nodes, logs & metrics from the nodes are easily accessible through a common interface, it allows users to remain agnostic to where/how the filecoin nodes are ran thanks to plugins -- to drive home the point further, if someone made a plugin that allowed filecoin to be ran on windows or OSX the nodes manages by that plugin could be controlled just like any other IPTB node. (Side Note: I am aware that plugins don't work on windows(!) -- There is a work around with IPTB s.t. this doesn't matter -- it will work :smile:).

I think there is much common ground between the 2 items we just pulled from our queue -- IPTB. I believe we should move to replace the TestDaemon with IPTB immediately. IPTB is actively being developed by the IPFS community at large, and is used to test go-ipfs - and eventually js-ipfs with the addition of the BrowserIPFS plugin. I am doubtful that the filecoin team can/should develop TestDaemon to a point where it can compete with IPTB. Boom, imagine we have replaced TestDaemon and all of its actions with IPTB, now...

dequeue Kitty Hawk as sprouted fins and thrusters (:rocket:) and is beginning to take flight, however it still lacks:

As was stated earlier, IPTB has a DockerFilecoin plugin that can control filecoin nodes via the docker daemon, since the nodes in the test cluster are ran as docker containers, it follows that IPTB can be used to control them both remotely and locally. Since we just replaced TestDaemon with IPTB, running tests against this cluster will be a breeze as many of the actions needed to perform them are already written using IPTB.

Now how about the random actions against the cluster? -- Last week we created the initial Design Doc and MVP describing what this may look like -- We are calling it the Filecoin Network Randomizer FCNR for short. In this doc we describe what an "Action" is -- an activity performed against a filecoin node, we also outline Action-Preconditions and Sequence, please read the doc for a full definition.

Read the doc? -- Great! I propose we take the actions we have written above for our TestSuite and re-use them with the randomizer. Furthermore any action not already being performed should be added as an Action or Sequence to said randomizer and thus the testsuite since they share a common interface, IPTB!

Smashing this all together, we now have a common interface to control, test and randomize our filecoin nodes. What is done for testing is instantly reusable for the randomizer and what is done for the randomizer is instantly reusable for the TestSuite, further more when we need to begin testing against some other environment we write an IPTB plugin for it and it "just works" with all the testing / randomization we have already written.

dignifiedquire commented 6 years ago

Nearly none of our transient test failures that we have observed are actually related to the TestDaemon, but to the fact that some things, like connecting two nodes takes different times. So I am very doubtful that changing to IPTB will fix or change much. On 18. Sep 2018, 03:15 +0200, Frrist notifications@github.com, wrote:

Some History I'll start by taking a quick tour of how we got to where we are now: We have been dealing with transient test failures for some time now...and by dealing with I mean opening/adding to an issue and hitting rebuild, this is creating technical debt as many of these issues are not easy to reproduce and lack contextual information. In an ideal world when a test fails the logs and metrics for it are easily accessible, sadly this is not the case. I believe this is in part due to TestDaemon being poorly designed. The TestDaemon was oringally created to address issue #87, and resulted in PR #103 being merged -- at which point it was basically a glorified wrapper around exec.Command, and to be fair we didn't need much more than that at the time. Over the past months we have added a suite of helper methods -- exec.Command calls with an assertion wrapper -- these include methods like RunSuccess, ShutdownSuccess, RunFail, etc. We then built, what I will call Actions, on top of these helper methods. These actions are typically a sequence of go-filecoin commands performed by the TestDaemon. They include methods like CreateMinerAddr, WaitForMessageRequireSuccess, MineAndPropagate, MakeDeal etc. Alright, enqueue that and let's look at some work that has been going on in parallel. IPTB As a reminder from the README.md:

iptb is a program used to create and manage a cluster of sandboxed nodes on a computer. Spin up 1000s of nodes! Using iptb makes testing libp2p networks easy! Much has been done wrt IPTB over the past 2 months, it got a major upgrade that allows it to work with go plugins, had a total rewrite of the CLI, and received new interface definitions to make it more extensible (changes can don via ipfs/iptb#61). These changes were used to create filecoin plugins for IPTB. Currently there exists a LocalFilecoin plugin that allows filecoin to be ran locally as a system process and a DockerFilecoin plugin that allows filecoin to be controlled via a docker daemon (Note: the docker daemon can be local or a remote, IPTB does not care). As more plugins are added we will have more ways of controlling filecoin (AWSFilecoin plugin has been an idea as well as a KubernetesFilecoin plugin, we may also consider a WindowsFilecoin plugin in the future). The take away from this: IPTB is gaining momentum as tool PLer's use, and serves as a powerful interface between calling code and the plugins that implement its interface. Okay, enqueue this and let's look at one more thing 💩y Hawk We are very close to deploying a persistent test cluster in the cloud, affectionately referred to as Kitty Hawk. One of the main goals for creating this cluster is to test filecoin in a realistic setting, in order to achieve this goal a couple things need to be figured out:

  1. How will developers interact/control the nodes in this cluster?
  2. How can we run tests/perform random actions against the nodes in this cluster?

I believe the above points boil down to roughly the same question, How are filecoin nodes in the cluster controlled? Enqueue this and let's move on to the brain dump. Brain Dump dequeue As was stated at the beginning of this comment, TestDaemon is poorly designed, it lacks any sort of interface, doesn't allow for observability, and isn't extensible -- all the helper methods and actions we have defined only work with it. dequeue Much work has been done on IPTB to allow it to easily control groups of filecoin nodes, logs & metrics from the nodes are easily accessible through a common interface, it allows users to remain agnostic to where/how the filecoin nodes are ran thanks to plugins -- to drive home the point further, if someone made a plugin that allowed filecoin to be ran on windows or OSX the nodes manages by that plugin could be controlled just like any other IPTB node. (Side Note: I am aware that plugins don't work on windows(!) -- There is a work around with IPTB s.t. this doesn't matter -- it will work 😄). I think there is much common ground between the 2 items we just pulled from our queue -- IPTB. I believe we should move to replace the TestDaemon with IPTB immediately. IPTB is actively being developed by the IPFS community at large, and is used to test go-ipfs - and eventually js-ipfs with the addition of the BrowserIPFS plugin. I am doubtful that the filecoin team can/should develop TestDaemon to a point where it can compete with IPTB. Boom, imagine we have replaced TestDaemon and all of its actions with IPTB, now... dequeue Kitty Hawk as sprouted fins and thrusters (🚀) and is beginning to take flight, however it still lacks:

• A common interface to control the nodes within it.

• Tests to run against it
• Random actions continuously performed on it to create a realistic setting for filecoin to run in.

As was stated earlier, IPTB has a DockerFilecoin plugin that can control filecoin nodes via the docker daemon, since the nodes in the test cluster are ran as docker containers, it follows that IPTB can be used to control them both remotely and locally. Since we just replaced TestDaemon with IPTB, running tests against this cluster will be a breeze as many of the actions needed to perform them are already written using IPTB. Now how about the random actions against the cluster? -- Last week we created the initial Design Doc and MVP describing what this may look like -- We are calling it the Filecoin Network Randomizer FCNR for short. In this doc we describe what an "Action" is -- an activity performed against a filecoin node, we also outline Action-Preconditions and Sequence, please read the doc for a full definition. Read the doc? -- Great! I propose we take the actions we have written above for our TestSuite and re-use them with the randomizer. Furthermore any action not already being performed should be added as an Action or Sequence to said randomizer and thus the testsuite since they share a common interface, IPTB! Smashing this all together, we now have a common interface to control, test and randomize our filecoin nodes. What is done for testing is instantly reusable for the randomizer and what is done for the randomizer is instantly reusable for the TestSuite, further more when we need to begin testing against some other environment we write an IPTB plugin for it and it "just works" with all the testing / randomization we have already written. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

whyrusleeping commented 6 years ago

@dignifiedquire I think thats missing the point.

The point here is that iptb is easily reuseable tooling that makes working with tests simpler and more easily debuggable. So the tests we write here can be taken and run on their own, and debugged independently against nodes that you can easily query using your normal tooling (a test fails, i can check things out with go-filecoin swarm peers etc)

dignifiedquire commented 6 years ago

@whyrusleeping I do agree switching to IPTB is a good goal. But I am not convinced that missing debugability is the reason why we have our current transient test failures.

dignifiedquire commented 6 years ago

(maybe this issue just needs a different name)

whyrusleeping commented 6 years ago

Oh, i missed the issue title and just read forrests comment. That said, I can see how being able to reproduce the flaky tests in an easier to play around with environment would improve our ability to fix them.

frrist commented 5 years ago

We have have come to the conclusion that we will be replacing TestDaemon with whats described in #1612. Closing.