Open raulk opened 4 years ago
@nonsense @vyzo I edited the description to add the details provided by Hannah.
A couple of observations from running with 300 deals, both serial and concurrent with 2 miners and 3 clients.
In the serial case, after an overnight run, one client succeeded, but the other two got terminally stuck in the `StorageDealSealing` state.
In the concurrent case, almost all deals get stuck in the `StorageDealSealing` state.
@vyzo thanks for the input! We need to make this actionable so that we can investigate further. It may well be an issue on our end, possibly related to the fact that we're in catch-up mining mode and miners may be building separate chains. Could you please upload the logs from both runs?
Digging further into the concurrent stress test logs: the miners just stopped at block 155, with no errors.
We have reported the issues we found upstream: https://github.com/filecoin-project/lotus/issues/2294.
https://github.com/filecoin-project/lotus/issues/2293
https://github.com/filecoin-project/lotus/issues/2292
https://github.com/filecoin-project/lotus/issues/2291
https://github.com/filecoin-project/lotus/issues/2250
https://github.com/filecoin-project/lotus/issues/2249
What would you like us to test?
Stress test the percentage of deals that go through under adverse conditions (e.g. nodes suddenly going offline).
Technical implementation details.
Also generate a baseline that captures how the system behaves under normal/ideal conditions.
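As a rough illustration of how the "% of deals that go through" could be compared between a baseline run and an adverse run, here is a minimal Python sketch. It assumes the test plan records one final state string per deal. The helper names (`success_rate`, `compare`) and the choice of `StorageDealActive` as the success state are assumptions made for this example, not something specified in this issue.

```python
from collections import Counter

# Assumption: a deal "went through" if its final state is in this set.
# "StorageDealActive" is an illustrative choice, not taken from this thread.
SUCCESS_STATES = frozenset({"StorageDealActive"})


def success_rate(final_states, success_states=SUCCESS_STATES):
    """Return the percentage of deals whose final state is a success state."""
    if not final_states:
        raise ValueError("no deals recorded")
    succeeded = sum(1 for state in final_states if state in success_states)
    return 100.0 * succeeded / len(final_states)


def summarize(final_states):
    """Count deals per final state, to show where stuck deals pile up."""
    return Counter(final_states)


def compare(baseline_states, adverse_states):
    """Success rates for both runs, plus the drop in percentage points."""
    base = success_rate(baseline_states)
    adverse = success_rate(adverse_states)
    return {
        "baseline_pct": base,
        "adverse_pct": adverse,
        "drop_pct_points": base - adverse,
    }
```

Calling `compare` with the per-deal final states of the two runs gives both success percentages and their difference, while `summarize` shows how many deals ended in each state (for example, how many stayed in `StorageDealSealing`).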
What should we measure?
On a scale from 0-10, what's the proposed _discomfort factor_? In other words, how uncomfortable would you be if we went live without having tested this? Explain why.
TBD.
Additional remarks.
Requestor: @hannahhoward.