filecoin-project / go-data-transfer

Data Transfer Shared Component for go-filecoin & go-lotus

racy connection management makes channel stall #180

Open raulk opened 3 years ago

raulk commented 3 years ago

Running the whitenoise tests and using an interruption policy of 0.2/1s (20% probability that the connection will be interrupted every 1s), the opening of the push channel seems to block forever.
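For readers unfamiliar with the notation, a policy of 0.2/1s can be read as one independent coin flip per tick. The sketch below is only an illustration of that reading, not whitenoise's actual interruptor; the function names and the seeded generator are assumptions made for this example.

```go
package main

import (
	"fmt"
	"math/rand"
)

// shouldInterrupt reports whether the connection would be closed on this
// tick, given a per-tick probability p in [0, 1].
// Hypothetical helper, not taken from whitenoise.
func shouldInterrupt(rng *rand.Rand, p float64) bool {
	return rng.Float64() < p
}

// countInterruptions simulates the given number of ticks with a seeded
// generator and returns how many of them triggered an interruption.
func countInterruptions(seed int64, p float64, ticks int) int {
	rng := rand.New(rand.NewSource(seed))
	n := 0
	for i := 0; i < ticks; i++ {
		if shouldInterrupt(rng, p) {
			n++
		}
	}
	return n
}

func main() {
	// With p = 0.2 and one tick per second, roughly one tick in five
	// closes the connection, at irregular intervals.
	fmt.Println("interruptions in 10000 ticks:", countInterruptions(1, 0.2, 10000))
}
```

Under this reading the gaps between interruptions are irregular, which is consistent with the uneven spacing of the "interruptor closing connection" lines in the output below.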

This is serious: the system cannot make progress, and the sender is effectively stuck because go-data-transfer never returns control to the caller.
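Until the underlying race is understood, a caller could at least regain control by bounding the open call with a deadline. The following is a minimal sketch of that generic pattern, not go-data-transfer's API: `openFunc`, `openWithDeadline`, `timedOut`, and `runDemo` are hypothetical names, and the stalled open is simulated.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// openFunc is a hypothetical stand-in for a blocking "open channel" call.
type openFunc func(ctx context.Context) error

// openWithDeadline runs open in its own goroutine and returns as soon as
// either open finishes or the timeout elapses, whichever comes first.
func openWithDeadline(open openFunc, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan error, 1) // buffered so a late result never blocks
	go func() { done <- open(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("open channel: %w", ctx.Err())
	}
}

// timedOut reports whether err was caused by the deadline expiring.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

// runDemo exercises one simulated stalled open and one healthy open.
func runDemo() (stalledErr, okErr error) {
	stalled := func(ctx context.Context) error {
		<-make(chan struct{}) // simulates the stall: never returns, ignores ctx
		return nil
	}
	healthy := func(ctx context.Context) error { return nil }
	return openWithDeadline(stalled, 50*time.Millisecond),
		openWithDeadline(healthy, 50*time.Millisecond)
}

func main() {
	stalledErr, okErr := runDemo()
	fmt.Println("stalled open:", stalledErr)
	fmt.Println("healthy open:", okErr)
}
```

This only masks the symptom. If the blocked call ignores its context, the goroutine running it is leaked, so it is a workaround for the caller rather than a fix for the stall itself.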

Output:

Mar 26 15:14:26.862276  INFO    2.9067s    MESSAGE << receiver[000] (a06fd2) >> all networks configured
Mar 26 15:14:26.862369  INFO    2.9067s    MESSAGE << receiver[000] (a06fd2) >> transfer starting
Mar 26 15:14:26.862412  INFO    2.9067s    MESSAGE << receiver[000] (a06fd2) >> we are the receiver
Mar 26 15:14:26.865175  INFO    2.9097s    MESSAGE << sender[000] (154eca) >> all networks configured
Mar 26 15:14:26.865301  INFO    2.9097s    MESSAGE << sender[000] (154eca) >> transfer starting
Mar 26 15:14:26.865394  INFO    2.9097s    MESSAGE << sender[000] (154eca) >> we are the sender
Mar 26 15:14:30.136669  INFO    6.1811s    MESSAGE << sender[000] (154eca) >> import took: 3.27130671s
Mar 26 15:14:32.643670  INFO    8.6882s    MESSAGE << sender[000] (154eca) >> interruptor closing connection ------------
Mar 26 15:14:32.643786  INFO    8.6884s    MESSAGE << sender[000] (154eca) >> opening the push data channel
Mar 26 15:14:34.650872  INFO    10.6954s    MESSAGE << sender[000] (154eca) >> interruptor closing connection ------------
Mar 26 15:14:37.668708  INFO    13.7132s    MESSAGE << sender[000] (154eca) >> interruptor closing connection ------------
Mar 26 15:14:44.188653  INFO    20.2332s    MESSAGE << sender[000] (154eca) >> interruptor closing connection ------------
Mar 26 15:14:48.857653  INFO    24.9022s    MESSAGE << sender[000] (154eca) >> interruptor closing connection ------------

Here are three goroutine traces 2 minutes apart from one another:

stall.zip

dirkmc commented 3 years ago

If we fail to even open the channel, the expected behaviour is that the transfer fails immediately (it doesn't try to restart).

Once the channel is open (once we receive an Accept from the other side) it should attempt restarts.
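The expected behaviour described above can be summarised as a small decision rule. This is a sketch of the rule as stated in this comment, not the library's implementation; the state names, the action names, and the `maxRestarts` limit are assumptions introduced for illustration.

```go
package main

import "fmt"

// channelState is a hypothetical, simplified view of a channel's lifecycle.
type channelState int

const (
	stateRequested channelState = iota // open request sent, no Accept received yet
	stateAccepted                      // Accept received from the other side
)

// action is what the transfer should do when the connection drops.
type action int

const (
	actionFail    action = iota // give up on the transfer
	actionRestart               // attempt to restart the channel
)

// onDisconnect encodes the rule: before Accept, fail immediately; after
// Accept, restart until the (assumed) restart budget is used up.
func onDisconnect(s channelState, restartsUsed, maxRestarts int) action {
	if s == stateAccepted && restartsUsed < maxRestarts {
		return actionRestart
	}
	return actionFail
}

func main() {
	fmt.Println(onDisconnect(stateRequested, 0, 3) == actionFail)   // dropped before Accept
	fmt.Println(onDisconnect(stateAccepted, 1, 3) == actionRestart) // dropped after Accept
}
```

Note that neither branch of this rule blocks indefinitely, so a stall while opening the channel would fall outside the expected behaviour.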

Note also that you have to explicitly set the config to enable reconnect behaviour; see the config in lotus: https://github.com/filecoin-project/lotus/blob/885ecb97ad631fc64f538034390648e4da69966c/node/modules/client.go#L126-L140

raulk commented 3 years ago
  1. What I'm observing is that the opening blocks entirely -- it does not fail immediately.
  2. Whitenoise is already setting the retry params: https://github.com/raulk/whitenoise/blob/master/testplan/main.go#L105-L113