Libp2p randomly throws unhandled error

EmiM commented 8 months ago

Libp2p sometimes throws an exception that the user sees as "abnormal backend termination" (like all other unhandled backend errors).

Our hypothesis is that this is caused by our websocketOverTor transport: Tor changes the transport's behavior, and libp2p is not prepared for that and does not handle the error properly.

~We can probably just ignore this particular error for now.~

Errors I got:

node:internal/event_target:1006
  process.nextTick(() => { throw err; });
                           ^

Error: stream ended before 1 bytes became available
    at eval (webpack://@quiet/backend/./node_modules/it-reader/dist/src/index.js?:59:33)
    at async Object.next (webpack://@quiet/backend/./node_modules/@libp2p/multistream-select/dist/src/multistream.js?:63:27)
    at async abortable (webpack://@quiet/backend/./node_modules/@libp2p/multistream-select/node_modules/abortable-iterator/dist/src/index.js?:38:26)
    at async decoder (webpack://@quiet/backend/./node_modules/it-length-prefixed/dist/src/decode.js?:37:26)
    at async first (webpack://@quiet/backend/./node_modules/it-first/index.js?:11:20)
    at async eval (webpack://@quiet/backend/./node_modules/@libp2p/multistream-select/dist/src/multistream.js?:75:243)
    at async read (webpack://@quiet/backend/./node_modules/@libp2p/multistream-select/dist/src/multistream.js?:75:17)
    at async Module.readString (webpack://@quiet/backend/./node_modules/@libp2p/multistream-select/dist/src/multistream.js?:85:17)
    at async Module.select (webpack://@quiet/backend/./node_modules/@libp2p/multistream-select/dist/src/select.js?:38:20)
    at async ConnectionImpl.newStream [as _newStream] (webpack://@quiet/backend/./node_modules/libp2p/dist/src/upgrader.js?:336:50) {
  code: 'ERR_UNDER_READ',
  buffer: Uint8ArrayList { bufs: [], length: 0 }
}

node:internal/event_target:1006
  process.nextTick(() => { throw err; });
                           ^

AbortError: The operation was aborted
    at nextAbortHandler (webpack://@quiet/backend/./node_modules/libp2p/node_modules/abortable-iterator/dist/src/index.js?:34:32)
    at EventTarget.abortHandler (webpack://@quiet/backend/./node_modules/libp2p/node_modules/abortable-iterator/dist/src/index.js?:21:17)
    at [nodejs.internal.kHybridDispatch] (node:internal/event_target:731:20)
    at EventTarget.dispatchEvent (node:internal/event_target:673:26)
    at abortSignal (node:internal/abort_controller:308:10)
    at TimeoutController.abort (node:internal/abort_controller:338:5)
    at TimeoutController.abort (webpack://@quiet/backend/./node_modules/timeout-abort-controller/index.js?:26:18)
    at eval (webpack://@quiet/backend/./node_modules/timeout-abort-controller/index.js?:16:38)
    at Retimer._timerWrapper (webpack://@quiet/backend/./node_modules/retimer/retimer.js?:21:18)
    at listOnTimeout (node:internal/timers:564:17) {
  type: 'aborted',
  code: 'ABORT_ERR'
}

holmesworcester commented 8 months ago

If we ignore the error does libp2p continue to work properly?

EmiM commented 7 months ago

OK, ignoring the error is not recommended. An unhandled exception leaves the process in an undefined state. https://nodejs.org/api/process.html#warning-using-uncaughtexception-correctly
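The Node.js documentation linked above treats `'uncaughtException'` as a last-resort hook for synchronous cleanup before the process shuts down, not as a way to resume. A minimal sketch of that pattern follows. `ProcessLike`, `installCrashHandler`, and the `cleanup` callback are hypothetical names for illustration, not existing Quiet or libp2p APIs; the interface only mirrors the two members of Node's `process` that the sketch uses, so it can be tried with a fake.

```typescript
// Hypothetical minimal view of Node's `process`: only the members this sketch needs.
interface ProcessLike {
  on(event: 'uncaughtException', listener: (err: Error) => void): unknown
  exit(code?: number): void
}

// Installs a last-resort handler: run synchronous cleanup, then exit with a
// non-zero code so that whatever supervises the backend can restart it.
function installCrashHandler(proc: ProcessLike, cleanup: (err: Error) => void): void {
  proc.on('uncaughtException', (err: Error) => {
    try {
      cleanup(err) // must be synchronous, e.g. writing a log line
    } finally {
      proc.exit(1) // do not resume: the process state is undefined at this point
    }
  })
}
```

In a real backend this would presumably be called once at startup with Node's own `process`, and the restart itself would be done by the parent that spawned the backend.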

Probably the only real solution would be forking libp2p and handling those errors ourselves.

Workarounds:

Restart the backend without closing the app. iOS already does that, so it's possible.
Add (bring back?) an error modal that appears on 'abnormal backend termination', with a button for restarting the application. This would be better from a UX perspective than throwing a JS error in the user's face.

Moving this task to blocked until we decide what to do.

holmesworcester commented 7 months ago

Restarting backend seems like a good workaround.

All errors would trigger a restart?

holmesworcester commented 7 months ago

Moved out of blocked since it seems like we have a plan.

EmiM commented 7 months ago

> All errors would trigger a restart?

All unhandled errors.

leblowl commented 7 months ago

Thanks for finding this! Some initial questions come to mind. Do we feel fairly confident in the cause of the error?

Then looking at this part of the error:

Error: stream ended before 1 bytes became available ... ConnectionImpl.newStream

I'm not entirely sure I understand what is happening, but I think it makes sense that sometimes a stream wouldn't produce data quickly enough because Tor is slow. If a stream isn't receiving data, can we simply recreate the stream and retry until we do receive data? Does libp2p recreate the stream for us?
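To make the "recreate and retry" idea concrete, here is a rough sketch of a retry wrapper with exponential backoff. It assumes the failure reaches the caller as a rejected promise, which the stack traces above suggest is not always the case, so treat it as an illustration only. `withRetry`, `isRetryable`, and `sleep` are hypothetical helpers; the two error codes are the ones seen in the traces above, and the attempt count and delay are arbitrary.

```typescript
// Error codes taken from the stack traces in this issue.
const RETRYABLE_CODES: string[] = ['ERR_UNDER_READ', 'ABORT_ERR']

// True when the error carries one of the codes we consider transient.
function isRetryable(err: unknown): boolean {
  if (typeof err !== 'object' || err === null) return false
  const code = (err as { code?: unknown }).code
  return typeof code === 'string' && RETRYABLE_CODES.indexOf(code) !== -1
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => { setTimeout(() => resolve(), ms) })
}

// Calls `openStream` until it succeeds, the error is not retryable,
// or `maxAttempts` is reached. The delay doubles after each failure.
async function withRetry<T>(
  openStream: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 500
): Promise<T> {
  let lastError: unknown = new Error('withRetry: maxAttempts must be at least 1')
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await openStream()
    } catch (err) {
      lastError = err
      if (!isRetryable(err) || attempt === maxAttempts) break
      await sleep(baseDelayMs * Math.pow(2, attempt - 1))
    }
  }
  throw lastError
}
```

The callback would wrap whatever call opens the stream, for example the `newStream` call that appears in the first trace. Whether libp2p already retries internally is still an open question.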

Other thoughts:

Restarting the backend could be a good quick fix, but it also seems like an expensive thing to do. We might be able to find a fix closer to where the error occurs. I see several layers of solutions:

prevent the error
recover from the error by restarting a stream or connection
recover from the error by restarting libp2p
recover from the error by restarting backend

Of course there are a lot more options, but that is an example of some layers to present an idea.
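As a sketch of how those layers could be chained, the snippet below tries the cheapest recovery first and escalates only when a layer fails. `RecoveryLayer` and `recover` are hypothetical names, and the actual restart logic for a stream, libp2p, or the backend is left to the caller because it is not defined anywhere in this thread.

```typescript
// One recovery strategy; `attempt` resolves to true when recovery succeeded.
interface RecoveryLayer {
  name: string
  attempt: () => Promise<boolean>
}

// Tries layers in the given order (cheapest first) and returns the name of
// the first one that recovers, or null if every layer failed.
async function recover(layers: RecoveryLayer[]): Promise<string | null> {
  for (const layer of layers) {
    try {
      if (await layer.attempt()) return layer.name
    } catch (_err) {
      // A layer that throws is treated like one that failed: escalate.
    }
  }
  return null
}
```

Passing the layers as `[stream, libp2p, backend]` would mirror the ordering in the list above, with the backend restart as the last resort.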

holmesworcester commented 7 months ago

Our decision: we will restart orbitdb/libp2p when we have an unhandled exception and we will spend 1 day investigating the errors we see to understand them and possibly find a fix.

EmiM commented 6 months ago

I was wondering whether debugging it or handling it in a special way makes sense right now:

We are using an older version of libp2p, so I feel like debugging may be a waste of time.
Restarting services may require handling some edge cases where something could go wrong.
We are in the process of planning an architecture that may lead to serious changes in the backend anyway.
The error occurs relatively rarely.

So maybe just show a pretty error modal to the user (with a stack trace) and ask them to restart the app? @holmesworcester what do you think?

holmesworcester commented 6 months ago

For me the error occurs very consistently when I leave Quiet running. It takes a while but it always happens eventually.

Could we restart the backend automatically when it happens?

EmiM commented 6 months ago

If you are talking about the whole backend, then maybe? I don't know if the frontend already handles the case where the backend is being restarted and the user performs actions at the same time, e.g. tries to send a message or create a channel.

holmesworcester commented 6 months ago

It happens rarely enough that if we could temporarily freeze the frontend and show some "Restarting backend..." message, that would not be disruptive.
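One possible shape for this on the frontend side is a small gate that holds user actions while the backend restarts and forwards them once it is ready again. This is only a sketch: `BackendGate` and its methods are hypothetical, and buffering actions (rather than rejecting them while the message is shown) is an assumption, not something we have decided.

```typescript
// Holds actions while the backend is restarting and replays them afterwards.
class BackendGate<A> {
  private restarting = false
  private pending: A[] = []
  private readonly send: (action: A) => void

  constructor(send: (action: A) => void) {
    this.send = send
  }

  // The UI can use this to decide whether to show "Restarting backend...".
  isRestarting(): boolean {
    return this.restarting
  }

  markRestarting(): void {
    this.restarting = true
  }

  // Forwards the action immediately, or queues it during a restart.
  submit(action: A): void {
    if (this.restarting) this.pending.push(action)
    else this.send(action)
  }

  // Called when the backend is back: flush queued actions in original order.
  markReady(): void {
    this.restarting = false
    const queued = this.pending
    this.pending = []
    for (const action of queued) this.send(action)
  }
}
```

With something like this, a message sent during the restart would be delayed rather than lost.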

Fixing the problem is preferable, but there might be other problems that emerge in the future and it would be great if our backend is "self healing" when it encounters an error.

Also, does orbitdb need ipfs to be running to add and remove data? Can we wrap ipfs or libp2p and restart it when we catch an error?

What is the cost of attempting to upgrade OrbitDB now? Should we try it, see how hard it is, and choose to restart the backend if it's too hard?

holmesworcester commented 4 months ago

Let's revisit this after we upgrade libp2p.

TryQuiet / quiet

Libp2p randomly throws unhandled error #2055