filecoin-station / spark

💥 Storage Provider Retrieval Checker as a Filecoin Station Module 🛰️
https://filspark.com

Report retrieval network errors #43

Closed · bajtos closed this issue 1 week ago

bajtos commented 7 months ago

We are seeing measurements where start_at and finished_at are filled correctly, but other fields are not.

We think these measurements are created when the request fetch('ipfs://...') fails before receiving response headers.

It would be great to get more insights into this case.
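
A minimal sketch of the control flow that could produce such measurements (my assumption for illustration; the real logic lives in lib/spark.js):

const url = 'ipfs://bafybeigkolhanv5xbu3pwmpfuzvu6pia4qtmtesahti32uzecg2oemqu3a'
const stats = { startAt: new Date(), firstByteAt: null, endAt: null, statusCode: null }
try {
  const res = await fetch(url)
  stats.statusCode = res.status
  stats.firstByteAt = new Date() // response headers received
  // ...read and checksum the response body...
} catch (err) {
  // fetch rejected before any response headers arrived:
  // statusCode and firstByteAt remain null
} finally {
  stats.endAt = new Date() // finished_at is still recorded
}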

bajtos commented 7 months ago

Currently, further error details are only printed to the operator's console; see https://github.com/filecoin-station/spark/blob/a6098496c67d016794dd98af5b08d33bcf6e8157/lib/spark.js#L151

Looking at https://developer.mozilla.org/en-US/docs/Web/API/fetch, only the "fetch throws a network error" condition makes sense here (unless we somehow provided invalid arguments to fetch).

It's not clear whether such a network error is caused by the client, the server, or either. MDN says the following, which hints at client-side causes, but both remain possible:

A fetch() promise only rejects when a network error is encountered (which is usually when there's a permissions issue or similar)

Here are a few examples from Node.js:

> await fetch('http://127.0.0.1:123/').catch(err => console.log(err.cause?.message ?? err.message))
bad port

> await fetch('http://invalid.example.com/').catch(err => console.log(err.cause?.message ?? err.message))
getaddrinfo ENOTFOUND invalid.example.com

> await fetch('http://127.0.0.1:9999/').catch(err => console.log(err.cause?.message ?? err.message))
connect ECONNREFUSED 127.0.0.1:9999

Let's detect these well-known causes and report them in a machine-readable form in the measurements. For example, we can invent new HTTP status codes for this.
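One possible shape for that mapping, as a sketch only (the numeric codes below are invented for illustration; they are not the codes Spark eventually shipped):

// Map well-known fetch error causes to invented, machine-readable status codes.
// All numbers here are hypothetical placeholders.
function mapFetchErrorToStatusCode (err) {
  const msg = err.cause?.message ?? err.message
  if (msg.includes('ENOTFOUND')) return 901    // DNS lookup failed
  if (msg.includes('ECONNREFUSED')) return 902 // TCP connection refused
  if (msg.includes('bad port')) return 903     // blocked/invalid port
  return 900 // network error with an unrecognized cause
}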

However: Spark makes fetch requests with ipfs:// URLs; these requests are routed to the Lassie HTTP daemon running inside Zinnia. It's unclear under which conditions such a request would fail with a network error.

Slack discussion: https://filecoinproject.slack.com/archives/C03PVLW721L/p1701174267028939

bajtos commented 7 months ago

Hmm, this is more complex than we thought. AFAICT, Spark always reports end_at; see https://github.com/filecoin-station/spark/blob/a6098496c67d016794dd98af5b08d33bcf6e8157/lib/spark.js#L101

And yet we have measurements where it’s set to Date(0) (see the screenshot):

[Screenshot 2023-11-29 at 16:32:50]
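
One plausible explanation (an assumption on my part, not confirmed above): if the checker submits endAt: null and the ingestion side coerces the value with the Date constructor, JavaScript turns null into the Unix epoch, which renders as Date(0):

> new Date(null).toISOString()
"1970-01-01T00:00:00.000Z"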
bajtos commented 7 months ago

In my Station module logs, I see that my SPARK checker is reporting measurements with end_at set to null.

[2023-11-27T17:24:47Z INFO  module:spark/main] Fetching ipfs://bafybeigkolhanv5xbu3pwmpfuzvu6pia4qtmtesahti32uzecg2oemqu3a?protocols=graphsync&providers=%2Fip4%2F138.113.9.192%2Ftcp%2F34561%2Fp2p%2F12D3KooWBiq7qeeN6D7D8iu8vomPjZEPmUD9deKn8UkHb8Kedcv8...

[2023-11-27T17:25:47Z ERROR module:spark/main] Failed to fetch ipfs://bafybeigkolhanv5xbu3pwmpfuzvu6pia4qtmtesahti32uzecg2oemqu3a?protocols=graphsync&providers=%2Fip4%2F138.113.9.192%2Ftcp%2F34561%2Fp2p%2F12D3KooWBiq7qeeN6D7D8iu8vomPjZEPmUD9deKn8UkHb8Kedcv8
[2023-11-27T17:25:47Z ERROR module:spark/main] DOMException: The signal has been aborted
[2023-11-27T17:25:47Z INFO  module:spark/main] Submitting measurement...
[2023-11-27T17:25:47Z INFO  module:spark/main] {
      sparkVersion: "1.6.0",
      zinniaVersion: "0.15.0",
      cid: "bafybeigkolhanv5xbu3pwmpfuzvu6pia4qtmtesahti32uzecg2oemqu3a",
      providerAddress: "/ip4/138.113.9.192/tcp/34561/p2p/12D3KooWBiq7qeeN6D7D8iu8vomPjZEPmUD9deKn8UkHb8Kedcv8",
      protocol: "graphsync",
      timeout: true,
      startAt: 2023-11-27T17:24:47.388Z,
      firstByteAt: null,
      endAt: null,
      carTooLarge: false,
      byteLength: 0,
      carChecksum: null,
      statusCode: null,
      participantAddress: "f410..."
    }

[2023-11-27T17:25:48Z INFO  module:spark/main] Measurement submitted (id: 85714174)
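
For context: "The signal has been aborted" is the DOMException fetch throws when an AbortSignal fires, and the two log timestamps are exactly 60 seconds apart, which suggests a timeout-style abort. A minimal sketch of how that can play out (assuming the timeout is implemented with an AbortController; not necessarily how spark.js does it):

const url = 'ipfs://bafybeigkolhanv5xbu3pwmpfuzvu6pia4qtmtesahti32uzecg2oemqu3a'
const controller = new AbortController()
const timer = setTimeout(() => controller.abort(), 60_000) // abort after 60s
try {
  await fetch(url, { signal: controller.signal })
} catch (err) {
  // surfaces as "DOMException: The signal has been aborted" in Zinnia, per the log above
  console.error(err)
} finally {
  clearTimeout(timer)
}

If endAt is only assigned after the fetch settles successfully, the abort path would skip that assignment and the submitted measurement would carry endAt: null, matching the log above.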
bajtos commented 7 months ago

Over 40% of retrievals fail with an unknown error. Possibly because fetch('ipfs://…') threw an error, which would mean there was a network error between the Zinnia module and the built-in Lassie HTTP server. Maybe rusty-lassie crashes on some systems and zinniad does not handle this properly?

bajtos commented 1 week ago

https://github.com/filecoin-station/spark/releases/tag/v1.13.0

juliangruber commented 1 week ago

Thank you for the report here. Let's see how things change with the new Spark version.

And indeed, it's a bit tricky to infer from the data exactly which error happened.

bajtos commented 1 week ago

All changes are live, and the Retrieval Status chart shows the new descriptive result codes now. I am closing this issue as done.

[Screenshot 2024-07-10 at 12:10:20]