Closed: bajtos closed this issue 1 week ago
Currently, further error details are printed only to the operator's console, in https://github.com/filecoin-station/spark/blob/a6098496c67d016794dd98af5b08d33bcf6e8157/lib/spark.js#L151
Looking at https://developer.mozilla.org/en-US/docs/Web/API/fetch, only the error condition "If fetch throws a network error." makes sense here (unless we somehow provided invalid arguments to `fetch`).
It's not clear whether the network error can be caused by the client, the server, or both. MDN says the following, but it could still be either side:

> A `fetch()` promise only rejects when a network error is encountered (which is usually when there's a permissions issue or similar)
Here are a few examples from Node.js:
```
> await fetch('http://127.0.0.1:123/').catch(err => console.log(err.cause?.message ?? err.message))
bad port
> await fetch('http://invalid.example.com/').catch(err => console.log(err.cause?.message ?? err.message))
getaddrinfo ENOTFOUND invalid.example.com
> await fetch('http://127.0.0.1:9999/').catch(err => console.log(err.cause?.message ?? err.message))
connect ECONNREFUSED 127.0.0.1:9999
```
Let's detect these well-known causes and report them in a machine-readable form in the measurements. For example, we can invent new HTTP status codes for this:

- `900` - cannot resolve hostname (`getaddrinfo ENOTFOUND`)
- `901` - connection refused (`connect ECONNREFUSED`)

Everything else will be reported via `networkError` to allow us to discover more error cases later.

However: Spark makes fetch requests with `ipfs://` URLs; these requests are routed to the Lassie HTTP daemon running inside Zinnia. It's unclear under which conditions such a request would fail with a network error.
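A minimal sketch of the proposed mapping, assuming the Node.js-style `err.cause.code` values shown in the examples above. The helper name `getStatusCodeForNetworkError` is invented here for illustration; only the 900/901 codes come from this proposal.

```javascript
// Hypothetical helper - maps well-known network-error causes to the
// invented status codes 900/901. Everything else returns undefined so
// the caller can record the raw message via `networkError` instead,
// letting us discover more error cases later.
function getStatusCodeForNetworkError (err) {
  switch (err.cause?.code) {
    case 'ENOTFOUND':
      return 900 // cannot resolve hostname
    case 'ECONNREFUSED':
      return 901 // connection refused
    default:
      return undefined // caller reports err.cause?.message ?? err.message
  }
}
```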
Slack discussion: https://filecoinproject.slack.com/archives/C03PVLW721L/p1701174267028939
Hmm, this is more complex than we think. AFAICT, Spark is always reporting `end_at`, see here:
https://github.com/filecoin-station/spark/blob/a6098496c67d016794dd98af5b08d33bcf6e8157/lib/spark.js#L101
And yet we have measurements where it's set to `Date(0)` (see the screenshot).

In my Station module logs, I see that my SPARK checker is reporting measurements with `end_at` set to `null`.
```
[2023-11-27T17:24:47Z INFO module:spark/main] Fetching ipfs://bafybeigkolhanv5xbu3pwmpfuzvu6pia4qtmtesahti32uzecg2oemqu3a?protocols=graphsync&providers=%2Fip4%2F138.113.9.192%2Ftcp%2F34561%2Fp2p%2F12D3KooWBiq7qeeN6D7D8iu8vomPjZEPmUD9deKn8UkHb8Kedcv8...
[2023-11-27T17:25:47Z ERROR module:spark/main] Failed to fetch ipfs://bafybeigkolhanv5xbu3pwmpfuzvu6pia4qtmtesahti32uzecg2oemqu3a?protocols=graphsync&providers=%2Fip4%2F138.113.9.192%2Ftcp%2F34561%2Fp2p%2F12D3KooWBiq7qeeN6D7D8iu8vomPjZEPmUD9deKn8UkHb8Kedcv8
[2023-11-27T17:25:47Z ERROR module:spark/main] DOMException: The signal has been aborted
[2023-11-27T17:25:47Z INFO module:spark/main] Submitting measurement...
[2023-11-27T17:25:47Z INFO module:spark/main] {
  sparkVersion: "1.6.0",
  zinniaVersion: "0.15.0",
  cid: "bafybeigkolhanv5xbu3pwmpfuzvu6pia4qtmtesahti32uzecg2oemqu3a",
  providerAddress: "/ip4/138.113.9.192/tcp/34561/p2p/12D3KooWBiq7qeeN6D7D8iu8vomPjZEPmUD9deKn8UkHb8Kedcv8",
  protocol: "graphsync",
  timeout: true,
  startAt: 2023-11-27T17:24:47.388Z,
  firstByteAt: null,
  endAt: null,
  carTooLarge: false,
  byteLength: 0,
  carChecksum: null,
  statusCode: null,
  participantAddress: "f410..."
}
[2023-11-27T17:25:48Z INFO module:spark/main] Measurement submitted (id: 85714174)
```
Over 40% of retrievals fail with an unknown error. Possibly because `fetch('ipfs://…')` threw an error, which would mean there was a network error between the Zinnia module and the built-in Lassie HTTP server. Maybe rusty-lassie crashes on some systems and zinniad does not handle this properly?
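One way to make this distinction visible in the data would be to classify the rejection by error name. This is only a sketch, not Spark's actual code: the helper `classifyFetchFailure` and the labels are assumptions. The premise (an aborted request rejects with a DOMException named `AbortError`/`TimeoutError`, while genuine network errors reject with a `TypeError`) matches WHATWG fetch behaviour in Node.js.

```javascript
// Hypothetical classifier for a rejected fetch('ipfs://...') call.
function classifyFetchFailure (err) {
  if (err.name === 'AbortError' || err.name === 'TimeoutError') {
    return 'timeout' // our abort signal fired - matches timeout: true
  }
  if (err.name === 'TypeError') {
    return 'network-error' // e.g. the Lassie HTTP server was unreachable
  }
  return 'unknown'
}
```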
Thank you for the report here; let's see how things change with the new Spark version.
And indeed, it's a bit tricky to infer from the data which error happened exactly.
All changes are live, and the Retrieval Status chart shows the new descriptive result codes now. I am closing this issue as done.
We are seeing measurements where `start_at` and `finished_at` are filled correctly, but:

- `first_byte_at` and `end_at` are set to `new Date(0)`
- `status_code` is `null`
- `timeout` is `false`

We think these measurements are created when the request `fetch('ipfs://...')` fails before receiving response headers. It would be great to get more insights into this case.
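One possible explanation for the `new Date(0)` values, assuming some step in the pipeline coerces the missing timestamps to `Date` objects (that pipeline detail is a guess, not confirmed): in JavaScript, coercing `null` to a `Date` yields the Unix epoch.

```javascript
// A measurement arriving with null timestamps, as in the logs above.
const measurement = { firstByteAt: null, endAt: null }

// Coercing null to a Date turns it into timestamp 0 - the Unix epoch -
// which would surface downstream as new Date(0), not as null.
const endAt = new Date(measurement.endAt)
console.log(endAt.toISOString()) // 1970-01-01T00:00:00.000Z
```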