Joystream / joystream

Joystream Monorepo
http://www.joystream.org
GNU General Public License v3.0
1.42k stars 115 forks source link

[Colossus] uploads failing #5081

Closed kdembler closed 6 months ago

kdembler commented 7 months ago

This is investigation of upload failures we've been getting reports about lately.

From user reports, I have heard that they had uploads just abort mid-flight. Users have reported seeing CORS issues in the browser logs.

From my testing, I have had a case when the upload failed mid-flight and I got 408 response from the reverse proxy. This would also result in a CORS error as mentioned above.

yasiryagi commented 7 months ago

I had test this and looked in the logs.

Affected storage providers: all image

Geo location: all image

Client type: all

Observed when this issue happens the reverse proxy receives a reset from the peer i.e. Colossus and respond to the client with 408.

{"level":"error","ts":1708069037.667749,"logger":"http.handlers.reverse_proxy","msg":"reading from backend","error":"read tcp 172.21.0.7:33060->172.21.0.15:3333: read: connection reset by peer"}
{"level":"error","ts":1708069037.6678162,"logger":"http.handlers.reverse_proxy","msg":"aborting with incomplete response"," #error":"read tcp 172.21.0.7:33060->172.21.0.15:3333: read: connection reset by peer"}

Below is samples of the error

This error is the majority, sometimes a company by other errors (included the StorageObject affected)

dataObjectId: 2100687
Internal Server Error: Upload aborted
    at error (/joystream/node_modules/express-openapi-validator/dist/middlewares/openapi.multipart.js:123:20)
    at /joystream/node_modules/express-openapi-validator/dist/middlewares/openapi.multipart.js:16:26
    at Immediate._onImmediate (/joystream/node_modules/multer/lib/make-middleware.js:53:37)
    at process.processImmediate (node:internal/timers:478:21)
dataObjectId: 1321165
uncaughtException: Upload aborted
Internal Server Error: Upload aborted
    at error (/joystream/node_modules/express-openapi-validator/dist/middlewares/openapi.multipart.js:123:20)
    at /joystream/node_modules/express-openapi-validator/dist/middlewares/openapi.multipart.js:16:26
    at Immediate._onImmediate (/joystream/node_modules/multer/lib/make-middleware.js:53:37)
    at process.processImmediate (node:internal/timers:478:21)

Below a companied by another error (not on the same log)

dataObjectId: 2112736
uncaughtException: Upload aborted
Internal Server Error: Upload aborted
    at error (/joystream/node_modules/express-openapi-validator/dist/middlewares/openapi.multipart.js:123:20)
    at /joystream/node_modules/express-openapi-validator/dist/middlewares/openapi.multipart.js:16:26
    at Immediate._onImmediate (/joystream/node_modules/multer/lib/make-middleware.js:53:37)
    at process.processImmediate (node:internal/timers:478:21)

Error: Unexpected end of JSON input
    at new ApolloError (/joystream/node_modules/@apollo/client/errors/errors.cjs:34:28)
    at /joystream/node_modules/@apollo/client/core/core.cjs:1654:19
    at both (/joystream/node_modules/@apollo/client/utilities/utilities.cjs:986:53)
    at /joystream/node_modules/@apollo/client/utilities/utilities.cjs:979:72
    at new Promise (<anonymous>)
    at Object.then (/joystream/node_modules/@apollo/client/utilities/utilities.cjs:979:24)
    at Object.error (/joystream/node_modules/@apollo/client/utilities/utilities.cjs:987:49)
    at notifySubscription (/joystream/node_modules/zen-observable/lib/Observable.js:140:18)
    at onNotify (/joystream/node_modules/zen-observable/lib/Observable.js:179:3)
    at SubscriptionObserver.error (/joystream/node_modules/zen-observable/lib/Observable.js:240:7)
dataObjectId: 1713397
uncaughtException: Timeout
Error
    at new module.exports.TimeoutError (/joystream/node_modules/promise-timeout/index.js:44:16)
    at module.exports.timeout (/joystream/node_modules/promise-timeout/index.js:20:15)
    at extrinsicWrapper (/joystream/storage-node/lib/services/runtime/extrinsics.js:207:45)
    at acceptPendingDataObjects (/joystream/storage-node/lib/services/runtime/extrinsics.js:115:18)
    at uploadFile (/joystream/storage-node/lib/services/webApi/controllers/filesApi.js:99:57)

dataObjectId: 2112361
Error: Server is not accepting uploads into this bucket
    at validateUploadFileParams (/joystream/storage-node/lib/services/webApi/app.js:199:15)
    at app.use.OpenApiValidator.middleware.apiSpec (/joystream/storage-node/lib/services/webApi/app.js:59:13)
    at Layer.handle [as handle_request] (/joystream/node_modules/express/lib/router/layer.js:95:5)
    at trim_prefix (/joystream/node_modules/express/lib/router/index.js:317:13)
    at /joystream/node_modules/express/lib/router/index.js:284:7
    at Function.process_params (/joystream/node_modules/express/lib/router/index.js:335:12)
    at next (/joystream/node_modules/express/lib/router/index.js:275:10)
    at /joystream/storage-node/lib/services/webApi/app.js:54:9
    at Layer.handle [as handle_request] (/joystream/node_modules/express/lib/router/layer.js:95:5)
    at trim_prefix (/joystream/node_modules/express/lib/router/index.js:317:13)
dataObjectId: 2112588
Internal Server Error: Upload aborted
    at error (/joystream/node_modules/express-openapi-validator/dist/middlewares/openapi.multipart.js:123:20)
    at /joystream/node_modules/express-openapi-validator/dist/middlewares/openapi.multipart.js:16:26
    at Immediate._onImmediate (/joystream/node_modules/multer/lib/make-middleware.js:53:37)
    at process.processImmediate (node:internal/timers:478:21)   

Error: File multihash doesn't match the data object's ipfsContentId for data object ID = 2112588
    at verifyDataObjectInfo (/joystream/storage-node/lib/services/webApi/controllers/filesApi.js:225:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async uploadFile (/joystream/storage-node/lib/services/webApi/controllers/filesApi.js:109:9)

Other errors seem

Error: Unexpected end of JSON input
    at new ApolloError (/joystream/node_modules/@apollo/client/errors/errors.cjs:34:28)
    at /joystream/node_modules/@apollo/client/core/core.cjs:1654:19
    at both (/joystream/node_modules/@apollo/client/utilities/utilities.cjs:986:53)
    at /joystream/node_modules/@apollo/client/utilities/utilities.cjs:979:72
    at new Promise (<anonymous>)
    at Object.then (/joystream/node_modules/@apollo/client/utilities/utilities.cjs:979:24)
    at Object.error (/joystream/node_modules/@apollo/client/utilities/utilities.cjs:987:49)
    at notifySubscription (/joystream/node_modules/zen-observable/lib/Observable.js:140:18)
    at onNotify (/joystream/node_modules/zen-observable/lib/Observable.js:179:3)
    at SubscriptionObserver.error (/joystream/node_modules/zen-observable/lib/Observable.js:240:7)
dataObjectId: 2011371  (Only this)
uncaughtException: File size doesn't match the data object's size for data object ID = 2011371
Error: File size doesn't match the data object's size for data object ID = 2011371
    at verifyDataObjectInfo (/joystream/storage-node/lib/services/webApi/controllers/filesApi.js:221:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async uploadFile (/joystream/storage-node/lib/services/webApi/controllers/filesApi.js:109:9)
dataObjectId: 2005104
uncaughtException: Data object 2005104 already exists
Error: Data object 2005104 already exists
    at validateUploadFileParams (/joystream/storage-node/lib/services/webApi/app.js:216:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
ignazio-bovo commented 6 months ago

Upload error diagnostic 🔍

Route for a request through a openapi middleware authentication

  1. Client sends a multipart/form-data request: This is a type of HTTP request that's used for sending binary data in the message body. It's commonly used for uploading files.

  2. Request reaches the server: The server receives the request and passes it to the Express.js middleware pipeline.

  3. Request hits the OpenApiValidator.middleware: This middleware is configured to validate incoming requests against an OpenAPI specification. The OpenAPI specification is a standard, language-agnostic interface to RESTful APIs which allows both humans and computers to discover and understand the capabilities of the service without access to source code, documentation, or through network traffic inspection.

  4. Middleware validates the request: The middleware checks if the request matches the paths, methods, and parameters defined in the OpenAPI specification. If the request doesn't match the specification, the middleware will return an error response.

  5. Middleware checks if the request includes a file upload: If the request method, path, and parameters are valid, the middleware checks if the request includes a file upload. This is determined by the Content-Type header of the request, which should be multipart/form-data for file uploads.

  6. Middleware handles the file upload: If the request includes a file upload, the middleware handles the file according to the fileUploader configuration. This configuration uses the diskStorage._handleFile method from Multer (via a standard configuration), which means the file will be stored on disk. The destination property specifies the directory where the file should be stored.

  7. Middleware applies file upload limits: The limits property is used to control the size and number of files that can be uploaded. In your case, you're limiting it to one file with a maximum size specified by config.maxFileSize.

  8. Middleware passes control to the next middleware function: After handling the file upload, the middleware calls next(), which passes control to the next middleware function in the stack. This could be another validation middleware, a route handler, or an error handling middleware.

    fileuploader middleware

    Ref: https://github.com/Joystream/joystream/blob/e716e1d4beae5f00de05cd404015568643d68a14/storage-node/src/services/multer-storage/disk.ts#L46-94

  9. It first checks if the request has been aborted before the file handling process starts. If so, it returns an error with error Upload aborted early

  10. It then calls the getDestination method to determine the directory where the uploaded file should be stored. If there's an error, it's passed to the callback function.

  11. If the destination is successfully retrieved, it calls the getFilename method to generate a unique filename for the uploaded file. If there's an error or the filename is blank, it's passed to the callback function.

  12. If the filename is successfully generated, it creates a write stream to the destination path and pipes the file's read stream into it.

  13. It sets up error and finish event handlers for the write stream. If there's an error writing the file, it deletes the file and passes the error to the callback function. If the write stream finishes successfully, it calls the callback function with details about the uploaded file.

  14. It also sets up an 'aborted' event handler for the request. If the request is aborted (more on this below), it closes the write stream, deletes the file, and passes an error to the callback function. This also seems to be the most common error as reported above with error message Upload aborted

Possible scenarios for which an abort event is triggered on the user side

difficulty in reproducing the case in between () braces

  1. User manually cancels the upload: If the user is uploading a file through a web interface, they might have the option to cancel the upload. If they choose to do this, an abort event will be triggered. (1/5)
  2. User navigates away from the page: If the user is uploading a file and they navigate away from the page (or close the browser/tab), the upload will be cancelled and an abort event will be triggered. (1/5)
  3. Network connection is lost: If the user's network connection is lost during the upload, the client will not be able to continue sending the request to the server, and an abort event will be triggered. (?/5)
  4. Client times out: If the server takes too long to respond, the client might time out and abort the request. The amount of time the client will wait before timing out can vary depending on the client's configuration. (?/5)
  5. Client encounters an error: If the client encounters an error while preparing the request (for example, if there's not enough disk space to read the file into memory), it might abort the request. (5/5)
  6. Client's system goes into sleep or hibernation mode: If the client's system goes into sleep or hibernation mode while the file is being uploaded, the network connection might be interrupted, causing the client to abort the request. (4/5)
  7. Browser tab or application crash: If the browser tab or application crashes during the upload, the request will be aborted. (4/5)
ignazio-bovo commented 6 months ago

Sorry for the confusion above, the issue i.e. error code 408 (at least) is not related to any bugs in colossus.

Doing a quick search on GPT

Default Request Timeout Express itself does not have a built-in request timeout mechanism for closing requests that take too long to complete. By default, Node.js's HTTP server (upon which Express is built) does not enforce a request timeout. This means that if not explicitly set, a request could theoretically hang indefinitely if the client doesn't close the connection and the server doesn't end the response. However, for Node.js versions 13.0.0 and later, the default timeout of the underlying HTTP server is set to 5 minutes (300 seconds). This is a change from earlier versions where there was no default timeout. It's important to note that this timeout doesn't kill the request but rather closes the socket if no activity is detected in that timeframe, which could affect long-polling requests or similar patterns.

I have tried uploading a small video that usually succeeds without throttling enabled. I have set the throttling to 3G slow and the upload fails after 5.1 min

image

The error on the server side (colossus 4.0) is this:

024-03-07 17:19:48:1948 error: POST /api/v1/files: Error 200: Upload aborted

The experience is consistent: same error appears and timing appears on 3 different attempt (all with slightly different timeout intervals but all of them above 5 minutes)

Possible solution

One possible server-side solution to this is apparently to have a custom timeout settings along the following line:

// Set a timeout limit for all requests
app.use(timeout('600s')); // Adjust '10 mins' as needed

But I am not aware of possible implications for this. @zeeshanakram3 @mnaamani Feel free to provide feedback

ignazio-bovo commented 6 months ago

I hade a private chat with @yasiryagi on Discord and he suggests the following approach:

408 is thrown by caddy because of the server closing the connection. So can we expose colossus port and test on colossus directly. It need to issue the metadata and make sure your channel is assign to the testing server only

ignazio-bovo commented 6 months ago

I have tested it locally and it fails imagehowever on local gleev deployment it restarts from where it has left immediately without 408 response

Update On the second attempt it actually fails after about 5 mins with a 408 response (as you can seet on the time fields is 10 mins)

image

image

Solution progress

Setting Axios request timeout on atlas doesn't appear to work image

Setting on storage-server/server.ts

      const server = app.listen(port, () => logger.info(`Listening on http://localhost:${port}`))
      server.setTimeout(900000)

doesn't work either

zeeshanakram3 commented 6 months ago

So just did some investigation, NodeJS HTTP server indeed abort requests with default timeout of 300 secs.

Now the reason why server.setTimeout(900000) did not work was because it will only happen if the idle request exceeded this timeout. If the request was not idle then server.requestTimeout value would checked.

kdembler commented 6 months ago

Fixed in #5094, great job on this investigation @ignazio-bovo and thanks for the fix @zeeshanakram3