element-hq / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://element-hq.github.io/synapse
GNU Affero General Public License v3.0
1.63k stars 205 forks source link

Synapse can fail to move large media uploads from temporary storage to disk in time before the client disconnects #13009

Open matrixbot opened 11 months ago

matrixbot commented 11 months ago

This issue has been migrated from #13009.


This issue is a distilled version of #12937.

When a file is uploaded to Synapse's media store, Twisted buffers it into a temporary file and gives Synapse an IO handle to it once it's been fully received. The issue is that, with sufficiently large files and slow storage, it's possible that it will take more than 30 seconds to copy the bytes out of that temporary file into the correct location in the media store.

Clients such as Element Web will time out 30 seconds after having sent the last byte. (It also wouldn't seem unusual for a reverse proxy to do something similar.) When this happens, the disconnection causes Twisted to close the temporary file and so Synapse's copy is interrupted.

The error produced in the logs looks as follows:

synapse.http.site - 362 - INFO - POST-8959 - Connection from client lost before response was sent
synapse.http.server - 183 - ERROR - POST-8959 - Failed handle request via 'UploadResource': <XForwardedForRequest at 0x7f5bada94c40 method='POST' uri='/_matrix/media/r0/upload?filename=2.0gb' clientproto='HTTP/1.1' site='8008'>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/synapse/http/server.py", line 366, in _async_render_wrapper
    callback_return = await self._async_render(request)
  File "/usr/local/lib/python3.9/site-packages/synapse/http/server.py", line 396, in _async_render
    callback_return = await raw_callback_return
  File "/usr/local/lib/python3.9/site-packages/synapse/rest/media/v1/upload_resource.py", line 96, in _async_render_POST
    content_uri = await self.media_repo.create_content(
  File "/usr/local/lib/python3.9/site-packages/synapse/rest/media/v1/media_repository.py", line 178, in create_content
    fname = await self.media_storage.store_file(content, file_info)
  File "/usr/local/lib/python3.9/site-packages/synapse/rest/media/v1/media_storage.py", line 94, in store_file
    await self.write_to_file(source, f)
  File "/usr/local/lib/python3.9/site-packages/synapse/rest/media/v1/media_storage.py", line 101, in write_to_file
    await defer_to_thread(self.reactor, _write_file_synchronously, source, output)
  File "/usr/local/lib/python3.9/site-packages/twisted/python/threadpool.py", line 244, in inContext
    result = inContext.theWork()  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.9/site-packages/twisted/python/threadpool.py", line 260, in <lambda>
    inContext.theWork = lambda: context.call(  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.9/site-packages/twisted/python/context.py", line 117, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.9/site-packages/twisted/python/context.py", line 82, in callWithContext
    return func(*args, **kw)
  File "/usr/local/lib/python3.9/site-packages/synapse/logging/context.py", line 970, in g
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/synapse/rest/media/v1/media_storage.py", line 312, in _write_file_synchronously
    shutil.copyfileobj(source, dest)
  File "/usr/local/lib/python3.9/shutil.py", line 205, in copyfileobj
    buf = fsrc_read(length)
ValueError: I/O operation on closed file
synapse.http.server - 795 - WARNING - POST-8959 - Not sending response to request <XForwardedForRequest at 0x7f5bada94c40 method='POST' uri='/_matrix/media/r0/upload?filename=2.0gb' clientproto='HTTP/1.1' site='8008'>, already disconnected.

We would benefit overall by writing the bytes straight into a file on the correct filesystem as they're received from the client (either to the correct location, or to a temporary location with a rename operation upon completion). This would also mean that the client doesn't have to 'wait around' for the file copy to happen, since it's been happening during the upload (and in most cases, disk I/O is probably faster than the client's upload speed anyway!). In short: it will reduce upload latency a bit for large files.

(An uglier workaround would be to prevent the file copy from blocking the response, returning the MXC URL before it's even ready and finishing the copy in the background. I don't think it's great to give an MXC URL that's not ready — it could lead to races.)

Either solution probably involves some amount of spelunking into how Twisted manages these temporary files.

nooblag commented 10 months ago

Hi there, just chiming in to follow progress here. Thanks!

towards-a-new-leftypol commented 8 months ago

I don't want the client to time-out, this should be a client-side issue.

Does element on android and ios have these timeouts as well, or just element web?

nooblag commented 8 months ago

@towards-a-new-leftypol Thank you for the note. I am not sure, as I do not use the Android or iOS implementations, but I do know that the issue definitely manifests with the web and desktop app, yes. Perhaps someone else knowledgeable on those platforms can chime in to assist?

YamatoRyou commented 1 month ago

Hi there, just chiming in to follow progress here. Thanks!

I am the earliest initiator of this issue. I recently replaced the hard disk (SSD, connected via SATA2 interface), which improved this problem to a certain extent. According to my long-term observation, the following reasons may cause large file upload failures:

  1. If you try to upload a single file larger than 2 GB, it will definitely fail. Even if a sufficiently large value is configured in homeserver.yaml. The reason is that the data type of the column named media_length in the table local_media_repository in the database has a limitation (integer). The maximum value allowed by this type is 2147483647. This column records the size of each uploaded media, which is approximately equal to 2 GB. The first version I used was 1.49.0, and then it was slowly upgraded to 1.116.0. As of now, the database is still like this;
  2. Poor continuous write performance of the hard disk will also cause upload failures, and even as the hard disk life decays, the size of the file you can successfully upload will decrease over time;
  3. If Synapse If you have configured a reverse proxy in front of , the client_max_body_size must be equal to the allowed upload size you configured, otherwise the upload will fail (HTTP status code 413).

In order to test whether the database is the highest indicator for limiting the upload of large files, I bypassed the client limit and used Shell scripts and curl to test upload large files to Synapse. Finally, I found that the size it can accept is 2 GB anyway.

File info:

en-us_windows_10_iot_enterprise_ltsc_2021_x64_dvd_257ad90f.iso
4.52 GB

Config: homeserver.yaml

max_upload_size: 8192M

Method:

Follow Synapse's API specification
curl -X POST \
    -H "Authorization: Bearer $access_token" \
    -H "Content-Type: application/octet-stream" \
    -F "file_name=@$filepath/$filename" \
    $synapse_url/_matrix/media/v3/upload?filename=$filename

Response:

{"errcode":"M_UNKNOWN","error":"Internal server error"}

Log:

synapse.media.media_repository - 347 - INFO - POST-0 - Stored local media in file '/data/media_store/local_content/uE/fT/xblFXOfAefodsIjZLwcJ'
synapse.storage.txn - 760 - DEBUG - POST-0 - [TXN START] {store_local_media-13f}
synapse.storage.txn - 856 - DEBUG - POST-0 - [TXN FAIL] {store_local_media-13f} integer out of range

synapse.storage.txn - 864 - DEBUG - POST-0 - [TXN END] {store_local_media-13f} 0.013079 sec
synapse.http.server - 146 - ERROR - POST-0 - Failed handle request via 'UploadServlet': <XForwardedForRequest at 0x7fbdccdc35d0 method='POST' uri='/_matrix/media/v3/upload?filename=en-us_windows_10_iot_enterprise_ltsc_2021_x64_dvd_257ad90f.iso' clientproto='HTTP/1.1' site='8008'>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/synapse/http/server.py", line 332, in _async_render_wrapper
    callback_return = await self._async_render(request)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/http/server.py", line 544, in _async_render
    callback_return = await raw_callback_return
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/rest/media/upload_resource.py", line 111, in on_POST
    content_uri = await self.media_repo.create_content(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 922, in _wrapper
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/media/media_repository.py", line 349, in create_content
    await self.store.store_local_media(
  File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 922, in _wrapper
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/databases/main/media_repository.py", line 458, in store_local_media
    await self.db_pool.simple_insert(
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 1084, in simple_insert
    await self.runInteraction(desc, self.simple_insert_txn, table, values)
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 952, in runInteraction
    return await delay_cancellation(_runInteraction())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/internet/defer.py", line 2010, in _inlineCallbacks
    result = context.run(
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/failure.py", line 549, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 918, in _runInteraction
    result: R = await self.runWithConnection(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 1047, in runWithConnection
    return await make_deferred_yieldable(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/threadpool.py", line 269, in inContext
    result = inContext.theWork()  # type: ignore[attr-defined]
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/threadpool.py", line 285, in <lambda>
    inContext.theWork = lambda: context.call(  # type: ignore[attr-defined]
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/context.py", line 117, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/context.py", line 82, in callWithContext
    return func(*args, **kw)
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/enterprise/adbapi.py", line 282, in _runWithConnection
    result = func(conn, *args, **kw)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 1040, in inner_func
    return func(db_conn, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 780, in new_transaction
    r = func(cursor, *args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 1098, in simple_insert_txn
    txn.execute(sql, vals)
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 426, in execute
    self._do_execute(self.txn.execute, sql, parameters)
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 488, in _do_execute
    return func(sql, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
psycopg2.errors.NumericValueOutOfRange: integer out of range
synapse.access.http.8008 - 473 - INFO - POST-0 - 192.168.3.1 - 8008 - {@**********************************} Processed request: 63.217sec/0.000sec (0.810sec, 43.827sec) (0.000sec/0.013sec/1) 55B 500 "POST /_matrix/media/v3/upload?filename=en-us_windows_10_iot_enterprise_ltsc_2021_x64_dvd_257ad90f.iso HTTP/1.1" "curl/7.54.0" [0 dbevts]