RocketChat / Rocket.Chat

The communications platform that puts data protection first.
https://rocket.chat/

Exploding .wt collection file #32307

Open andrew-thought opened 6 months ago

andrew-thought commented 6 months ago

Description:

I have a WiredTiger file in /var/snap/rocketchat-server/common that grows by 10-20 GB per day and is currently at 100 GB. Our server is constantly running out of disk space.

Would like to trace what is happening here. Our group is not uploading large files and we do not have a large team.

Steps to reproduce:

  1. Navigate to /var/snap/rocketchat-server/common
  2. Check large files in directory
  3. See large .wt file -> -rw------- 1 root root 108925952000 Apr 24 00:27 collection-48--77318xxx37791.wt

Expected behavior:

Expect reasonable database and .wt file sizes considering the amount of data we put into Rocket.Chat

Actual behavior:

Exploding .wt file size

Server Setup Information:

Client Setup Information

Additional context

Relevant logs:

andrew-thought commented 6 months ago

traced it to rocketchat_userDataFiles.chunks
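
For anyone else trying to trace a growing .wt file back to a collection, here is a rough, untested mongosh sketch (the `collection-48-` prefix below is just the file name from the listing above; `db.<collection>.stats()` exposes the matching WiredTiger table name):

```js
// Run inside mongosh against the rocketchat database.
// Prints every collection whose on-disk WiredTiger table matches the
// growing collection-48-*.wt file seen in /var/snap/rocketchat-server/common.
db.getCollectionNames().forEach(function (name) {
  var stats = db.getCollection(name).stats();
  // stats.wiredTiger.uri looks like "statistics:table:collection-48--77318xxx37791"
  if (stats.wiredTiger && stats.wiredTiger.uri.indexOf("collection-48-") !== -1) {
    print(name + " -> " + stats.wiredTiger.uri + " (" + stats.storageSize + " bytes on disk)");
  }
});
```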

andrew-thought commented 6 months ago

Could be a failed upload? Is there a way to clear out the userDataFiles and userDataFiles.chunks?

reetp commented 6 months ago

Are you using GridFS?

It really isn't up to storing large amounts of files.

You should migrate to local file storage or some form of online storage.

https://github.com/RocketChat/filestore-migrator

andrew-thought commented 6 months ago

I found the issue. One of the users had started a download of their user data from the server; the download failed, but it was stuck in the cron list and kept re-initiating every 2 minutes. I believe the .wt file contained some portion of the failed zip(?) file, which repeatedly added 250 MB every 2 minutes to userDataFiles, userDataFiles.chunks, and that .wt file. I deleted the cron job and the failed compressed images from the database in userDataFiles and userDataFiles.chunks directly, and that resolved the issue.

So possibly this issue is really related to the failed user data download, but it can be closed otherwise.
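
For reference, a quick way to see which user the repeated export attempts belong to, and how much space they have piled up, is an aggregation over `rocketchat_user_data_files`; a rough sketch, assuming the export records carry `userId` and `size` fields:

```js
// Group the user-data export records by user and sum their sizes, largest first.
db.rocketchat_user_data_files.aggregate([
  { $group: { _id: "$userId", files: { $sum: 1 }, bytes: { $sum: "$size" } } },
  { $sort: { bytes: -1 } }
]);
```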

reetp commented 6 months ago

Ah excellent and thanks for letting us know!

I'll leave it open for the moment and ask someone to consider this.

david-uhlig commented 6 months ago

We experienced the same issue. This added >150GB of data within 24h until our server ran out of space.

Every ~2 minutes, a new document was added to rocketchat_user_data_files, looking like this:

{
    _id: 'qcg7KSZGJWqnyPzMn',
    userId: 'SZJmM6FCEnm9W3PL3',
    type: 'application/zip',
    size: 353507086,
    name: '2024-05-15-John%20Doe-qcg7KSZGJWqnyPzMn.zip',
    store: 'GridFS:UserDataFiles',
    _updatedAt: ISODate('2024-05-15T09:36:02.597Z')
}

Also, corresponding documents were added to rocketchat_userDataFiles.chunks and rocketchat_userDataFiles.files. All the added data was related to the same user. The user claims he did not upload or download anything. Our server is set up to accept uploads of up to ~10 MB. Mind you, the size in the document is ~350 MB?! We were not able to reproduce the issue after fixing it on the MongoDB side.

We were able to fix the issue by performing the following steps:

  1. Shut down the Rocket.Chat container: `docker compose stop rocketchat`
  2. Enter the mongo shell: `docker compose exec -it mongodb mongosh`
  3. Select the rocketchat database: `use rocketchat`
  4. Find the cause of the problem by looking at the data in `rocketchat_user_data_files`, `rocketchat_userDataFiles.chunks`, and `rocketchat_userDataFiles.files`, e.g. `db.rocketchat_user_data_files.find().pretty()`
  5. For us, it was a single user with a fresh account, so we could simply select his `userId`. You may need to adjust this for your case. You will also need some free disk space, as MongoDB keeps growing in size while deleting. You might need to execute the delete in batches and free up space with `db.runCommand({compact: "rocketchat_userDataFiles.chunks", force: true})` in between (see the sketch right after this list).
  6. Consider first creating an index on `rocketchat_userDataFiles.chunks`; otherwise, deleting can take much longer:
     `db.rocketchat_userDataFiles.chunks.createIndex({"files_id": 1})`
  7. Delete the files from the database:

         const userId = 'identified-user-id';
         var cursor = db.rocketchat_user_data_files.find({ userId: userId });
         while (cursor.hasNext()) {
           var file = cursor.next();
           db.rocketchat_userDataFiles.chunks.deleteMany({ "files_id": file._id });
           db.rocketchat_userDataFiles.files.deleteOne({ "_id": file._id });
         }
         db.rocketchat_user_data_files.deleteMany({ userId: userId });

  8. Compact the collection to free up disk space: `db.runCommand({compact: "rocketchat_userDataFiles.chunks", force: true})`
  9. Drop the index: `db.rocketchat_userDataFiles.chunks.dropIndex("files_id_1")`
  10. Restart and recreate the Rocket.Chat container: `docker compose up -d rocketchat --force-recreate`
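
If the disk is already too full for one big pass, the batched variant mentioned in step 5 could look roughly like this (same collections and `userId` as in step 7; the batch size of 20 is arbitrary):

```js
// Delete the user's export files in batches and compact periodically so the
// database does not keep growing while we delete. Step 8 runs a final compact.
const userId = 'identified-user-id';
const cursor = db.rocketchat_user_data_files.find({ userId: userId });
let deleted = 0;
while (cursor.hasNext()) {
  const file = cursor.next();
  db.rocketchat_userDataFiles.chunks.deleteMany({ files_id: file._id });
  db.rocketchat_userDataFiles.files.deleteOne({ _id: file._id });
  deleted += 1;
  if (deleted % 20 === 0) {
    db.runCommand({ compact: "rocketchat_userDataFiles.chunks", force: true });
  }
}
db.rocketchat_user_data_files.deleteMany({ userId: userId });
```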

### Server Setup Information:

* Version of Rocket.Chat Server: 6.8.0
* Operating System: Ubuntu 22.04
* Deployment Method: Docker
* Number of Running Instances: 1
* DB Replicaset Oplog:
* NodeJS Version:
* MongoDB Version: 5.0.24

david-uhlig commented 6 months ago

Just realized the data in the collections is from the self-service user data download on /account/preferences. So we might be fine just removing all data from the three collections mentioned above?

However, since removing the excessive data with the process described above, the data download isn't processed anymore. We get the following error message in the logs, which possibly stems from the deleted data. It keeps repeating every few minutes.

rocketchat-1  | Error: ENOENT: no such file or directory, stat '/tmp/zipFiles/feefe5d3-7aa5-4c0f-b2c0-98517a42106f.zip'
rocketchat-1  |  => awaited here:
rocketchat-1  |     at Function.Promise.await (/app/bundle/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/promise_server.js:56:12)
rocketchat-1  |     at server/lib/dataExport/uploadZipFile.ts:12:16
rocketchat-1  |     at /app/bundle/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/fiber_pool.js:43:40
rocketchat-1  |  => awaited here:
rocketchat-1  |     at Function.Promise.await (/app/bundle/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/promise_server.js:56:12)
rocketchat-1  |     at server/lib/dataExport/processDataDownloads.ts:232:25
rocketchat-1  |     at /app/bundle/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/fiber_pool.js:43:40 {
rocketchat-1  |   errno: -2,
rocketchat-1  |   code: 'ENOENT',
rocketchat-1  |   syscall: 'stat',
rocketchat-1  |   path: '/tmp/zipFiles/feefe5d3-7aa5-4c0f-b2c0-98517a42106f.zip'
rocketchat-1  | }

Any way to fix this?

Edit: Simply creating the file inside the container does the trick and allows RC to continue, e.g. `touch /tmp/zipFiles/feefe5d3-7aa5-4c0f-b2c0-98517a42106f.zip`
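
A rough way to confirm the retry loop has actually stopped is to watch whether new documents keep appearing in `rocketchat_user_data_files` (sketch only; fields as in the document shown above):

```js
// Newest export records first; if _updatedAt stops advancing every ~2 minutes,
// the retry loop is no longer running.
db.rocketchat_user_data_files.find({}, { name: 1, size: 1, _updatedAt: 1 })
  .sort({ _updatedAt: -1 })
  .limit(5);
```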

pjv commented 4 months ago

Hi @andrew-thought, above you wrote:

> I deleted the cron job and the failed compressed images from the database in userDataFiles and userDataFiles.chunks directly, and that resolved the issue.

Can you elaborate on exactly how you deleted the cron job and failed compressed images from the database? I have an RC instance that ate all the free space on my server's hard disk and is currently offline and I need to get it back online asap.

TIA

stevenhfotofix commented 4 months ago

Confirmed this is still happening (and the solution @david-uhlig provided worked) on 6.9.0 and 6.9.3 snap installs.

Thank you @david-uhlig for the extremely comprehensive solution to fix this problem!

david-uhlig commented 3 months ago

We experienced this bug again today on version 6.9.3. We have temporarily disabled User Data Download for now. We would appreciate it if this gets fixed soon.

reetp commented 3 months ago

This has been referred to the team and joins the queue, but there is no info on a fix yet. I've asked for an update.

However, it is open source... There is a massive system behind all this with competing priorities.

Obviously, PRs are always welcome!