Chocobozzz / PeerTube

ActivityPub-federated video streaming platform using P2P directly in your web browser
https://joinpeertube.org/
GNU Affero General Public License v3.0

Large file uploads result in 500 #4147

Closed kontrollanten closed 3 years ago

kontrollanten commented 3 years ago

Describe the current behavior

When uploading a 3 GB video file I receive a 500 response at the very end of the upload (~99.9%). Upon reloading the page the video has been created twice, so the upload did actually work.

Steps to reproduce:

  1. Upload a 3 GB file to an instance that uses s3fs to store the videos.
Chocobozzz commented 3 years ago

How long did it take to upload the file?

kontrollanten commented 3 years ago

Maybe 5-10 minutes. If you'd like a more precise answer I can time it.

Update: I did a new upload. After 6 minutes 38% is uploaded.

Update 2: 18 minutes in total.

Update 3: During (or after) the upload the whole site stops responding for all our users; they just get a 504 timeout. I'm not sure whether that issue is connected to this one.

rigelk commented 3 years ago

Do you have any logs?

kontrollanten commented 3 years ago

peertube-crash.log

It seems that the site stops responding before an error is shown in the client. When the upload is at ~99.9%, ffmpeg is executed (-y -acodec copy -vcodec libx264 -f mp4 -movflags faststart -max_muxing_queue_size 1024 -map_metadata -1 666 -bufsize 597332 -level:v 3.1 -g:v 50 -hls_time 4 -hls_list_size 0 -hls_playlist_type vod -hls_segment_filename /var/www/peertube/storage/tmp/hls/0de9ff0a8722737-240-fragmented.mp4 -hls_segment_type fmp4 -f hls -hls_flags single_file /var/www/peertube/storage/tmp/hls/240.m3u8), and the site is down for as long as ffmpeg runs. On some page loads only the frontend is loaded (no API responses), and sometimes there's just the browser's 503 view.

rigelk commented 3 years ago

What machine are you running this on?

kontrollanten commented 3 years ago

Ubuntu. Our test instance (where the logs come from) has 1 core and 2 GB RAM, and our prod instance has 4 cores and 16 GB RAM. Both behave the same way.

kukhariev commented 3 years ago

Chunk appends are a killer task for s3fs.

As possible options: disallow chunks in the client (set chunkSize: 0), move the resumable-uploads directory off s3fs if possible, or use a custom s3fs storage?
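
For illustration, the first option could look roughly like this on the client side, assuming ngx-uploadx's UploadxOptions is used to configure the uploader; the endpoint value below is a placeholder:

```ts
import { UploadxOptions } from 'ngx-uploadx'

// Sketch only: send the whole file in a single request instead of many
// appended chunks, which is the expensive pattern for s3fs.
const uploadOptions: UploadxOptions = {
  endpoint: '/api/v1/videos/upload-resumable', // placeholder endpoint
  chunkSize: 0 // 0 = do not split the file into chunks, per the suggestion above
}
```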

kontrollanten commented 3 years ago

> Chunk appends are a killer task for s3fs.
>
> As possible options: disallow chunks in the client (set chunkSize: 0), move the resumable-uploads directory off s3fs if possible, or use a custom s3fs storage?

/var/www/peertube/storage/tmp is not an s3fs directory, so that shouldn't matter? The whole upload process seems to work well; it's when ffmpeg starts that the site goes down. But I can try removing s3fs on our test instance and see what happens.

kontrollanten commented 3 years ago

Now I tested without s3fs and you were right, @kukhariev, it worked well. I guess the issue appears when the file is moved from tmp to the permanent folder. We don't have this issue when importing videos.

One solution could be to move the file in a job and then notify the client when it's ready. I see two different ways to solve it from a user perspective:

A) A user uploads the file; as soon as the file is uploaded the video gets a "file is being processed" status. During this status the video can't be published, but it can be set to "publish when ready".

B) Same scenario as today: the user sees the upload progress until the file is moved. The actual moving is done by a job and the client is notified when the job is done, which results in an "upload finished" for the user.
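
A rough sketch of what option B could look like on the server, using a hypothetical job queue and notification layer (none of these names are PeerTube's actual API):

```ts
interface MoveVideoFilePayload {
  videoId: number
  tmpPath: string
  destinationPath: string
}

// Declared only so the sketch type-checks; in a real implementation these would
// be the application's job queue, file-move helper and websocket notifier.
declare const jobQueue: { createJob (name: string, payload: MoveVideoFilePayload): Promise<void> }
declare function moveFile (src: string, dst: string): Promise<void>
declare function notifyClient (videoId: number, event: string): Promise<void>

// Called once the last chunk has been received: answer the client right away
// and defer the expensive move out of the request/response cycle.
async function onUploadFinished (payload: MoveVideoFilePayload) {
  await jobQueue.createJob('move-video-file', payload)
}

// Job processor: performs the slow copy to the (possibly s3fs-backed) storage,
// then tells the client the upload has really finished.
async function processMoveVideoFile (payload: MoveVideoFilePayload) {
  await moveFile(payload.tmpPath, payload.destinationPath)
  await notifyClient(payload.videoId, 'upload-finished')
}
```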

kontrollanten commented 3 years ago

@Chocobozzz Would a PR implementing alternative B above be accepted?

rigelk commented 3 years ago

If moving a file results in a 500 error, fixing this issue would imply fixing all file moves that are not in a job.

> I guess the issue appears when the file is moved from tmp to the permanent folder.

Why does this happen only there, and only with s3fs?

kontrollanten commented 3 years ago

> I guess the issue appears when the file is moved from tmp to the permanent folder.
>
> Why does this happen only there, and only with s3fs?

I don't know. The issue may appear even without s3fs, but then the move is done so quickly that it's hard to notice any downtime.
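
For context, a move within the same filesystem is usually a near-instant rename, while a move onto an s3fs mount degrades to a full copy of the file over the network. A minimal Node sketch of that distinction (not PeerTube's actual code):

```ts
import { createReadStream, createWriteStream, promises as fs } from 'fs'
import { pipeline } from 'stream/promises'

// Move a file, falling back to a streamed copy + unlink when source and
// destination are on different filesystems (rename() fails with EXDEV).
// On the same device this returns almost immediately; across an s3fs mount
// the copy uploads the whole file to S3, which can take minutes for 3 GB.
async function moveFile (src: string, dst: string): Promise<void> {
  try {
    await fs.rename(src, dst)
  } catch (err: any) {
    if (err.code !== 'EXDEV') throw err
    await pipeline(createReadStream(src), createWriteStream(dst))
    await fs.unlink(src)
  }
}
```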

rigelk commented 3 years ago

If we don't understand the underlying issue, then why rush to make the wrong fix? This needs further investigation before making a PR, @kontrollanten.

kontrollanten commented 3 years ago

Yes, it would be optimal to understand why this happens before solving it, but if we find a way that solves the problem I can't see why it's the "wrong fix".

For us this is a big issue and I don't have the competence to investigate this further, but if you have the time and competence it's of course appreciated.

rigelk commented 3 years ago

It's not a question of optimality: you are just fixing the symptoms while at the same time increasing the complexity, and thus the technical debt, for everyone. I get that this is a big issue for you, but that doesn't mean things can be rushed.

rigelk commented 3 years ago

Potentially related: https://github.com/s3fs-fuse/s3fs-fuse/commit/c692093921cd6e2ed3d89b06cf3e980e8120c9bd

@kontrollanten what version of s3fs are you running? It seems version 1.89 fixed some problems they had with big file transfers.

EDIT: depending on the S3-compatible service you are using, they might put restrictions on file sizes, as well as on multipart sizes per request.

kontrollanten commented 3 years ago

> Potentially related: s3fs-fuse/s3fs-fuse@c692093
>
> @kontrollanten what version of s3fs are you running? It seems version 1.89 fixed some problems they had with big file transfers.

Thanks. I've tried with 1.89 but it's the same issue.

> EDIT: depending on the S3-compatible service you are using, they might put restrictions on file sizes, as well as on multipart sizes per request.

I'm using AWS, which has a max file size of 5 GB; the file I'm trying with is 3 GB.

kukhariev commented 3 years ago

If, after the last chunk has been uploaded successfully, the server responds during video processing with a retryable code such as 504, ngx-uploadx will send a HEAD request and uploadx.upload will call next() again. The logs show that this is not handled.

rigelk commented 3 years ago

> during video processing the server responds with a retryable code such as 504

We don't support HEAD requests at that endpoint, but that would explain the logs indeed.

kukhariev commented 3 years ago

Sorry, 'PUT'.
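
For readers following along, a deliberately simplified illustration of the retry pattern being described (this is not ngx-uploadx's actual logic, and the status list is an assumption): the final PUT gets a retryable status from the proxy while the backend is still busy, the client sends it again, and the second attempt is what ends up creating the video twice.

```ts
// Simplified sketch of a resumable uploader's retry behaviour (illustrative only).
const RETRYABLE_STATUSES = [423, 429, 500, 502, 503, 504] // assumed list

async function sendLastChunk (uploadUrl: string, lastChunk: Blob): Promise<void> {
  const res = await fetch(uploadUrl, { method: 'PUT', body: lastChunk })

  if (RETRYABLE_STATUSES.includes(res.status)) {
    // The reverse proxy timed out (e.g. 504) while the backend was processing
    // the finished upload; retrying makes the backend handle the final request
    // a second time, hence the duplicated video.
    return sendLastChunk(uploadUrl, lastChunk)
  }

  if (!res.ok) throw new Error(`Upload failed with status ${res.status}`)
}
```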

rigelk commented 3 years ago

Then I guess the reverse proxy, seeing no timely answer from the backend, could send a 504 error. @kontrollanten could you check whether that code is returned somewhere in your logs (either client or reverse proxy logs)?

Chocobozzz commented 3 years ago

Then @kontrollanten, try adding directives to your nginx template to increase the nginx proxy timeout for the resumable upload endpoint.
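
For reference, such directives might look roughly like this in the nginx template; the location path, backend address and timeout values are placeholders to adapt to your own setup:

```nginx
# Illustrative only: keep the proxied connection open long enough that the
# final request of a resumable upload does not hit a 504 while the backend
# is still busy (e.g. moving the file or running ffmpeg).
location ~ ^/api/v1/videos/upload-resumable {
  client_max_body_size  0;        # size limiting is handled by the application
  proxy_connect_timeout 600s;
  proxy_send_timeout    600s;
  proxy_read_timeout    600s;
  send_timeout          600s;

  proxy_pass http://127.0.0.1:9000;  # placeholder backend address
}
```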