cs3org / wopiserver

A vendor-neutral application gateway compatible with the WOPI specifications.
Apache License 2.0

high memory usage when opening a largish file #134

Closed wkloucek closed 8 months ago

wkloucek commented 10 months ago

We're running the wopiserver in Kubernetes with the wopiserver Helm Chart.

We're using resource limits for memory (https://github.com/cs3org/charts/blob/1f79697713223984ad5ca78d81de838f8cf053f0/wopiserver/values.yaml#L77-L85).

We noticed that the wopiserver will be OOM killed when one opens largish files.

I empirically found the following memory limits to work:

269 MB PDF file -> needs 600Mi memory limit
532 MB PDF file -> needs 1200Mi memory limit

In general, it seems we need a memory limit of roughly 2x the file size.

Background: wopiserver v10.1.0 using CS3 api with ownCloud InfiniteScale 4.0.1

dj4oC commented 10 months ago

Dear @glpatcern, could you perhaps take a look at this? That would be a great help to us. Thank you very much!

glpatcern commented 10 months ago

Hi @dj4oC and @wkloucek, I'll take a look, but I have a question: why would you open a PDF through a WOPI application? (which app do you use?)

I must say that in our experience (that is, with MS Office and Collabora) the application engine typically fails to open such large files in the first place. The largest I have tried is a 180 MB pptx file with videos, which did not blow up memory, but I'd have to check with even larger files.

glpatcern commented 10 months ago

Actually, I have a strong suspicion. The xrootd interface properly streams the file's content, whereas the cs3 interface reads it all into the requests buffer and then streams it to the WOPI application, likely duplicating that buffer, which would explain the 2x factor you observed.

The solution would then be to stream the content at the HTTP level, as detailed in https://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow
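
For illustration, here is a minimal sketch of what HTTP-level streaming could look like on the read path, assuming a hypothetical readfile_streamed() function, a storage_url parameter and a 4 MB chunk size; these names are illustrative only, not the actual wopiserver code:

```python
import requests
from flask import Response, stream_with_context

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB; the value is an arbitrary example


def readfile_streamed(storage_url: str, headers: dict) -> Response:
    """Proxy the file from the storage endpoint to the WOPI client in chunks."""
    # stream=True defers downloading the body; iter_content() then yields it
    # chunk by chunk instead of buffering the whole file in memory
    r = requests.get(storage_url, headers=headers, stream=True, timeout=10)
    r.raise_for_status()

    def generate():
        try:
            for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
                yield chunk
        finally:
            r.close()  # always release the connection back to the pool

    # stream_with_context keeps the Flask request context alive while yielding
    return Response(stream_with_context(generate()),
                    content_type='application/octet-stream')
```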

I can look into that but not immediately. Of course a PR is welcome!

dj4oC commented 10 months ago

Thank you! OnlyOffice supports PDF annotations as of version 7.5, which is why we open PDFs via WOPI. Let me check whether I can find a Python guru for a first PR.

DeepDiver1975 commented 10 months ago

> Let me check whether I can find a Python guru for a first PR

:wave: :rofl:

DeepDiver1975 commented 10 months ago

With writefile we have the same problem, as far as I can tell from the source code. A PR will follow on Monday. :wave:

glpatcern commented 10 months ago

Yep, I saw that too, but in that case I think the WOPI client (MS) does not stream, does it? How do you handle this in your PHP implementation?

DeepDiver1975 commented 9 months ago

From my perspective, a network resource is always a stream of data, regardless of what the WOPI client is doing.

The writefile() declaration has to be changed, as it currently expects all the data at once. This will be a bigger change.
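
To make the shape of that change concrete, here is a simplified sketch; the names, type hints and parameters are assumptions for illustration, not the real storage interface of the wopiserver:

```python
from typing import IO, Iterable, Union

# Today's shape (simplified): the whole payload must already be in memory.
def writefile(endpoint: str, filepath: str, userid: str, content: bytes) -> None:
    ...

# Shape being discussed (simplified): also accept a file-like object or an
# iterable of chunks, so callers could pass e.g. flask.request.stream directly.
def writefile_streaming(endpoint: str, filepath: str, userid: str,
                        content: Union[bytes, IO[bytes], Iterable[bytes]]) -> None:
    ...
```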

glpatcern commented 9 months ago

Correct, the signature of writefile would have to be changed. But the question remains: if WOPI clients send all the data at once, it seems moot to stream it chunk by chunk to the storage (which in itself requires Content-Range headers for HTTP, a server capable of supporting them, and a rewrite of that logic including failure scenarios) when the buffer is already held by the wopiserver.

It turns out that WOPI clients MAY support chunked uploads, but with a different set of APIs that belong to the CSPP Plus program: https://learn.microsoft.com/en-us/microsoft-365/cloud-storage-partner-program/plus/file-transfer/multi-request-file-upload-overview

I guess we leave things as they are...

DeepDiver1975 commented 9 months ago

WOPI clients perform POST requests. The POST headers contain the Content-Length header, which tells us the total length. The content is a stream by its technical nature, and the Flask request object allows us to access it as a stream via https://flask.palletsprojects.com/en/3.0.x/api/#flask.Request.stream

Looking at the implementation of get_data(), it reads from the stream as well.

All we need to do is iterate over the stream, the same as in readfile().
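
A minimal sketch of that idea, with a simplified PutFile-style endpoint and a hypothetical forward_chunk() helper standing in for whatever the storage layer exposes; the chunk size and route details are illustrative only:

```python
from flask import Flask, request

app = Flask(__name__)
CHUNK_SIZE = 4 * 1024 * 1024  # arbitrary example chunk size


def forward_chunk(chunk: bytes) -> None:
    """Hypothetical helper: push one chunk to the storage backend."""
    pass


@app.route('/wopi/files/<fileid>/contents', methods=['POST'])
def put_file(fileid):
    # Content-Length tells us the expected total size up front
    expected = request.content_length or 0

    # request.stream reads the body incrementally, unlike request.get_data(),
    # which buffers the whole payload in memory
    received = 0
    while True:
        chunk = request.stream.read(CHUNK_SIZE)
        if not chunk:
            break
        received += len(chunk)
        forward_chunk(chunk)

    return {'received': received, 'expected': expected}, 200
```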

glpatcern commented 8 months ago

Reopening until we test #141

glpatcern commented 8 months ago

#141 was tested and merged, closing.