Unstructured-IO / unstructured-api

Apache License 2.0

Memory leak #197

Open jmackie opened 1 year ago

jmackie commented 1 year ago

We're exploring using the unstructured API at work.

We're running quay.io/unstructured-io/unstructured-api:c9b74d4 on a "Pro" (private service) Render instance (i.e. 4 GB RAM).

We're using the service to process PDFs with the following parameters strategy=hi_res, pdf_infer_table_structure=true and skip_infer_table_types=[]. We're also using parallel mode via UNSTRUCTURED_PARALLEL_MODE_ENABLED=true (using the defaults for the other environment vars).

We've seen the service fall over several times due to OOM, and looking at metrics it looks as if there are resources not being freed after processing runs.

[image: memory metrics graph]

Each spike represents a processing run, with about 10 minutes between each.

awalker4 commented 1 year ago

Hi there, apologies for the late response. We certainly still have work to do on improving memory. We haven't seen a leak yet, but it's possible that batching work like this makes it come on faster. Happy to collaborate on this. First off:

Can you share more details about your workload that we can try to replicate?

awalker4 commented 1 year ago

I'm going to close this as we've made a lot of memory improvements over the last few months. Please feel free to create a new issue if needed!

awalker4 commented 8 months ago

We still have memory issues floating around, so I'm going to reopen this. cc @lambda-science

ill-yes commented 7 months ago

Same here for me. Using v0.0.65

lambda-science commented 4 months ago

[image] This still feels like an issue in v0.0.74. The RAM keeps growing and growing...

alimoezzi commented 4 months ago

Unfortunately, the memory keeps growing with new requests without ever freeing allocated memory.

lambda-science commented 4 months ago

> Unfortunately, the memory keeps growing with new requests without ever freeing allocated memory.

I strongly recommend using Apache Tika instead of Unstructured for PDF processing now. It's way faster and more efficient in terms of CPU/RAM.

Keep Unstructured only for specific tasks like handling Excel/PPTX/.eml files.

This RAM issue has been happening for literally a year without any fix; it's just blowing my mind.

lambda-science commented 4 months ago

> Unfortunately, the memory keeps growing with new requests without ever freeing allocated memory.

Same behaviour; this is driving me crazy. See the graph on the right, orange line. [image]

CXwudi commented 3 months ago

Can anyone confirm whether setting a memory limit in the Docker Compose file, like this:

services:
  unstructured:
    ...
    deploy:
      resources:
        limits:
          memory: 2.0g # try to prevent mem leak

helps?

I really wanna self-deploy unstructured, but due to this I can't and have to use the serverless API.

awalker4 commented 3 months ago

Hi @CXwudi, I'd recommend setting MAX_LIFETIME_SECONDS to some long value, per the README. Since the server is stateless, you can keep memory from growing too much by killing it periodically and letting Docker's restart=always policy bring it back up.

I realize this is an ugly workaround and I want to apologize to everyone here that this has been open for so long. I'm going to block some time this week to sort out our leak here.
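
For anyone trying this workaround, a minimal Compose sketch might look like the following. It only combines what is mentioned in this thread: the MAX_LIFETIME_SECONDS variable, a restart policy, and the memory limit from the question above. The one-hour lifetime is an arbitrary assumption, not a recommended value, and the image tag is simply the one from the original report; substitute a release whose README documents MAX_LIFETIME_SECONDS.

```yaml
services:
  unstructured:
    # Tag copied from the original report (assumption: replace with the release you run).
    image: quay.io/unstructured-io/unstructured-api:c9b74d4
    # Bring the container back up each time the server exits.
    restart: always
    environment:
      # Hypothetical value: the server exits after roughly one hour.
      MAX_LIFETIME_SECONDS: "3600"
    deploy:
      resources:
        limits:
          memory: 2.0g # hard cap from the question above; the container is killed if it exceeds it
```

Because the server is stateless, a periodic restart should lose nothing except requests that are in flight at the moment of exit, so clients may want to retry failed calls.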

ManuelAngel99 commented 1 month ago

Hi @awalker4, any news on this? It's still leaking memory like crazy.

vespero89 commented 1 month ago

> Hi @awalker4, any news on this? It's still leaking memory like crazy.

Same for me! Let's wait for news.

awalker4 commented 1 month ago

Hi all, I'm now on parental leave at Unstructured! There are lots of priorities right now, but I'd recommend pinging the team on Slack for some more visibility on this: https://short.unstructured.io/pzw05l7