Open jmackie opened 1 year ago
Hi there, apologies for the late response. We certainly still have work to do on improving memory. We haven't seen a leak yet, but it's possible that batching work like this makes it come on faster. Happy to collaborate on this. First off: are you running the latest unstructured-api? We've had a number of fixes in the last month that should be relevant, namely caching the layout models and reducing the number of images generated for tesseract. You can also use UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB to control some of the OOM kills for now. Try setting this so that we'll reject new documents when memory is low. Can you share more details about your workload that we can try to replicate?
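For anyone trying this, the variable is passed to the container like any other environment variable. A minimal sketch; the threshold value, port, and image tag below are illustrative assumptions, not recommendations:

```shell
# Reject new documents when free memory drops below ~2 GB (2048 is an example value).
# The image tag and port mapping are placeholders; pin the release you actually run.
docker run -p 8000:8000 \
  -e UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB=2048 \
  quay.io/unstructured-io/unstructured-api:latest
```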
I'm going to close this as we've made a lot of memory improvements over the last few months. Please feel free to create a new issue if needed!
We still have memory issues floating around, so going to reopen this. cc @lambda-science
Same here for me. Using v0.0.65
Feels like this is still an issue in v0.0.74.
The RAM grows and grows...
Unfortunately, the memory keeps growing with new requests without ever freeing allocated memory.
I strongly recommend using Apache Tika instead of Unstructured for PDF processing now. It's way faster and more efficient in terms of CPU/RAM.
Keep Unstructured only for specific tasks like handling Excel/PPTX/.EML files.
This RAM issue has been happening for literally a year without any fix; it's just blowing my mind.
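For anyone wanting to try the Tika route: Apache Tika ships a server image whose `/tika` endpoint extracts plain text from an uploaded document. A quick sketch, assuming the default port; the file name is a placeholder:

```shell
# Start the Tika server (default port 9998), then extract plain text from a PDF.
# "document.pdf" is a placeholder for your own file.
docker run -d -p 9998:9998 apache/tika
curl -T document.pdf http://localhost:9998/tika --header "Accept: text/plain"
```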
Same behaviour here, and it's driving me crazy (graph on the right, orange line).
Is anyone able to confirm if setting a limit in the docker compose file like:

```yaml
services:
  unstructured:
    ...
    deploy:
      resources:
        limits:
          memory: 2.0g # try to prevent mem leak
```

helps?
I really wanna self-deploy unstructured, but due to this I can't and have to use the serverless API.
Hi @CXwudi, I'd recommend setting the MAX_LIFETIME_SECONDS to some long value, per the README here. Since the server is stateless, you can avoid growing the memory too much by killing it periodically and using the restart=always docker param.
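Put together, the workaround looks roughly like this; the lifetime value is an arbitrary example, and the port and tag are placeholders for your own setup:

```shell
# Kill the worker every hour (3600 is an example value); the restart policy
# brings it back up with a fresh heap, bounding the memory growth.
docker run --restart=always -p 8000:8000 \
  -e MAX_LIFETIME_SECONDS=3600 \
  quay.io/unstructured-io/unstructured-api:latest
```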
I realize this is an ugly workaround and I want to apologize to everyone here that this has been open for so long. I'm going to block some time this week to sort out our leak here.
Hi @awalker4 , any news on this? It's still leaking memory like crazy
Same for me! Let's wait for news.
Hi all, I'm now on parental leave at unstructured! There are lots of priorities right now but I'd recommend pinging the team on slack for some more visibility on this: https://short.unstructured.io/pzw05l7
We're exploring using the unstructured API at work.
We're running quay.io/unstructured-io/unstructured-api:c9b74d4 on a "Pro" (private service) Render instance (i.e. 4 GB RAM). We're using the service to process PDFs with the following parameters: strategy=hi_res, pdf_infer_table_structure=true, and skip_infer_table_types=[]. We're also using parallel mode via UNSTRUCTURED_PARALLEL_MODE_ENABLED=true (using the defaults for the other environment vars). We've seen the service fall over several times due to OOM, and looking at metrics it looks as if there are resources not being freed after processing runs.
Each spike represents a processing run, with about 10 minutes between each.
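For anyone trying to reproduce this workload, a request with those parameters looks roughly like the following, assuming the API's default general-purpose route; the host, port, and file name are placeholders:

```shell
# Example request mirroring the parameters above.
# localhost:8000 and sample.pdf are placeholders for your own deployment and file.
curl -X POST http://localhost:8000/general/v0/general \
  -F 'files=@sample.pdf' \
  -F 'strategy=hi_res' \
  -F 'pdf_infer_table_structure=true' \
  -F 'skip_infer_table_types=[]'
```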