alephdata / aleph

Search and browse documents and data; find the people and companies you look for.
http://docs.aleph.occrp.org
MIT License

High resource usage problem. Memory leak #3782

Open perrfect opened 3 weeks ago

perrfect commented 3 weeks ago

Describe the bug

I have a physical server without virtualization and run Aleph via docker compose. Server resources: RAM 512 GB, CPU 128, LVM 10 TB

I'm trying to add files to the server (about 40 GB). After some time the upload process crashes because the server has run out of resources. In top I see that the soffice.bin process is using all of my RAM.
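
To put a number on what top shows, here is a minimal sketch that sums the resident memory of all soffice.bin processes. It assumes a Linux host where `ps` accepts `-e -o rss= -o comm=` (procps-style). The function names `parse_rss_kib` and `soffice_rss_kib` are made up for this example and are not part of Aleph or ingest-file.

```python
import subprocess


def parse_rss_kib(ps_output: str, name: str) -> int:
    """Sum the RSS column (KiB) of every row whose command name equals `name`.

    Expects header-less rows of the form "<rss> <comm>".
    """
    total = 0
    for line in ps_output.splitlines():
        parts = line.split(None, 1)  # split into RSS and command name
        if len(parts) != 2:
            continue  # skip blank or malformed rows
        rss, comm = parts
        if comm.strip() == name and rss.isdigit():
            total += int(rss)
    return total


def soffice_rss_kib(name: str = "soffice.bin") -> int:
    """Run ps on the host and return the total RSS (KiB) of processes named `name`."""
    # "rss=" and "comm=" request the columns with empty headers,
    # so ps prints no header line (assumption: procps-style ps).
    result = subprocess.run(
        ["ps", "-e", "-o", "rss=", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_rss_kib(result.stdout, name)
```

Calling `soffice_rss_kib()` on the Docker host every few seconds during an upload and logging the result would show whether memory grows steadily or jumps on particular files.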

To Reproduce

  1. Run Aleph via docker compose.
  2. Try to upload different types of data (xls, xlsx, pdf, zip etc.)

Expected behavior

The upload process should complete successfully, and soffice.bin should not use all server resources.

Aleph version 3.15.0

Additional context

The docker-compose.yml file

version: "3.2"

services:

  postgres:
    image: postgres:10.0
    env_file: ./aleph.env
    command: postgres -c 'max_connections=2000'
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data
    restart: always

  elasticsearch:
    image: ghcr.io/alephdata/aleph-elasticsearch:3bb5dbed97cfdb9955324d11e5c623a5c5bbc410
    hostname: elasticsearch
    environment:
      - discovery.type=single-node
      - bootstrap.memory_lock=false
      - "ES_JAVA_OPTS=-Xms16g -Xmx16g"
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    restart: always
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65535
        hard: 65535

  redis:
    image: redis:alpine
    restart: always
    command: [ "redis-server", "--save", "3600", "10" ]
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data

  ingest-file:
    image: ghcr.io/alephdata/ingest-file:3.19.1
    restart: always
    tmpfs:
      - /tmp:mode=777
    volumes:
      - archive-data:/data
    depends_on:
      - postgres
      - redis
    env_file: ./aleph.env

  worker:
    image: ghcr.io/alephdata/aleph:${ALEPH_TAG:-3.15.0}
    restart: always
    command:
      - /bin/bash
      - -c
      - |
        aleph upgrade
        aleph worker
    depends_on:
      - postgres
      - elasticsearch
      - redis
      - ingest-file
    tmpfs:
      - /tmp
    volumes:
      - archive-data:/data
    env_file: ./aleph.env

  shell:
    image: ghcr.io/alephdata/aleph:${ALEPH_TAG:-3.15.0}
    command: /bin/bash
    depends_on:
      - postgres
      - elasticsearch
      - redis
      - ingest-file
      - worker
    tmpfs:
      - /tmp
    volumes:
      - archive-data:/data
      - "./mappings:/aleph/mappings"
      - "~:/host"
    env_file: ./aleph.env

  api:
    image: ghcr.io/alephdata/aleph:${ALEPH_TAG:-3.15.0}
    restart: always
    expose:
      - 8000
    ports:
      - "8000:8000"
    depends_on:
      - postgres
      - elasticsearch
      - redis
      - worker
      - ingest-file
    tmpfs:
      - /tmp
    volumes:
      - archive-data:/data
    env_file: ./aleph.env

  ui:
    image: ghcr.io/alephdata/aleph-ui-production:${ALEPH_TAG:-3.15.0}
    restart: always
    depends_on:
      - api
    ports:
      - "8080:8080"

volumes:
  archive-data: {}
  postgres-data: {}
  redis-data: {}
  elasticsearch-data: {}

aleph.env file


# Aleph environment configuration
#
# This file is loaded by docker-compose and transformed into a set of
# environment variables inside the containers. These are, in turn, parsed
# by aleph and used to configure the system.

POSTGRES_USER=test
POSTGRES_PASSWORD=test
POSTGRES_DATABASE=test

# Random string:
ALEPH_SECRET_KEY=some_secret

# Visible instance name in the UI
ALEPH_APP_TITLE=Aleph
ALEPH_APP_NAME=aleph
ALEPH_UI_URL=http://10.10.10.10:8080/

ALEPH_ADMINS=admin@example.com

ALEPH_SINGLE_USER=false

ARCHIVE_TYPE=s3
ARCHIVE_BUCKET=aleph
AWS_ACCESS_KEY_ID=some_key_id
AWS_SECRET_ACCESS_KEY=some_secret_key
ARCHIVE_ENDPOINT_URL=http://10.10.10.20:9000
AWS_SECURE=false

ELASTICSEARCH_TLS_VERIFY_CERTS=0

ALEPH_OCR_DEFAULTS=eng

ALEPH_DEBUG=true

LOG_FORMAT=JSON  # TEXT or JSON

PROMETHEUS_ENABLED=true
PROMETHEUS_MULTIPROC_DIR=/data

WORKER_THREADS=0

stchris commented 1 week ago

Hi @perrfect! Thanks for reporting this. Are you able to reproduce this problem with the latest stable release of Aleph? At this time that would be 3.17.0.