ICIJ / datashare

A self-hosted search engine for documents.
https://datashare.icij.org
GNU Affero General Public License v3.0

java.lang.OutOfMemoryError #1189

Closed awccu closed 11 months ago

awccu commented 1 year ago

Describe the bug The program works as expected when my .csv files are ~10MB. With larger .csv files, however, I get a java.lang.OutOfMemoryError and the files are not indexed. Specifically, the index is created in Elasticsearch, but no documents are sent to it.

To Reproduce Steps to reproduce the behavior: We are running this in server mode, dockerized. However, the .yml points to an on-premise ESXi VM running Redis, an on-premise ESXi VM running Elasticsearch, and an on-premise ESXi VM running PostgreSQL.

As stated, the setup works and parses the data properly when the file sizes of the .csv file are ~10MB. However, we have many large .csv files. I have pre-processed the large 10GB++ files to a smaller, more reasonable size -- 1GB. However, the OutOfMemoryError persists.
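
Since files of ~10MB are reported to index correctly, one possible workaround while the root cause is unknown is to split the large files further. The following is a minimal Java sketch of such a split, not something Datashare provides: `CsvSplitter`, `split`, and the `part-00001.csv` naming are hypothetical. It assumes UTF-8 input with one record per line (quoted fields containing line breaks would be cut apart) and a header on the first line, which it repeats in every part.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Datashare.
class CsvSplitter {

    // Streams `input` line by line and writes parts of roughly maxBytes each
    // into outDir, repeating the header line at the top of every part.
    // Assumption: one record per line, UTF-8 encoded.
    static List<Path> split(Path input, Path outDir, long maxBytes) throws IOException {
        List<Path> parts = new ArrayList<>();
        Files.createDirectories(outDir);
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String header = reader.readLine();
            if (header == null) {
                return parts; // empty input: nothing to write
            }
            long headerBytes = header.getBytes(StandardCharsets.UTF_8).length + 1;
            BufferedWriter writer = null;
            long written = 0;
            int rowsInPart = 0;
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    long lineBytes = line.getBytes(StandardCharsets.UTF_8).length + 1;
                    // Start a new part when none is open, or when the current one
                    // already has rows and this row would push it past the limit.
                    if (writer == null || (rowsInPart > 0 && written + lineBytes > maxBytes)) {
                        if (writer != null) {
                            writer.close();
                        }
                        Path part = outDir.resolve(String.format("part-%05d.csv", parts.size() + 1));
                        parts.add(part);
                        writer = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                        writer.write(header);
                        writer.write('\n');
                        written = headerBytes;
                        rowsInPart = 0;
                    }
                    writer.write(line);
                    writer.write('\n');
                    written += lineBytes;
                    rowsInPart++;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CsvSplitter.java <input.csv> <output-dir>
        long tenMiB = 10L * 1024 * 1024; // the size reported to work in this issue
        List<Path> parts = split(Path.of(args[0]), Path.of(args[1]), tenMiB);
        System.out.println("wrote " + parts.size() + " part(s)");
    }
}
```

Only one line is held in memory at a time, so the splitter itself does not need a large heap. A single row larger than the limit still gets a part of its own rather than being dropped.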

To reproduce the behavior, run the program in server mode, with a separate SCAN job and a separate INDEX job. The SCAN job works fine. The INDEX job fails. Specifically, the index job is run as follows:

docker compose exec datashare_web /entrypoint.sh --mode CLI --stage INDEX --queueType REDIS --queueName "datashare_3_20230922-1_1gb_csvs:queue2" --redisAddress redis://192.168.1.37:6379 --defaultProject test-project-datashare-3-20230922-1_1gb_csvs --elasticsearchAddress http://192.168.1.54:9200 --dataDir /home/datashare/Datashare/20230922-1_1gb_csvs --ocr false | tee /mnt/datashare_3_source_data/logs/test-project_index_datashare-3-20230922-1_1gb_csvs

Expected behavior I expect the files to be processed and placed into the Elasticsearch index, and made available in the Datashare web page.

Screenshots The appropriate portion of the log identifying the problem is below:

java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
    at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
    at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
    at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
    at org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:124)
    at org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:136)
    at org.apache.http.impl.io.SessionOutputBufferImpl.flush(SessionOutputBufferImpl.java:144)
    at org.apache.http.impl.io.ContentLengthOutputStream.flush(ContentLengthOutputStream.java:102)
    at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:964)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:233)
    at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514)
    at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1484)
    at org.elasticsearch.client.RestHighLevelClient.performRequestAndHandleResponse(RestHighLevelClient.java:1454)
    at org.elasticsearch.client.RestHighLevelClient.bulk(RestHighLevelClient.java:477)
    at org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer.bulk(ElasticsearchIndexer.java:68)
    at org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer.add(ElasticsearchIndexer.java:53)
    at org.icij.datashare.text.nlp.AbstractPipeline.processAndIndex(AbstractPipeline.java:80)
    at org.icij.datashare.text.nlp.AbstractPipeline.processAndIndex(AbstractPipeline.java:68)
    at org.icij.datashare.text.nlp.NlpApp.lambda$null$2(NlpApp.java:66)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
    at org.icij.datashare.text.nlp.NlpApp.lambda$createNlpListener$3(NlpApp.java:66)
    at org.icij.datashare.text.nlp.NlpApp$$Lambda$82/0x00000008001c4040.onMessage(Unknown Source)
    at org.icij.datashare.text.nlp.NlpListener.onMessage(NlpListener.java:40)
    at org.icij.datashare.text.nlp.NlpListener.onMessage(NlpListener.java:16)
    at org.icij.datashare.com.Channel.publish(Channel.java:23)
    at org.icij.datashare.com.Message.publish(Message.java:37)
    at org.icij.datashare.com.RedisMessageQueue.publish(RedisMessageQueue.java:35)
    at org.icij.datashare.text.nlp.NlpApp.main(NlpApp.java:81)
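
In the trace, the error is thrown in `ByteArrayOutputStream.grow`, reached from `ElasticsearchIndexer.bulk` through `RestHighLevelClient.bulk`. In other words, the bytes of a bulk request were being collected in an in-memory buffer when the heap ran out. Why the buffer became that large is not established in this thread. The sketch below is plain JDK code, not Datashare or Elasticsearch client code, and `BufferGrowthDemo` is a hypothetical name. It only illustrates the general pattern: everything written to a `ByteArrayOutputStream` stays on the heap, and each time the buffer grows its backing array is copied, so the old and new arrays briefly exist side by side.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical demo class; not part of Datashare or the Elasticsearch client.
class BufferGrowthDemo {

    // Writes totalBytes into an in-memory buffer in chunkBytes pieces and
    // returns how many bytes the buffer holds afterwards.
    static int bufferInMemory(int totalBytes, int chunkBytes) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[chunkBytes];
        int written = 0;
        while (written < totalBytes) {
            int n = Math.min(chunkBytes, totalBytes - written);
            // write() grows the backing array when it is full; growing
            // allocates a larger array and copies the old contents into it.
            buffer.write(chunk, 0, n);
            written += n;
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        int payload = 8 * 1024 * 1024; // 8 MiB stand-in for one request body
        int held = bufferInMemory(payload, 8192);
        System.out.println("bytes held on heap: " + held);
    }
}
```

With this pattern the heap needed is driven by the size of the largest single payload, not by the total amount of data processed.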

Desktop (please complete the following information):

Additional context Here is a copy of my .yml file in Datashare (changed some sensitive info):

version: "3.7"
services:

  # This is the main Datashare service that serves the web interface, and controls
  # the processing of the data.
  # It also creates ephemeral user sessions.

  datashare_web:
    image: icij/datashare:13.0.0
    hostname: datashare
    ports:
      - 8080:8080
    environment:
      - DS_DOCKER_MOUNTED_DATA_DIR=/mnt/datashare_3_source_data
      - JAVA_OPTS=-Xms31g -Xmx40g # Allocate minimum 31GB and maximum 40GB of heap memory to JVM
    volumes:
      - type: bind
        source: /mnt/datashare_3_source_data
        target: /home/datashare/Datashare
    command: >-
      --mode SERVER
      --dataDir /home/datashare/Datashare    
      --authFilter org.icij.datashare.session.YesCookieAuthFilter
      --busType REDIS
      --batchQueueType REDIS
      --dataSourceUrl "jdbc:postgresql://192.168.1.38:5432/datashare?user=datashare&password=obfuscatedpass"
      --defaultProject test-project-datashare-3-20230922-1_1gb_csvs
      --elasticsearchAddress http://192.168.1.54:9200      
      --messageBusAddress redis://192.168.1.37:6379
      --queueType REDIS
      --redisAddress redis://192.168.1.37:6379  
      --rootHost http://localhost:8080
      --sessionStoreType REDIS
      --sessionTtlSeconds 43200
      --tcpListenPort 8080

  datashare_create_project:
    image: icij/datashare:13.0.0
    restart: "no"
    command: >-
      --defaultProject test-project-datashare-3-20230922-1_1gb_csvs
      --mode CLI 
      --stage INDEX 
      --elasticsearchAddress http://192.168.1.54:9200

  datashare_batch_searches:
    image: icij/datashare:13.0.0
    depends_on:
      - datashare_web
    command: >-
      --mode BATCH_SEARCH 
      --batchQueueType REDIS
      --batchThrottleMilliseconds 250
      --busType REDIS
      --dataSourceUrl "jdbc:postgresql://192.168.1.38:5432/datashare?user=datashare&password=obfuscatedpass"
      --defaultProject test-project-datashare-3-20230922-1_1gb_csvs
      --elasticsearchAddress http://192.168.1.54:9200  
      --queueType REDIS
      --redisAddress redis://192.168.1.37:6379
      --scrollSize 500  

  datashare_batch_downloads:
    image: icij/datashare:13.0.0
    depends_on:
      - datashare_web
    volumes:
      - type: bind
        source: /mnt/datashare_3_source_data
        target: /home/datashare/Datashare
      - type: volume
        source: datashare-batchdownload-dir
        target: /home/datashare/app/tmp
        read_only: false
    command: >-
      --mode BATCH_DOWNLOAD 
      --dataDir /home/datashare/Datashare    
      --batchDownloadTimeToLive 336
      --batchQueueType REDIS
      --batchThrottleMilliseconds 250
      --busType REDIS
      --dataSourceUrl "jdbc:postgresql://192.168.1.38:5432/datashare?user=datashare&password=obfuscatedpass"
      --defaultProject test-project-datashare-3-20230922-1_1gb_csvs
      --elasticsearchAddress http://192.168.1.54:9200  
      --queueType REDIS
      --redisAddress redis://192.168.1.37:6379
      --scrollSize 500

volumes:
  datashare-batchdownload-dir:
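
The compose file above sets `JAVA_OPTS=-Xms31g -Xmx40g` on `datashare_web`. The `java` launcher does not read `JAVA_OPTS` on its own; that variable only has an effect if the launch script passes it along (the JVM does read `JAVA_TOOL_OPTIONS` directly). Whether `/entrypoint.sh` in this image forwards `JAVA_OPTS` is not shown in this thread, so it may be worth confirming which heap limit a JVM in the container actually gets. A minimal sketch for that check, where `HeapCheck` is a hypothetical helper and not part of Datashare:

```java
// Hypothetical helper for illustration; not part of Datashare.
class HeapCheck {

    // Maximum heap this JVM will try to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Renders an environment variable as NAME=value, or NAME=(unset).
    static String describeEnv(String name) {
        String value = System.getenv(name);
        return name + "=" + (value == null ? "(unset)" : value);
    }

    public static void main(String[] args) {
        System.out.println("max heap: " + maxHeapMiB() + " MiB");
        System.out.println(describeEnv("JAVA_OPTS"));
        System.out.println(describeEnv("JAVA_TOOL_OPTIONS"));
    }
}
```

Assuming a JDK is available in the container, `java HeapCheck.java` run through `docker compose exec datashare_web` would show the default limit, and `java $JAVA_OPTS HeapCheck.java` would show the limit with the configured options. If the Datashare process itself reports a much smaller maximum than 40GB, the options are probably not reaching it.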

(Optional) Your contact, availabilities and timezone if a video call with screensharing is needed For any private information, please consider sending an email to datashare@icij.org. I have emailed you my private contact information.

pirhoo commented 1 year ago

Hello,

Can you please reformat your issue with code blocks so we can read it properly?

If you use docker compose we will also need your Docker Compose file.

Pierre

pirhoo commented 1 year ago

You must use "```" around your code blocks:

https://docs.github.com/fr/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#quoting-code

I see your compose file now, thanks.

awccu commented 1 year ago

Actually, I figured out how to reformat with code blocks. Thank you.


From: Adam Rutkowski @.> Sent: Friday, September 22, 2023 7:33 AM To: ICIJ/datashare @.> Subject: Re: [ICIJ/datashare] java.lang.OutOfMemoryError (Issue #1189)

I do not know how to reformat it with code blocks. I placed the entire .yml in there. I also placed the command I was using for executing into and running the index job

awccu commented 1 year ago

An additional piece of potentially relevant info:

All ESXi VMs, if on separate pieces of hardware, are connected via a 10 gigabit network.


awccu commented 1 year ago

@pirhoo -- any thoughts on this bug???

pirhoo commented 1 year ago

Are you sure the command:

docker compose exec datashare_web /entrypoint.sh --mode CLI --stage INDEX --queueType REDIS --queueName "datashare_3_20230922-1_1gb_csvs:queue2" --redisAddress redis://192.168.1.37:6379 --defaultProject test-project-datashare-3-20230922-1_1gb_csvs --elasticsearchAddress http://192.168.1.54:9200 --dataDir /home/datashare/Datashare/20230922-1_1gb_csvs --ocr false | tee /mnt/datashare_3_source_data/logs/test-project_index_datashare-3-20230922-1_1gb_csvs

is the one returning the error?

I ask because the stack trace seems to show that the error comes from the NLP pipeline. It is the first time we have seen a Java heap space error at this step. Your configuration seems pretty robust, so I wonder if there are any other specifics we need to be aware of to troubleshoot this?

awccu commented 1 year ago

Yes. I'm sure that is the step causing the error. We aren't doing anything complicated, here.

github-actions[bot] commented 12 months ago

This issue is stale because it has been open for 40 days with no activity.

github-actions[bot] commented 11 months ago

This issue was closed because it has been inactive for 20 days since being marked as stale.