Closed awccu closed 11 months ago
Hello,
Can you please reformat your issue with code blocks so we can read it properly?
If you use Docker Compose, we will also need your Docker Compose file.
Pierre
You must use "```" around your code blocks:
I see your compose file now, thanks.
Actually, I figured out how to reformat with code blocks. Thank you.
From: Adam Rutkowski @.> Sent: Friday, September 22, 2023 7:33 AM To: ICIJ/datashare @.> Subject: Re: [ICIJ/datashare] java.lang.OutOfMemoryError (Issue #1189)
I do not know how to reformat it with code blocks. I placed the entire .yml in there. I also included the command I use to exec into the container and run the index job.
An additional piece of potentially relevant info:
All ESXi VMs, if on separate pieces of hardware, are connected via a 10 gigabit network.
@pirhoo -- any thoughts on this bug???
Are you sure the command:
docker compose exec datashare_web /entrypoint.sh --mode CLI --stage INDEX --queueType REDIS --queueName "datashare_3_20230922-1_1gb_csvs:queue2" --redisAddress redis://192.168.1.37:6379 --defaultProject test-project-datashare-3-20230922-1_1gb_csvs --elasticsearchAddress http://192.168.1.54:9200 --dataDir /home/datashare/Datashare/20230922-1_1gb_csvs --ocr false | tee /mnt/datashare_3_source_data/logs/test-project_index_datashare-3-20230922-1_1gb_csvs
is the one returning the error?
Because the stack trace seems to show that the error comes from the NLP pipeline. It's the first time we have seen a Java heap space error at this step. Your configuration seems pretty robust, so I wonder if there are any other specifics we need to be aware of to troubleshoot this?
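For reference, the reading of the stack trace above can be reproduced mechanically. The following is a minimal sketch, not part of Datashare: `parse_frames` and `first_frame_from` are hypothetical helper names, and `TRACE` is an abbreviated excerpt of the frames reported in this issue.

```python
import re

# Abbreviated excerpt of the trace reported in this issue (frames copied verbatim).
TRACE = (
    "java.lang.OutOfMemoryError: Java heap space "
    "at java.base/java.util.Arrays.copyOf(Arrays.java:3745) "
    "at org.elasticsearch.client.RestHighLevelClient.bulk(RestHighLevelClient.java:477) "
    "at org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer.bulk(ElasticsearchIndexer.java:68) "
    "at org.icij.datashare.text.nlp.AbstractPipeline.processAndIndex(AbstractPipeline.java:80) "
    "at org.icij.datashare.text.nlp.NlpApp.main(NlpApp.java:81)"
)


def parse_frames(trace):
    """Return the qualified method name of each 'at ...(...)' frame, innermost first."""
    return re.findall(r"\bat\s+([^\s(]+)\(", trace)


def first_frame_from(frames, package_prefix):
    """Return the innermost frame whose name starts with package_prefix, or None."""
    for frame in frames:
        if frame.startswith(package_prefix):
            return frame
    return None


frames = parse_frames(TRACE)
# Innermost Datashare frame: the Elasticsearch bulk call.
print(first_frame_from(frames, "org.icij.datashare."))
# Innermost frame inside the NLP package, which is the caller of that bulk call.
print(first_frame_from(frames, "org.icij.datashare.text.nlp."))
```

On this excerpt the first lookup lands on `ElasticsearchIndexer.bulk` and the second on `AbstractPipeline.processAndIndex`, which is consistent with the heap being exhausted while the NLP pipeline sends a bulk request to Elasticsearch.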
Yes. I'm sure that is the step causing the error. We aren't doing anything complicated, here.
This issue is stale because it has been open for 40 days with no activity.
This issue was closed because it has been inactive for 20 days since being marked as stale.
Describe the bug
The program works as expected if my .csv files are ~10MB. However, if my .csv files are larger, I get a java.lang.OutOfMemoryError and the files are not indexed. Specifically, the index is created in Elasticsearch, but no documents are sent to it.
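One way to confirm the "index exists but holds no documents" symptom is to query Elasticsearch's `_count` endpoint. This is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it assumes the index is named after the project passed to `--defaultProject`, which this thread does not confirm.

```python
import json
import urllib.parse
import urllib.request


def count_url(es_address, index):
    """Build the URL of the _count endpoint for one index."""
    return f"{es_address.rstrip('/')}/{urllib.parse.quote(index, safe='')}/_count"


def parse_count(body):
    """Extract the document count from a _count response body (JSON text)."""
    return int(json.loads(body)["count"])


def fetch_count(es_address, index, timeout=10):
    """Ask Elasticsearch how many documents the index currently holds."""
    with urllib.request.urlopen(count_url(es_address, index), timeout=timeout) as resp:
        return parse_count(resp.read().decode("utf-8"))
```

With the address from the command quoted in this thread, `fetch_count("http://192.168.1.54:9200", "test-project-datashare-3-20230922-1_1gb_csvs")` would return 0 if the index was created but nothing was indexed.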
To Reproduce
Steps to reproduce the behavior:
We are running this in server mode, dockerized. However, the .yml points to an on-premise ESXi VM running Redis, an on-premise ESXi VM running Elasticsearch, and an on-premise PostgreSQL ESXi VM.
As stated, the setup works and parses the data properly when the file sizes of the .csv file are ~10MB. However, we have many large .csv files. I have pre-processed the large 10GB++ files to a smaller, more reasonable size -- 1GB. However, the OutOfMemoryError persists.
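The pre-processing step is not shown in this issue. As an illustration only, splitting a large .csv into parts of roughly a given size could look like the sketch below. It is not the script actually used here: `split_csv` and `_CountingFile` are hypothetical names, and the input is assumed to be UTF-8 with a single header row.

```python
import csv
import os


class _CountingFile:
    """Wrap a text file and count the UTF-8 bytes written through it."""

    def __init__(self, handle):
        self.handle = handle
        self.bytes_written = 0

    def write(self, text):
        self.bytes_written += len(text.encode("utf-8"))
        return self.handle.write(text)


def split_csv(src_path, out_dir, max_bytes):
    """Split src_path into parts of roughly max_bytes each, repeating the header row.

    Returns the list of part paths. A part may exceed max_bytes by at most one row.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(src_path))[0]
    parts = []
    handle = counter = writer = None
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return parts  # empty input: nothing to split
        for row in reader:
            # Open a new part when none is open yet or the current one is full.
            if handle is None or counter.bytes_written >= max_bytes:
                if handle is not None:
                    handle.close()
                path = os.path.join(out_dir, f"{base}_part{len(parts) + 1:04d}.csv")
                handle = open(path, "w", newline="", encoding="utf-8")
                counter = _CountingFile(handle)
                writer = csv.writer(counter)
                writer.writerow(header)
                parts.append(path)
            writer.writerow(row)
    if handle is not None:
        handle.close()
    return parts
```

Rows are streamed one at a time, so the splitter itself never holds a whole file in memory. For 1GB parts, `max_bytes` would be `1024 ** 3`.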
To reproduce the behavior, run the program in server mode, with a separate SCAN job and a separate INDEX job. The SCAN job works fine. The INDEX job fails. Specifically, the INDEX job is run as follows:
Expected behavior
I expect the files to be processed and placed into the Elasticsearch index, and made available in the Datashare web page.
Screenshots
The appropriate portion of the log identifying the problem is below:
java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
    at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
    at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
    at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
    at org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:124)
    at org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:136)
    at org.apache.http.impl.io.SessionOutputBufferImpl.flush(SessionOutputBufferImpl.java:144)
    at org.apache.http.impl.io.ContentLengthOutputStream.flush(ContentLengthOutputStream.java:102)
    at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:964)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:233)
    at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514)
    at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1484)
    at org.elasticsearch.client.RestHighLevelClient.performRequestAndHandleResponse(RestHighLevelClient.java:1454)
    at org.elasticsearch.client.RestHighLevelClient.bulk(RestHighLevelClient.java:477)
    at org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer.bulk(ElasticsearchIndexer.java:68)
    at org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer.add(ElasticsearchIndexer.java:53)
    at org.icij.datashare.text.nlp.AbstractPipeline.processAndIndex(AbstractPipeline.java:80)
    at org.icij.datashare.text.nlp.AbstractPipeline.processAndIndex(AbstractPipeline.java:68)
    at org.icij.datashare.text.nlp.NlpApp.lambda$null$2(NlpApp.java:66)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
    at org.icij.datashare.text.nlp.NlpApp.lambda$createNlpListener$3(NlpApp.java:66)
    at org.icij.datashare.text.nlp.NlpApp$$Lambda$82/0x00000008001c4040.onMessage(Unknown Source)
    at org.icij.datashare.text.nlp.NlpListener.onMessage(NlpListener.java:40)
    at org.icij.datashare.text.nlp.NlpListener.onMessage(NlpListener.java:16)
    at org.icij.datashare.com.Channel.publish(Channel.java:23)
    at org.icij.datashare.com.Message.publish(Message.java:37)
    at org.icij.datashare.com.RedisMessageQueue.publish(RedisMessageQueue.java:35)
    at org.icij.datashare.text.nlp.NlpApp.main(NlpApp.java:81)
Desktop (please complete the following information):
Additional context
Here is a copy of my .yml file in Datashare (changed some sensitive info):
(Optional) Your contact, availabilities and timezone if a video call with screensharing is needed
For any private information, please consider sending an email to datashare@icij.org.
I have emailed you my private contact information.