dice-group / Palmetto

Palmetto is a quality measuring tool for topics
GNU Affero General Public License v3.0

java.lang.OutOfMemoryError: Java heap space I met this error #54

Open WuDiDaBinGe opened 2 years ago

WuDiDaBinGe commented 2 years ago
java.lang.OutOfMemoryError: Java heap space
    at com.carrotsearch.hppc.Internals.newArray(Internals.java:37)
    at com.carrotsearch.hppc.IntObjectOpenHashMap.allocateBuffers(IntObjectOpenHashMap.java:364)
    at com.carrotsearch.hppc.IntObjectOpenHashMap.expandAndPut(IntObjectOpenHashMap.java:318)
    at com.carrotsearch.hppc.IntObjectOpenHashMap.put(IntObjectOpenHashMap.java:194)
    at org.aksw.palmetto.corpus.lucene.WindowSupportingLuceneCorpusAdapter.requestDocumentsWithWord(WindowSupportingLuceneCorpusAdapter.java:124)
    at org.aksw.palmetto.corpus.lucene.WindowSupportingLuceneCorpusAdapter.requestWordPositionsInDocuments(WindowSupportingLuceneCorpusAdapter.java:102)
    at org.aksw.palmetto.prob.window.BooleanSlidingWindowFrequencyDeterminer.determineCounts(BooleanSlidingWindowFrequencyDeterminer.java:54)
    at org.aksw.palmetto.prob.window.BooleanSlidingWindowFrequencyDeterminer.determineCounts(BooleanSlidingWindowFrequencyDeterminer.java:45)
    at org.aksw.palmetto.prob.AbstractProbabilitySupplier.getProbabilities(AbstractProbabilitySupplier.java:37)
    at org.aksw.palmetto.DirectConfirmationBasedCoherence.calculateCoherences(DirectConfirmationBasedCoherence.java:87)
    at org.aksw.palmetto.webapp.PalmettoApplication.calculate(PalmettoApplication.java:198)
    at org.aksw.palmetto.webapp.PalmettoApplication.npmiService(PalmettoApplication.java:111)
    at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.web.bind.annotation.support.HandlerMethodInvoker.invokeHandlerMethod(HandlerMethodInvoker.java:176)
    at org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.invokeHandlerMethod(AnnotationMethodHandlerAdapter.java:440)
    at org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.handle(AnnotationMethodHandlerAdapter.java:428)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:933)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:867)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:951)
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:842)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:827)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:51)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:88)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:106)

I ran into this error when using multiple threads to compute topic coherence. My machine has 16 GB of RAM and an Intel i9 CPU.

MichaelRoeder commented 2 years ago

In general, this behavior is expected if you try to use many threads that evaluate different topics in parallel.

The problem is that window-based coherence measures need to know the positions of the single words within documents. If you have words that occur often, the program has to handle many positions at the same time. If you do that in parallel with different topics that have different words, it is not very surprising that the program runs out of memory :wink:
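To illustrate why word positions drive memory use, here is a minimal sketch of boolean sliding-window counting. It is not Palmetto's implementation (which reads positions from a Lucene index); the function name `count_windows` and its parameters are hypothetical. The point is that the position lists of every requested word must be held in memory while the windows are counted, so frequent words and parallel requests multiply the footprint.

```python
from typing import Dict, List


def count_windows(doc_length: int, window: int,
                  positions: Dict[str, List[int]], words: List[str]) -> int:
    """Count sliding windows of one document that contain every word in `words`.

    `positions` maps a word to the token offsets at which it occurs.
    Illustrative only: a real implementation would avoid scanning every offset.
    """
    # A document shorter than the window is treated as a single window.
    n_starts = 1 if doc_length < window else doc_length - window + 1
    # All position lists are materialized at once; this is the memory cost.
    position_sets = [set(positions.get(word, [])) for word in words]
    count = 0
    for start in range(n_starts):
        span = range(start, start + window)
        if all(any(p in pos for p in span) for pos in position_sets):
            count += 1
    return count


# Document "a b c a d b": a occurs at offsets 0 and 3, b at offsets 1 and 5.
example_positions = {"a": [0, 3], "b": [1, 5]}
both = count_windows(6, 3, example_positions, ["a", "b"])
only_a = count_windows(6, 3, example_positions, ["a"])
```

With many topics evaluated at the same time, several such position maps exist concurrently, which matches the out-of-memory behavior described above.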

It is hard to give you a hint without more information.

WuDiDaBinGe commented 2 years ago

It is hard to give you a hint without more information.

  • How have you parallelized the workflow (i.e., what is the task of a single thread)?
  • How many threads do you use?
  • How many topics do you try to evaluate?
  • How many top words does one of your topics have?

Thanks for your reply. I use three threads to calculate c_a, c_p and npmi, respectively, and pass the same data to all three. There are 100 topics, and each topic has its top 10 words to evaluate. topic_words is a topic-word matrix; in my case, its size is (100, 10).

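import threading

# `palmetto` (the python-Palmetto client), `topic_words` (100 x 10) and `ret` (a dict) are defined elsewhere.
def calculate_coherence(word_list, ret, coherence_type):
    # Query the coherence service once per topic and store the scores under the measure's name.
    result = []
    for words in word_list:
        result.append(palmetto.get_coherence(words, coherence_type=coherence_type))
    ret[coherence_type] = result

# One thread per coherence measure; all three threads receive the same topics.
th_ca   = threading.Thread(target=calculate_coherence, args=[topic_words, ret, 'ca'], name='th_ca')
th_cp   = threading.Thread(target=calculate_coherence, args=[topic_words, ret, 'cp'], name='th_cp')
th_npmi = threading.Thread(target=calculate_coherence, args=[topic_words, ret, 'npmi'], name='th_npmi')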

I mitigated the problem by running export CATALINA_OPTS="-Xms512m -Xmx3072m -XX:-UseGCOverheadLimit" before mvn org.apache.tomcat.maven:tomcat7-maven-plugin:2.2:run -Dmaven.tomcat.port=7777. This works when the number of topics is 75, but with 100 topics the server often crashes with "Aborted (core dumped)".

MichaelRoeder commented 2 years ago

Your setup looks good and should work. I am just wondering why you have -Xmx3072m in the options, as it limits the server to at most 3 GB of RAM. You may want to increase it and try again.

Another workaround would be to split up the list of documents and restart the server in-between. But that is a very bad solution :wink:
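A minimal sketch of that batching workaround on the client side, assuming the list to split is the list of topics. The helper `coherence_in_batches`, its `batch_size` default, and the `between_batches` hook are hypothetical names introduced for illustration; the hook is where a server restart (or simply a pause) would go, and how to restart the server is deliberately left out.

```python
from typing import Callable, List, Optional, Sequence


def coherence_in_batches(
    topics: Sequence[Sequence[str]],
    get_coherence: Callable[[Sequence[str]], float],
    batch_size: int = 25,
    between_batches: Optional[Callable[[], None]] = None,
) -> List[float]:
    """Score topics sequentially, batch by batch, preserving topic order."""
    scores: List[float] = []
    for start in range(0, len(topics), batch_size):
        # Run the hook before every batch except the first one.
        if start > 0 and between_batches is not None:
            between_batches()
        for words in topics[start:start + batch_size]:
            scores.append(get_coherence(words))
    return scores


# Stand-in scorer so the sketch runs without a Palmetto server.
pauses: List[int] = []
demo_topics = [["a"], ["a", "b"], ["a", "b", "c"], ["d"], ["e", "f"]]
demo_scores = coherence_in_batches(
    demo_topics,
    get_coherence=lambda words: float(len(words)),
    batch_size=2,
    between_batches=lambda: pauses.append(1),
)
```

With the Python client from this thread, `get_coherence` could be something like `lambda words: palmetto.get_coherence(words, coherence_type='npmi')`, using the same call as in the snippet above.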

We are aware of the problem that the web service sometimes has issues in budgeting its memory. Until now, it is unclear which part of the server creates the problem since the Palmetto library runs without memory issues if it is executed as a plain Java program.

WuDiDaBinGe commented 2 years ago

OK, I will increase -Xmx again. I use python-Palmetto, so I have not tried the Palmetto Java library. Maybe I will try it next time. Thanks.