Tree-tagger Processes don't terminate

GoogleCodeExporter commented 9 years ago

Hi,

I'm using DKPro Keyphrases' CooccurrenceGraphExtractor to extract keyphrases 
from various texts. The texts are processed sequentially. However, the Windows 
Task Manager shows that the tree-tagger processes do not terminate. So although 
I process the texts one after another, tree-tagger processes keep accumulating 
until my RAM is used up completely.

My code that invokes the keyphrase extraction looks like this:

for (String text : allTexts) {
    CooccurrenceGraphExtractor extractor = new CooccurrenceGraphExtractor();
    extractor.setMinKeyphraseLength(2);
    extractor.setCandidate(new Candidate(CandidateType.Token, PosType.N));
    List<Keyphrase> keyphrases = extractor.extract(text);
    keyphrases = getTopRankedUniqueKeyphrases(keyphrases, keyphrases.size());

    // save text 
    ...
}

Is there a way to avoid this accumulation of tree-tagger processes?
Thanks in advance.

Sincerely yours,
Laura

Original issue reported on code.google.com by Steinert...@googlemail.com on 8 Jul 2014 at 1:01

GoogleCodeExporter commented 9 years ago

I forgot: I'm using DKPro Keyphrases 1.5.0-SNAPSHOT on a 64 bit Windows 7 
operating system.

Original comment by Steinert...@googlemail.com on 8 Jul 2014 at 1:02

GoogleCodeExporter commented 9 years ago

Hi Laura, I also have a 64-bit Windows 7 operating system, and I could see that 
the tree-tagger process had terminated after I ran this code example. Could you 
try the stable version on Maven Central instead of the snapshot version and see 
how it goes?

Original comment by pedrobss...@gmail.com on 9 Jul 2014 at 8:38

GoogleCodeExporter commented 9 years ago

Hi. Where can I find a stable release?
On 
http://zoidberg.ukp.informatik.tu-darmstadt.de/artifactory/public-ukp-snapshots-local/de/tudarmstadt/ukp/dkpro/keyphrases/ 
I can only find the SNAPSHOT versions 1.5.0 and 1.6.0.
I have now switched to 1.6.0-SNAPSHOT, but the problem remains. Please note that 
this problem only seems to occur when processing MANY texts sequentially. I have 
a file of 22 texts where the problem does not occur. However, a file with 2400 
texts triggers it.

Original comment by Steinert...@googlemail.com on 9 Jul 2014 at 9:59

GoogleCodeExporter commented 9 years ago

Do all 2400 texts have the same language?

Original comment by richard.eckart on 9 Jul 2014 at 11:22

GoogleCodeExporter commented 9 years ago

Yes, they are all in English.

Original comment by Steinert...@googlemail.com on 9 Jul 2014 at 11:48

GoogleCodeExporter commented 9 years ago

The latest stable version is on Maven Central: 
http://search.maven.org/#search%7Cga%7C1%7Ckeyphrases

How big is that file? Could you share it with us so we can test it?

Original comment by pedrobss...@gmail.com on 9 Jul 2014 at 11:52

GoogleCodeExporter commented 9 years ago

Here's the file I'm using; each line is one text.

Original comment by Steinert...@googlemail.com on 9 Jul 2014 at 12:14

Attachments:

texts_treetagger_problem.txt

GoogleCodeExporter commented 9 years ago

The attached sample code reproduces the problem.

By the way, given the maven central URL for the stable version, how do I add 
that as a repository to my Maven POM?

Original comment by Steinert...@googlemail.com on 9 Jul 2014 at 12:30

Attachments:

Test.java

GoogleCodeExporter commented 9 years ago

For adding the dependency to your project, you don't need to add a repository 
to your pom, just the following tags:

<dependency>
    <groupId>de.tudarmstadt.ukp.dkpro.keyphrases</groupId>
    <artifactId>de.tudarmstadt.ukp.dkpro.keyphrases.wrappers-gpl</artifactId>
    <version>1.5.0</version>
</dependency>

Original comment by pedrobss...@gmail.com on 9 Jul 2014 at 12:45

GoogleCodeExporter commented 9 years ago

Okay, I switched to version 1.5.0, but the problem persists. Although I thought 
it worked for my smaller dataset of 22 texts, I have now noticed that it's 
happening there, too. These 22 texts are also all English.

Original comment by Steinert...@googlemail.com on 10 Jul 2014 at 8:50

GoogleCodeExporter commented 9 years ago

Here's a file with 17 texts (all English) that induces the same problem.

Original comment by Steinert...@googlemail.com on 10 Jul 2014 at 9:01

Attachments:

texts_treetagger_problem_2.txt

GoogleCodeExporter commented 9 years ago

I tested it, and the treetagger process had terminated once the pipeline 
finished. The attached screenshots show the process while the pipeline is 
running and after it has ended. I also made a small change to the code, because 
your implementation throws a NullPointerException and does not close the 
buffered reader[1]. But perhaps I misunderstood... are you saying that the 
problem is that multiple treetagger processes are created while the pipeline is 
running? Is that the issue?

[1] http://en.wikipedia.org/wiki/Resource_leak
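
A minimal sketch of reading one text per line without leaking the reader, 
assuming Java 7 or later. The attached Test.java is not reproduced in this 
thread, so the class and method names below are hypothetical and this is not 
the actual change that was made. The null check on readLine() guards against 
one common cause of a NullPointerException in such loops, which may or may not 
be the one that occurred here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ReadTextsSketch {

    /** Reads one text per line; the reader is closed even if an exception is thrown. */
    static List<String> readTexts(Reader source) {
        List<String> texts = new ArrayList<>();
        // try-with-resources closes the reader automatically when the block exits.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // readLine() returns null at end of input, so test for null before using the line.
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    texts.add(line); // skip blank lines, keep everything else
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return texts;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the attached text file.
        List<String> texts = readTexts(new StringReader("first text\n\nsecond text\n"));
        System.out.println(texts.size() + " texts read");
    }
}
```

For a real file, a FileReader (or Files.newBufferedReader) would take the place 
of the StringReader.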

Original comment by pedrobss...@gmail.com on 10 Jul 2014 at 12:02

Attachments:

GoogleCodeExporter commented 9 years ago

My problem is exactly that multiple treetagger processes exist while the 
program is running. Although there should only ever be one treetagger process 
at a time, they seem to queue up. It starts with just one process, but over 
time more appear (or the older ones don't terminate).

Attached you will find a screenshot that shows multiple treetagger processes 
during the execution of the test program.

Original comment by Steinert...@googlemail.com on 11 Jul 2014 at 8:27

Attachments:

screenshot.jpg

GoogleCodeExporter commented 9 years ago

I forgot to mention that they do not even disappear after the program has 
terminated.
At least, it takes some time for them to disappear...

Original comment by Steinert...@googlemail.com on 11 Jul 2014 at 10:06

GoogleCodeExporter commented 9 years ago

The problem is that you are creating a new instance of 
CooccurrenceGraphExtractor at each iteration of the loop. Create it once, 
before the loop, and reuse it for every text.
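
A minimal sketch of that change, assuming the extractor can be reused across 
calls to extract(). StubExtractor is a hypothetical stand-in that only counts 
how many instances get created; it is not the DKPro class, and its extract() 
logic is a placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReuseExtractorSketch {

    /** Hypothetical stand-in for CooccurrenceGraphExtractor; counts constructions. */
    static class StubExtractor {
        static int instancesCreated = 0;

        StubExtractor() {
            // Per this thread, every new real instance ends up with its own
            // tree-tagger process, so constructions are what we want to minimize.
            instancesCreated++;
        }

        List<String> extract(String text) {
            // Placeholder logic: treat whitespace-separated tokens as "keyphrases".
            return Arrays.asList(text.trim().split("\\s+"));
        }
    }

    /** Creates the extractor once and reuses it for every text. */
    static List<List<String>> extractAll(List<String> allTexts) {
        StubExtractor extractor = new StubExtractor(); // created once, outside the loop
        List<List<String>> results = new ArrayList<>();
        for (String text : allTexts) {
            results.add(extractor.extract(text));
        }
        return results;
    }

    public static void main(String[] args) {
        List<List<String>> results =
                extractAll(Arrays.asList("first text", "second longer text"));
        System.out.println(results.size() + " texts, "
                + StubExtractor.instancesCreated + " extractor instance(s)");
    }
}
```

With the real class, the configuration calls (setMinKeyphraseLength, 
setCandidate) would move out of the loop together with the constructor.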

Original comment by pedrobss...@gmail.com on 22 Jul 2014 at 8:35

Changed state: Invalid

Attachments:

Test.java

GoogleCodeExporter commented 9 years ago

Hi,

Sorry for my late answer, but I was on holiday. 

Thanks for pointing out my error; that was a really stupid one. ^_^
However, my problem is not completely solved. I'd really like to run the 
keyphrase extraction multithreaded using a ThreadPoolExecutor. For that, one 
specifies a minimum and maximum number of threads running in parallel. Then one 
simply adds all the threads to the ThreadPoolExecutor and starts the execution.

The great thing is that one does not have to worry about coordinating the 
execution of the threads. The bad thing (in this case) is that I don't know 
which threads are executed in which order. 

Suppose I want to have a maximum of n threads running in parallel. One could 
think that I could simply create n CooccurrenceGraphExtractor instances and 
assign them to the various threads. Say that at the beginning n threads are 
running, where thread i uses CooccurrenceGraphExtractor i. The threads can then 
terminate in an arbitrary order. If thread 2 ends first, its slot in the 
ThreadPoolExecutor might be filled with a thread using 
CooccurrenceGraphExtractor 1 instead of 2. Then I might have multiple threads 
using the same CooccurrenceGraphExtractor instance with different texts at the 
same time. Surely that would not work.

Do you have any idea how to compute the keyphrases with multiple threads?

Yours,
Laura

Original comment by Steinert...@googlemail.com on 8 Aug 2014 at 8:05

GoogleCodeExporter commented 9 years ago

Hi,

I think you should create one CooccurrenceGraphExtractor instance for each 
Thread.
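
One possible reading of this suggestion, sketched under assumptions: bind the 
instance to the worker thread of a fixed-size pool rather than to each 
submitted text, so that at most n instances exist no matter how many texts are 
queued. StubExtractor and the INSTANCES counter are hypothetical stand-ins, not 
DKPro classes, and the sketch does not show how a real extractor would be 
cleaned up when its thread ends.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class PerThreadExtractorSketch {

    /** Counts constructions so the sketch can show how many instances exist. */
    static final AtomicInteger INSTANCES = new AtomicInteger();

    /** Hypothetical stand-in for CooccurrenceGraphExtractor. */
    static class StubExtractor {
        StubExtractor() {
            INSTANCES.incrementAndGet();
        }

        List<String> extract(String text) {
            return Arrays.asList(text.trim().split("\\s+"));
        }
    }

    // Each worker thread lazily creates its own extractor on first use.
    static final ThreadLocal<StubExtractor> EXTRACTOR =
            ThreadLocal.withInitial(StubExtractor::new);

    static List<List<String>> extractAll(List<String> texts, int nThreads) {
        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String text : texts) {
                // The task looks up the extractor of whichever thread runs it.
                Callable<List<String>> task = () -> EXTRACTOR.get().extract(text);
                futures.add(executor.submit(task));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get()); // results keep the input order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            texts.add("text number " + i);
        }
        List<List<String>> results = extractAll(texts, 4);
        System.out.println(results.size() + " texts, "
                + INSTANCES.get() + " extractor instances (at most 4)");
    }
}
```

The difference from creating an instance inside each task is that the number of 
instances is bounded by the number of worker threads, not by the number of 
texts.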

Regards,
Pedro

Original comment by pedrobss...@gmail.com on 8 Aug 2014 at 12:01

GoogleCodeExporter commented 9 years ago

Another way is to create just one CooccurrenceGraphExtractor instance and 
synchronize access to it, so that all the threads can safely use the same 
instance.
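
A rough sketch of what synchronizing could look like, assuming a thin wrapper 
around a single shared instance. StubExtractor and SynchronizedExtractor are 
hypothetical names, not DKPro classes. Only one thread at a time can be inside 
extract(), so the extraction step itself runs serially.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SynchronizedExtractorSketch {

    /** Hypothetical stand-in for CooccurrenceGraphExtractor; not thread-safe on its own. */
    static class StubExtractor {
        private int calls = 0; // plain field: unsynchronized concurrent updates could be lost

        List<String> extract(String text) {
            calls++;
            return Arrays.asList(text.trim().split("\\s+"));
        }

        int calls() {
            return calls;
        }
    }

    /** Wraps one shared extractor so that only one thread at a time can use it. */
    static class SynchronizedExtractor {
        private final StubExtractor delegate = new StubExtractor();

        synchronized List<String> extract(String text) {
            return delegate.extract(text);
        }

        synchronized int calls() {
            return delegate.calls();
        }
    }

    /** Runs all texts through the shared extractor; returns how many calls it recorded. */
    static int extractAll(List<String> texts, int nThreads) {
        SynchronizedExtractor shared = new SynchronizedExtractor();
        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String text : texts) {
                Callable<List<String>> task = () -> shared.extract(text);
                futures.add(executor.submit(task));
            }
            for (Future<List<String>> future : futures) {
                future.get(); // wait for every task to finish
            }
            return shared.calls();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            texts.add("text number " + i);
        }
        System.out.println(extractAll(texts, 4) + " extract() calls recorded");
    }
}
```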

Original comment by pedrobss...@gmail.com on 8 Aug 2014 at 12:10

GoogleCodeExporter commented 9 years ago

Hi,

Creating one instance per thread is what I originally did, which 
resulted in the opening of this discussion. :(

What do you mean by synchronizing? Making the keyphrase extraction a 
mutex/monitor? Hmmm.... my threads essentially do only keyphrase extraction. So 
wouldn't that be as fast as a sequential computation, with additional 
overhead for the mutex control?

Original comment by Steinert...@googlemail.com on 11 Aug 2014 at 10:52

GoogleCodeExporter commented 9 years ago

Well, implement a pool from which each thread checks out an instance of the 
CooccurrenceGraphExtractor and to which the thread returns it before it ends. 
That way, you should never have two threads that share the same 
CooccurrenceGraphExtractor.
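
Such a pool could be sketched with a BlockingQueue, as below. This is an 
illustration under assumptions, not a tested DKPro setup: StubExtractor is a 
hypothetical stand-in that throws if two threads ever use it at once, and the 
INSTANCES counter exists only to show that the number of instances stays fixed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class ExtractorPoolSketch {

    static final AtomicInteger INSTANCES = new AtomicInteger();

    /** Hypothetical stand-in; throws if two threads ever use it at the same time. */
    static class StubExtractor {
        private final AtomicBoolean inUse = new AtomicBoolean(false);

        StubExtractor() {
            INSTANCES.incrementAndGet();
        }

        List<String> extract(String text) {
            if (!inUse.compareAndSet(false, true)) {
                throw new IllegalStateException("extractor shared by two threads");
            }
            try {
                return Arrays.asList(text.trim().split("\\s+"));
            } finally {
                inUse.set(false);
            }
        }
    }

    static List<List<String>> extractAll(List<String> texts, int poolSize) {
        // Fill the pool up front: exactly poolSize extractors exist for the whole run.
        BlockingQueue<StubExtractor> pool = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            pool.add(new StubExtractor());
        }
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String text : texts) {
                Callable<List<String>> task = () -> {
                    StubExtractor extractor = pool.take(); // check out; blocks until one is free
                    try {
                        return extractor.extract(text);
                    } finally {
                        pool.add(extractor); // always return it; the queue cannot be full here
                    }
                };
                futures.add(executor.submit(task));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get()); // rethrows if any task failed
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            texts.add("text number " + i);
        }
        List<List<String>> results = extractAll(texts, 4);
        System.out.println(results.size() + " texts, " + INSTANCES.get() + " extractors");
    }
}
```

Because take() blocks while the pool is empty, this stays safe even if the 
executor has more threads than there are extractors; the extra threads simply 
wait for an instance to be returned.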

Original comment by richard.eckart on 11 Aug 2014 at 2:19
