LanguageMachines / foliautils

Command-line utilities for working with the Format for Linguistic Annotation (FoLiA), powered by libfolia (C++), written by Ko van der Sloot (CLST, Radboud University)
https://proycon.github.io/folia
GNU General Public License v3.0

Process gets killed if several large files are input #68

Closed · pirolen closed this issue 1 year ago

pirolen commented 1 year ago

Hi, on large files, the FoLiA-txt tool in the containerized foliautils gets killed. I get:

```
/data # FoLiA-txt --remove-end-hyphens yes -O . *.txt
start processing of 22 files
Processed: 02_feb_car.txt into ./02_feb_car.folia.xml still 21 files to go.
Killed
```

It is not a big problem, since one can call the tool separately per file, but I thought I'd let you know.

Maybe it is better to call the tool once per file from a shell script in the container; I did not try that yet.
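Something like the following, I imagine (untested sketch, reusing the options from above):

```sh
# Untested: call FoLiA-txt once per input file instead of on the whole glob,
# so that a kill only affects a single file.
for f in *.txt; do
    FoLiA-txt --remove-end-hyphens yes -O . "$f"
done
```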

proycon commented 1 year ago

I wonder if it's the system's OOM killer: were you running out of memory? (Though that would imply there is a memory leak, given that all individual files do work.)
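On Linux the OOM killer writes to the kernel log, so something like this should surface it (may need root):

```sh
# Look for OOM-killer traces in the kernel ring buffer / journal:
sudo dmesg | grep -iE 'oom|out of memory'
sudo journalctl -k | grep -i 'killed process'
```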

pirolen commented 1 year ago

I tried searching the logs to track down the cause, but could not identify what happened. Grepping for 'kill' did not return anything in /var/log/dmesg, /var/log/kern.log, or /var/log/syslog. I have Ubuntu 20; could you advise where to look? Thanks!

kosloot commented 1 year ago

As far as I can see, there is no significant memory leak in FoLiA-txt. But maybe there is some strange oddity in the file at hand; I don't know. It seems that the first file is processed OK, but the second isn't.
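(If you want to verify this yourself, a run under valgrind, assuming you have it installed, will report any leaks, at the cost of a much slower run; `somefile.txt` below is a placeholder.)

```sh
# Check a single-file run for memory leaks:
valgrind --leak-check=full FoLiA-txt --remove-end-hyphens yes -O . somefile.txt
```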

I assume there is NO problem when that file is processed on its own?

pirolen commented 1 year ago

The files process fine if I call the converter on them one by one. I experienced the same thing on other files too, when calling the converter on directories of large files; there can be nearly 1 million tokens per file.

Typically, the process is killed after the first file has been converted.

kosloot commented 1 year ago

Well, I just ran tests on some fairly small files, and there seems to be some random effect which makes the run fail, but not always. It is currently taking 23.6 GB of memory, and I will kill it myself; but agreed, there is something rotten here. This needs some investigation.
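(To watch the memory use of such a run yourself, GNU time, assuming it is installed as /usr/bin/time, reports the peak resident set size:)

```sh
# 'Maximum resident set size' in the verbose output is the peak memory use.
/usr/bin/time -v FoLiA-txt --remove-end-hyphens yes -O . *.txt
```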

kosloot commented 1 year ago

OK, I guess it is some multithreading problem: a deadlock occurs. FoLiA-txt seems to 'stall' when running on multiple threads. You could try the -t1 or --threads=1 option (which slows things down, of course).
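For example, reusing the invocation from above:

```sh
# Force a single worker thread to avoid the deadlock (slower, but stable):
FoLiA-txt --threads=1 --remove-end-hyphens yes -O . *.txt
```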

Best is to upgrade to the newest Git version, which reports how many threads you are actually running. Good luck!
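Roughly like this, assuming the usual autotools flow of the LanguageMachines repositories and that libfolia and the other dependencies are already installed:

```sh
# Sketch: build and install foliautils from git master.
git clone https://github.com/LanguageMachines/foliautils.git
cd foliautils
bash bootstrap.sh
./configure
make
sudo make install
```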

kosloot commented 1 year ago

@pirolen the git master has a fix now, which hopefully resolves the deadlock.

kosloot commented 1 year ago

Closing, considering this to be fixed.