LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

"Process `ticclunk (1)` terminated with an error exit status (134)" from ticcl.nf #39

Closed willstout closed 6 years ago

willstout commented 6 years ago

I am getting an error when I try to run ticcl.nf with a folia.xml file I got from ocr.nf. The output leads me to believe this is an issue with the corpus.wordfreqlist.tsv file. When I omit the optional parameter --corpusfreqlist I get:

lamachine@0085222b6173:~$ ticcl.nf --inputdir /home/lamachine --lexicon /data/int/eng/eng.aspell.dict --alphabet /data/int/eng/eng.aspell.dict.lc.chars --charconfus /data/int/eng/eng.aspell.dict.c0.d2.confusion
N E X T F L O W  ~  version 0.30.2
Launching `/usr/local/bin/ticcl.nf` [dreamy_colden] - revision: 3bd4e988b7
--------------------------
TICCL Pipeline
--------------------------
[warm up] executor > local
[05/df5926] Submitted process > corpusfrequency (1)
[ef/85c258] Submitted process > corpusfrequency (2)
[af/61c201] Submitted process > ticclunk (1)
ERROR ~ Error executing process > 'ticclunk (1)'

Caused by:
  Process `ticclunk (1)` terminated with an error exit status (134)

Command executed:

  set +u
  if [ ! -z "" ]; then
      source /bin/activate
  fi
  set -u

  TICCL-unk --background "eng.aspell.dict" --artifrq 10000000 "corpus.wordfreqlist.tsv"

Command exit status:
  134

Command output:
  (empty)

Command error:
  terminate called after throwing an instance of 'std::runtime_error'
    what():  creating UniFilter: default_filter failed
  error in rules, line=-1 at postion: -1
  .command.sh: line 8:   551 Aborted                 TICCL-unk --background "eng.aspell.dict" --artifrq 10000000 "corpus.wordfreqlist.tsv"

Work dir:
  /home/lamachine/work/af/61c20118a7986818b0285e321ea881

Tip: when you have fixed the problem you can continue the execution appending to the nextflow command line the option `-resume`

 -- Check '.nextflow.log' file for details

and when I include --corpusfreqlist I get

lamachine@0085222b6173:~$ ticcl.nf --inputdir /home/lamachine --lexicon /data/int/eng/eng.aspell.dict --alphabet /data/int/eng/eng.aspell.dict.lc.chars --charconfus /data/int/eng/eng.aspell.dict.c0.d2.confusion --corpusfreqlist /home/lamachine/corpus.wordfreqlist.tsv
N E X T F L O W  ~  version 0.30.2
Launching `/usr/local/bin/ticcl.nf` [grave_dalembert] - revision: 3bd4e988b7
--------------------------
TICCL Pipeline
--------------------------
[warm up] executor > local
[27/88234a] Submitted process > ticclunk (1)
ERROR ~ Error executing process > 'ticclunk (1)'

Caused by:
  Process `ticclunk (1)` terminated with an error exit status (134)

Command executed:

  set +u
  if [ ! -z "" ]; then
      source /bin/activate
  fi
  set -u

  TICCL-unk --background "eng.aspell.dict" --artifrq 10000000 "corpus.wordfreqlist.tsv"

Command exit status:
  134

Command output:
  (empty)

Command error:
  terminate called after throwing an instance of 'std::runtime_error'
    what():  creating UniFilter: default_filter failed
  error in rules, line=-1 at postion: -1
  .command.sh: line 8:   637 Aborted                 TICCL-unk --background "eng.aspell.dict" --artifrq 10000000 "corpus.wordfreqlist.tsv"

Work dir:
  /home/lamachine/work/27/88234a21c792567fc01b34424bc2e3

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

The error still appears to be rooted in corpus.wordfreqlist.tsv. One guess is that eng.aspell.dict doesn't include some words that appear in corpus.wordfreqlist.tsv, but I can't imagine a clean-up program that wouldn't account for misspelled words, so I don't think that is the cause.

Another issue I see is that some of the words in my wordfreqlist begin with digits, and the program may not know how to handle cases like "13" where there was only a single "3" in the PDF read into ocr.nf. This, among several similar cases, could be problematic.

A final idea: when I got the wordfreqlist I looked through it just to see what it was, and the most frequent word in the file, "the", did not have a frequency attached to it. It should have appeared as "9the". When I noticed this I attempted to fix it and passed the corrected (or so I believe) corpus.wordfreqlist.tsv with the --corpusfreqlist parameter. This was the second example I included, and it still didn't work, so I don't know what's going wrong.

willstout commented 6 years ago

Additionally, when I just take the ocr'd text from my folia.xml file and input that into ticcl.nf, I still get issues.

lamachine@becdcaff28f3:~$ ticcl.nf --inputtype text --inputdir /home/lamachine --lexicon /data/int/eng/eng.aspell.dict --alphabet /data/int/eng/eng.aspell.dict.lc.chars --charconfus /data/int/eng/eng.aspell.dict.c0.d2.confusion
N E X T F L O W  ~  version 0.30.2
Launching `/usr/local/bin/ticcl.nf` [zen_wing] - revision: 3bd4e988b7
--------------------------
TICCL Pipeline
--------------------------
[warm up] executor > local
[51/91b433] Submitted process > txt2folia (1)
[14/35fefc] Submitted process > txt2folia (3)
[b7/4b63ab] Submitted process > txt2folia (2)
[97/741d65] Submitted process > corpusfrequency (1)
[e8/11059c] Submitted process > corpusfrequency (2)
[6d/7ba2bf] Submitted process > txt2folia (5)
[05/31d1c2] Submitted process > txt2folia (4)
[05/2db876] Submitted process > txt2folia (6)
ERROR ~ Error executing process > 'txt2folia (4)'

Caused by:
  Missing output file(s) `dependency_links.folia.xml` expected by process `txt2folia (4)`

Command executed:

  set +u
  if [ ! -z "" ]; then
      source /bin/activate
  fi
  set -u

  FoLiA-txt --class OCR -t 1 -O . "dependency_links.txt"

Command exit status:
  0

Command output:
  (empty)

Command error:
  nu useful data found in document:'dependency_links'
  skipped!

Work dir:
  /home/lamachine/work/05/31d1c24f45749ac0ad9ad3f422a203

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

And just to see what would happen if I included a corpus.wordfreqlist.tsv file with --corpusfreqlist, I tried that too and got the same errors I described in my previous comment.

proycon commented 6 years ago

I hope @martinreynaert can help you with this, he has the knowledge about the actual individual models.

kosloot commented 6 years ago

Well... The error message in the original entry here is quite clear:

Command error:
  terminate called after throwing an instance of 'std::runtime_error'
    what():  creating UniFilter: default_filter failed
  error in rules, line=-1 at postion: -1
  .command.sh: line 8:   551 Aborted                 TICCL-unk --background "eng.aspell.dict" --artifrq 10000000 "corpus.wordfreqlist.tsv"

This is fatal indeed. Unfortunately the cause may be hard to find. It might be an ICU library version problem, maybe related to the underlying OS. I am not aware of any other LaMachine installation where this fails. So what is your setup?
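For readers wondering what exit status 134 means: by shell convention, a status above 128 is 128 plus the number of the signal that ended the process. 134 is 128 + 6, and signal 6 is SIGABRT, which matches the "Aborted" line in the log and is what an uncaught C++ exception produces via std::terminate. A small sketch of that arithmetic (the function name `signal_from_status` is made up for illustration):

```shell
# Decode a shell exit status into the signal that ended the process, if any.
# Statuses above 128 follow the convention 128 + signal number.
signal_from_status() {
    code=$1
    if [ "$code" -gt 128 ]; then
        # kill -l <number> prints the signal name without the SIG prefix
        kill -l "$((code - 128))"
    else
        echo "none"
    fi
}

signal_from_status 134   # the status reported for ticclunk in this issue
signal_from_status 1     # an ordinary error exit, not a signal
```
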

Longer explanation: TICCL-unk uses the ICU Transliterator class to filter out or replace slack. It uses a simple C++ wrapper called UniFilter. This filter can be initialized from a file or a string. TICCL-unk provides a default string, which is used and works on all platforms I am aware of.

As a last resort you could disable this filter by creating an empty file and providing it to TICCL-unk using --filter='name_of_the_empty_file'. This allows the program to run, but of course it will not behave as well as desired.
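A minimal sketch of that workaround, assuming TICCL-unk is invoked by hand (for example from the Nextflow work dir) rather than through ticcl.nf. The file name `empty_filter` is a placeholder, and the remaining arguments are copied from the failing command above. The run is skipped when TICCL-unk is not on the PATH.

```shell
# Create a zero-byte filter file (the name is a placeholder).
filter_file="empty_filter"
: > "$filter_file"

# Only attempt the run when TICCL-unk is actually installed.
if command -v TICCL-unk >/dev/null 2>&1; then
    # Same arguments as the failing command, plus the empty filter.
    TICCL-unk --filter="$filter_file" --background "eng.aspell.dict" \
        --artifrq 10000000 "corpus.wordfreqlist.tsv"
else
    echo "TICCL-unk not found; skipping run"
fi
```
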

Messing around with other parameters will not solve this problem!

See the usage() output or the man page of TICCL-unk for more information on the parameters.

(The man pages may not be found directly; @proycon knows how to fix this.) @martinreynaert, having some documentation of the modules, especially how they are related and used, would be VERY helpful.

willstout commented 6 years ago

I'm running Windows 10 and have Docker as my container. I've updated Lamachine to include all NLP packages except alpino, tensorflow, and kaldi.

The filter file type wasn't specified so I used what I imagined to be correct, a text file. I got the same error.

lamachine@fbd9f7c9a01c:~$ ticcl.nf --inputdir /home/lamachine --lexicon /data/int/eng/eng.aspell.dict --alphabet /data/int/eng/eng.aspell.dict.lc.chars --charconfus /data/int/eng/eng.aspell.dict.c0.d2.confusion --filter /home/lamachine/empty_filter.txt
N E X T F L O W  ~  version 0.30.2
Launching `/usr/local/bin/ticcl.nf` [peaceful_kilby] - revision: 3bd4e988b7
--------------------------
TICCL Pipeline
--------------------------
[warm up] executor > local
[f3/b8dd7f] Submitted process > corpusfrequency (1)
[a3/d40f1d] Submitted process > ticclunk (1)
ERROR ~ Error executing process > 'ticclunk (1)'

Caused by:
  Process `ticclunk (1)` terminated with an error exit status (134)

Command executed:

  set +u
  if [ ! -z "" ]; then
      source /bin/activate
  fi
  set -u

  TICCL-unk --background "eng.aspell.dict" --artifrq 10000000 "corpus.wordfreqlist.tsv"

Command exit status:
  134

Command output:
  (empty)

Command error:
  terminate called after throwing an instance of 'std::runtime_error'
    what():  creating UniFilter: default_filter failed
  error in rules, line=-1 at postion: -1
  .command.sh: line 8:   117 Aborted                 TICCL-unk --background "eng.aspell.dict" --artifrq 10000000 "corpus.wordfreqlist.tsv"

Work dir:
  /home/lamachine/work/a3/d40f1d68fdcf50c25d4e4ece638133

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

willstout commented 6 years ago

Actually now that I think about it, the filter is probably a tsv, gonna try that

--EDIT: Didn't work, it has the same result

martinreynaert commented 6 years ago

Dear Will,

I have no idea what you are trying to do.

In the usage (obtained with parameter -h) it says for --filter: "default the following filter is used:"

This implies you do not need to supply a filter or even to pass on the --filter parameter.

I advise you to keep things as simple as possible. We probably have too many misleading parameters right now. If there are defaults, please go with them. All the rest are experimental and, unfortunately, insufficiently documented. We are working on that.

willstout commented 6 years ago

The filter change was something mentioned by @kosloot as a last resort. And alright I'll try to work on other things from here, thanks!

kosloot commented 6 years ago

So, changing the filter is not a definitive solution. But as pointed out by Will, it fails to initialize in his setup. Therefore I suggested trying to disable it temporarily to be able to continue the process. At a later moment there will be time to investigate why it fails.

@willstout the filter is NOT a simple TSV. See http://userguide.icu-project.org/transforms/general/rules if you really wonder what it is about.

kosloot commented 6 years ago

@willstout and @proycon: One possible explanation is that the locale is wrong. The Unicode filter will fail when the locale is NOT a UTF-8 variant. I just tested this with a setting of LANG=C.

So please be sure to use the right locale, e.g. en_US.UTF8 or nl_NL.utf8 or similar.
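A quick way to check this before running the pipeline, sketched under the assumption that looking at the locale name is good enough. The helper names `effective_ctype` and `is_utf8_locale` are made up for illustration. The lookup order follows POSIX: LC_ALL overrides LC_CTYPE, which overrides LANG.

```shell
# Print the locale name that governs character handling.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    else
        # With nothing set, the system falls back to the POSIX locale.
        printf '%s\n' "${LANG:-POSIX}"
    fi
}

# Succeed only when that name carries a UTF-8 codeset suffix,
# accepting both spellings (UTF8 / UTF-8) in any letter case.
is_utf8_locale() {
    case "$(effective_ctype)" in
        *.[Uu][Tt][Ff]8*|*.[Uu][Tt][Ff]-8*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_utf8_locale; then
    echo "locale looks fine: $(effective_ctype)"
else
    echo "locale is not a UTF-8 variant: $(effective_ctype)"
fi
```

This only inspects the variable names; it does not verify that the named locale is actually installed on the system.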

@proycon I thought that a UTF8 locale was enforced by LaMachine?

willstout commented 6 years ago

@kosloot So the filter input is essentially a language filter that is used by the program to change certain characters into other characters? Am I getting that correct?

kosloot commented 6 years ago

Well, it is a generic filter, which is used to:

The filter rules are heuristically developed during our processing of texts in several languages. There are certainly more cases to be found where a filter could help even more.

Besides the filters, TICCL-unk has a lot of other heuristics, which really need some documentation...

Did setting/changing the locale help?

willstout commented 6 years ago

I haven't, because I'm not sure how I would add that to the ticcl.nf command.

I have:

ticcl.nf --inputdir /home/lamachine/ --lexicon eng.aspell.dict --alphabet eng.aspell.dict.lc.chars --charconfus eng.aspell.dict.c0.d2.confusion

Where do I add the locale to that?

I tried adding --locale en_US.UTF8 to the end of what I had, but I got the same error I always get.

kosloot commented 6 years ago

No, that will not work. The locale is an OS-wide setting, but it can be overridden in your shell. I'm not sure what the best way to do it in LaMachine is; @proycon can tell you after his holidays.

Also it is wise to check the output of the locale command. It should read something like:

LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"

If that is different, then the first thing I would try is to add LANG=en_US.UTF8 before the ticcl.nf line. On the same line!

So: LANG=en_US.UTF8 ticcl.nf --inputdir /home/lamachine/ --lexicon ......

If that doesn't work, then the next attempt would be to add these lines to your .bashrc file:

export LANG="en_US.UTF-8"
export LC_ALL="en_US.UTF-8"

and start a new shell.
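The difference between the two approaches can be sketched as follows. An assignment placed before a command on the same line applies only to that one command, while `export` changes the environment for everything started from the shell afterwards. The `sh -c` child process stands in for ticcl.nf here, and the starting value LANG=C is chosen only for the demonstration.

```shell
# Start from a known non-UTF-8 setting for the demonstration.
export LANG=C

# Per-command override: only this one child sees the new value.
one_off=$(LANG=en_US.UTF-8 sh -c 'printf "%s" "$LANG"')

# The next child is back to the shell's own setting.
after_one_off=$(sh -c 'printf "%s" "$LANG"')

# Exported setting: every child started from now on inherits it.
export LANG=en_US.UTF-8
after_export=$(sh -c 'printf "%s" "$LANG"')

echo "with prefix:   $one_off"
echo "after prefix:  $after_one_off"
echo "after export:  $after_export"
```
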

willstout commented 6 years ago

I went through and made sure all the locale stuff was correct within my .bashrc. When I go through the Git console and type locale I get:

$ locale
LANG=en_US.UTF8
LC_CTYPE="en_US.UTF8"
LC_NUMERIC="en_US.UTF8"
LC_TIME="en_US.UTF8"
LC_COLLATE="en_US.UTF8"
LC_MONETARY="en_US.UTF8"
LC_MESSAGES="en_US.UTF8"
LC_ALL=en_US.UTF8

but when I open up my docker and run lamachine then check the locale it changes on me back to what it normally is

lamachine@2cf37561e1c6:/$ locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
lamachine@2cf37561e1c6:/$

and if I specify the LANG and then add the rest of the ticcl.nf line I get

lamachine@44c91a2d8955:~$ LANG=en_US.UTF8 ticcl.nf --inputdir /home/lamachine --lexicon /data/int/eng/eng.aspell.dict --alphabet /data/int/eng/eng.aspell.dict.lc.chars --charconfus /data/int/eng/eng.aspell.dict.c0.d2.confusion
N E X T F L O W  ~  version 0.30.2
Launching `/usr/local/bin/ticcl.nf` [pedantic_einstein] - revision: 3bd4e988b7
--------------------------
TICCL Pipeline
--------------------------
[warm up] executor > local
[c2/f70e2d] Submitted process > corpusfrequency (1)
[ee/877eae] Submitted process > ticclunk (1)
ERROR ~ Error executing process > 'ticclunk (1)'

Caused by:
  Process `ticclunk (1)` terminated with an error exit status (1)

Command executed:

  set +u
  if [ ! -z "" ]; then
      source /bin/activate
  fi
  set -u

  TICCL-unk --background "eng.aspell.dict" --artifrq 10000000 "corpus.wordfreqlist.tsv"

Command exit status:
  1

Command output:
  (empty)

Command error:
  unable to open background file: eng.aspell.dict

Work dir:
  /home/lamachine/work/ee/877eae81b31d4c34ba315fae4fd301

LaMachine is doing something to change the locale to POSIX and it's screwing up ticcl.nf.
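Note that a container does not read the host's .bashrc, so locale settings made in the Git console do not carry over. One way to make a UTF-8 locale visible inside the container is to pass the variables when starting it. A hedged sketch: `-e` is Docker's standard flag for setting environment variables, but the image name `proycon/lamachine` and the locale `C.UTF-8` are assumptions, and the locale must actually exist inside the image for this to help. The snippet only prints the command so it can be reviewed before running.

```shell
# Assumption: replace with the image you actually start.
image="proycon/lamachine"
# Assumption: this locale is available inside the image.
utf8_locale="C.UTF-8"

# Build the command with the locale passed in as environment variables.
cmd="docker run -t -i -e LANG=$utf8_locale -e LC_ALL=$utf8_locale $image"

# Print it instead of executing, so it can be checked first.
echo "$cmd"
```
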

proycon commented 6 years ago

The locale issue should now be resolved in LaMachine v2.2.12, could you retry at some point?

willstout commented 6 years ago

That works!

proycon commented 6 years ago

Great, I presume this resolves the entire issue (reopen if not).