LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

TICCL fails on plain text file; indexerNT result is empty #50

Closed: peterdekker closed this issue 4 years ago

peterdekker commented 5 years ago

When processing a simple text file (with just two Dutch sentences) through the PICCL web interface, TICCL fails. This is the error.log: https://pastebin.ubuntu.com/p/mmV9sqk8n2/

Input file:

dit is een zin dit ook

This is not a priority for our deployment; I'm just filing it here to let you know!

proycon commented 5 years ago

This seems very similar to #52; both fail at FoLiA-correct because of missing output.

peterdekker commented 5 years ago

I ran the same file in a new installation, and the same error occurs: https://pastebin.ubuntu.com/p/26Xfk4jRcD/

proycon commented 4 years ago

I wasn't sure of the status of this issue, since there have been various fixes in the meantime, so I checked again. Unfortunately, the issue is still relevant, as the indexer yields no results:

executor >  local (5)
[bf/8f54b6] process > txt2folia       [100%] 1 of 1 ✔
[23/7a5574] process > corpusfrequency [100%] 1 of 1 ✔
[34/7e7fc3] process > ticclunk        [100%] 1 of 1 ✔
[fd/9bcaa7] process > anahash         [100%] 1 of 1 ✔
[19/103baa] process > indexer         [100%] 1 of 1, failed: 1 ✘
[-        ] process > resolver        -
[-        ] process > rank            -
[-        ] process > chainer         -
[-        ] process > foliacorrect    -
Error executing process > 'indexer (1)'

Caused by:
  Process `indexer (1)` terminated with an error exit status (6)

Command executed:

  #!/bin/bash
  set +u
  if [ ! -z "/var/www/lamachine2/weblamachine" ]; then
      source /var/www/lamachine2/weblamachine/bin/activate
  fi
  set -u

  TICCL-indexerNT --hash "corpus.wordfreqlist.tsv.clean.anahash" --charconf "confusion.lst" --foci "corpus.wordfreqlist.tsv.clean.corpusfoci" -o "corpus.wordfreqlist.tsv.clean" -t 56 --low 5 --high 35 || exit 1

  if [ ! -s "corpus.wordfreqlist.tsv.clean.indexNT" ]; then
      echo "ERROR: Expected output corpus.wordfreqlist.tsv.clean.indexNT does not exist or is empty">&2
      exit 6
  fi

Command exit status:
  6

Command output:
  Now using node v13.13.0 (npm v6.14.4)
  reading corpus word anagram hash values
  read 206669 corpus word anagram values
  skipped 2424 out-of-band corpus word values
  read 1 foci values
  read 275652 character confusion anagram values
  created 1 separate experiments
  running on 1 threads.

  wrote indexes into: corpus.wordfreqlist.tsv.clean.indexNT

Command error:

  ERROR: Expected output corpus.wordfreqlist.tsv.clean.indexNT does not exist or is empty

I'm unassigning myself, though: this is not something I can maintain or solve if it's caused by a deeper issue in ticcltools. If it's deemed a pipeline problem and a viable solution is proposed, I can help again.
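The shell check in the log above (`[ ! -s ... ]`) reports a missing output file and an empty one with the same message. A minimal sketch of a check that tells the two cases apart might look like the following; the function name `output_status` is hypothetical and is not part of PICCL or ticcltools.

```python
import os


def output_status(path):
    """Classify an expected output file as 'missing', 'empty' or 'ok'.

    Hypothetical helper for illustration; it is not part of PICCL.
    """
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    return "ok"


if __name__ == "__main__":
    # File name taken from the log above.
    print(output_status("corpus.wordfreqlist.tsv.clean.indexNT"))
```

In the run above, the indexer reported that it wrote the index file, so a check of this kind would presumably have returned "empty" rather than "missing".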

martinreynaert commented 4 years ago

This was a non-issue.

Within TICCL a minimum word length is set. All words in this input file are at most three characters long. They are also all very common, correctly spelled words that are present in even the most basic TICCL lexicon, e.g. the Aspell lexicon. So there is nothing here for TICCL to work on.

If you want to see TICCL work properly, feed it a proper text, please. To see it work well, feed it a large corpus, give it a large lexicon and name list, or do both.
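To check in advance whether an input leaves TICCL anything to work on, a rough pre-check along the lines of the explanation above can be sketched in Python. This is not TICCL's own logic: the bounds 5 and 35 are copied from the `--low 5 --high 35` arguments in the logged command on the assumption that they are word-length limits, and the tokenisation, the function name `candidate_words` and the `lexicon` parameter are all hypothetical.

```python
import re

# Assumption: these mirror --low 5 --high 35 from the logged command and
# are interpreted here as minimum and maximum word length in characters.
LOW, HIGH = 5, 35


def candidate_words(text, low=LOW, high=HIGH, lexicon=frozenset()):
    """Return distinct lowercased words within [low, high] characters
    that are not in the given lexicon, in order of first appearance."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    result = []
    for word in words:
        if low <= len(word) <= high and word not in lexicon and word not in result:
            result.append(word)
    return result


if __name__ == "__main__":
    # The input file from this issue: every word has at most three characters.
    print(candidate_words("dit is een zin dit ook"))  # prints []
```

An empty result for the input file from this issue is consistent with the empty index file reported by the pipeline.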