LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

Plain text processing does not work as expected? #23

Closed proycon closed 4 years ago

proycon commented 6 years ago

Moved from proycon/LaMachine#37, by @mathias3

nextflow run LanguageMachines/PICCL/ --inputdir /home/projects/Kaggle_denoise/ICDAR-2017-Post-OCR-Correction/text/ --lexicon data/int/pol/pol.aspell.dict --alphabet data/int/pol/ --charconfus data/int/pol/pol.aspell.dict.c0.d2.confusion --inputtype 'text'

N E X T F L O W ~ version 0.27.4
Launching LanguageMachines/PICCL [mad_legentil] - revision: a006ed747c [master]
NOTE: Your local project version looks outdated - a different revision is available in the remote repository [e9754ef2e1]

TICCL Pipeline

[warm up] executor > local
[f4/ce2b3e] Submitted process > txt2folia (1)
[3f/e17ce8] Submitted process > corpusfrequency (1)
ERROR ~ Error executing process > 'corpusfrequency (1)'

Caused by:
Missing output file(s) corpus.wordfreqlist.tsv expected by process corpusfrequency (1)

Command executed:

set +u
if [ ! -z "" ]; then
source /bin/activate
set -u

FoLiA-stats --class "OCR" -s -t 1 -e folia.xml --lang=none --ngram 1 -o corpus .

Command exit status:

Command output:
start processing of 1 files
done processsing directory '.'
start calculating the results
in total 0 n-grams were found.

Command error:

XML-error: PCDATA invalid Char value 12

FoLiA-stats: failed to load document './doc.folia.xml'
FoLiA-stats: reason: XML error: No XML document read

Work dir:

Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out

-- Check '.nextflow.log' file for details

@martinreynaert Can you replicate this problem and suggest a remedy?
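
A note on the XML error in the log above: character value 12 is the form feed (0x0C), which XML 1.0 does not allow in character data. Tesseract's plain-text output commonly uses form feeds as page separators, so that may be where it comes from here, though nothing in this thread confirms it. A minimal Python sketch of a pre-cleaning step for the .txt files, assuming UTF-8 input; the function names are hypothetical and not part of PICCL or foliautils:

```python
import re

# Control characters that XML 1.0 forbids in character data: every C0 control
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
# Form feed (0x0C, decimal 12) is among them.
_XML_INVALID_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_xml_invalid_chars(text: str, replacement: str = "\n") -> str:
    """Replace control characters that are illegal in XML 1.0 (hypothetical helper)."""
    return _XML_INVALID_CONTROL.sub(replacement, text)


def clean_text_file(src_path: str, dst_path: str) -> int:
    """Write a cleaned copy of src_path to dst_path.

    Returns the number of characters that were replaced, so the caller can
    tell whether the file was affected at all.
    """
    with open(src_path, encoding="utf-8") as src:
        text = src.read()
    cleaned, count = _XML_INVALID_CONTROL.subn("\n", text)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(cleaned)
    return count
```

Replacing with a newline rather than deleting keeps the words on either side of a page break apart. The sketch only covers the C0 control range; other code points that XML 1.0 excludes (such as U+FFFE and U+FFFF) are not handled.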

proycon commented 6 years ago

@martinreynaert I see that TICCL-stats is not implemented in the pipeline yet. I suppose we want to use that instead of FoLiA-stats for plain text input, as you suggested?

proycon commented 6 years ago

@kosloot FoLiA-stats reports a wrong exit code here despite the error: 0 instead of non-zero. Is it possible that FoLiA-txt has a similar problem? The conversion might have failed for some reason and gone undetected because of a wrong exit code.
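
Until the exit codes are reliable, the calling side can guard against this by checking the expected output files as well as the exit status. A minimal Python sketch of that idea; `run_and_verify` is a hypothetical helper, not something PICCL or Nextflow provides:

```python
import os
import subprocess


def run_and_verify(cmd, expected_outputs, cwd="."):
    """Run cmd; succeed only if it exits with 0 AND each expected output
    file exists and is non-empty (hypothetical helper).

    Returns the CompletedProcess on success, raises RuntimeError otherwise.
    """
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with status {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    # An exit status of 0 is not trusted on its own: look at the files too.
    missing = []
    for name in expected_outputs:
        path = os.path.join(cwd, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    if missing:
        raise RuntimeError(
            f"{cmd[0]} exited with status 0 but did not produce: "
            f"{', '.join(missing)}; stderr was: {result.stderr.strip()}"
        )
    return result
```

For the failure reported in this issue, `cmd` would be the FoLiA-stats invocation from the log and `expected_outputs` would be `["corpus.wordfreqlist.tsv"]`.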

mathias3 commented 6 years ago

Great, @proycon, now it works, at least in this pipeline: tesseract -> FoLiA -> post-processed FoLiA -> txt. An additional question: for now I cannot use my own .txt files, which were OCR-ed with Tesseract from PDFs outside LaMachine. I would also like to know which version of Tesseract is included in the Docker image. I found Tesseract 4.0 and up to be a big improvement in quality thanks to its LSTM networks. Are you using it in LaMachine already? Thanks

martinreynaert commented 6 years ago

Dear Mathias,

The Tesseract version is documented at least in the 'hocr' files:

I search for them in the work/ directory in the following way:

$ du -a /var/www/webservices-lst/live/writable/piccl/projects/mre/MorsePDF/work
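
The hOCR format records the OCR engine in a `meta` element named `ocr-system`, so the version can also be read out programmatically. A minimal Python sketch, assuming the hOCR files in the work directory carry a `.hocr` extension (an assumption, not confirmed in this thread); the function names are hypothetical:

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import Dict, Optional


class _OcrSystemParser(HTMLParser):
    """Remembers the content of the first <meta name="ocr-system"> element."""

    def __init__(self):
        super().__init__()
        self.ocr_system = None

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags also end up here by default.
        if tag == "meta" and self.ocr_system is None:
            attributes = dict(attrs)
            if attributes.get("name") == "ocr-system":
                self.ocr_system = attributes.get("content")


def ocr_system_from_hocr(hocr_text: str) -> Optional[str]:
    """Return the ocr-system string of an hOCR document, or None if absent."""
    parser = _OcrSystemParser()
    parser.feed(hocr_text)
    return parser.ocr_system


def ocr_systems_in(work_dir: str) -> Dict[str, Optional[str]]:
    """Map every *.hocr file below work_dir to its ocr-system string."""
    return {
        str(path): ocr_system_from_hocr(
            path.read_text(encoding="utf-8", errors="replace")
        )
        for path in sorted(Path(work_dir).rglob("*.hocr"))
    }
```

Calling `ocr_systems_in` on the work/ directory of a project gives one entry per hOCR file, which makes it easy to spot runs that were processed with different Tesseract versions.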

I am not planning to move to Tesseract 4.0 at this stage. As far as I have read, it has not yet been trained much on Fraktur, and a lot of what I do involves this older print. But we might look into providing both. You should be able to insert your own OCR-ed texts into the pipeline. Please state why this does not work.

One thing that puzzled me earlier: you seemed to have the ICDAR test data as input. As far as I know, those were French and English texts. So why do you use a Polish alphabet and lexicon?


proycon commented 6 years ago

The Tesseract version in the image is 3.05.01-3 with data files for 3.04, as Martin reported. The reason is simple: that's the version the underlying Arch Linux packages are built for, which LaMachine in turn uses. When those packages are eventually updated, LaMachine follows automatically. (see also

(The default LaMachine base distribution might change in the future from Arch to Debian, though, which tends to be more conservative.)

mathias3 commented 6 years ago

Dear @martinreynaert, the input folder indeed had some files from that competition, but I added a folder /text and filled it with Polish .txt files (Tesseract output) to keep all my experiments with LaMachine in one directory. As I stated earlier, I wanted to run TICCL on files that were OCR-ed outside LaMachine, but TICCL does not accept .txt files as input. Here are my input command and the output from LaMachine:

My input command

[root@XXXXXXXX LaMachine]# nextflow run LanguageMachines/PICCL/ --inputdir /home/projects/Kaggle_denoise/out/ --outputdir /home/projects/ --inputtype text --language pol --lexicon data/int/pol/pol.aspell.dict --alphabet data/int/pol/ --charconfus data/int/pol/pol.aspell.dict.c0.d2.confusion

LaMachine output

N E X T F L O W  ~  version 0.27.4
Launching `LanguageMachines/PICCL` [awesome_chandrasekhar] - revision: a006ed747c [master]
NOTE: Your local project version looks outdated - a different revision is available in the remote repository [f6412c8f23]
TICCL Pipeline
[warm up] executor > local
[0e/e44c17] Submitted process > txt2folia (1)
[73/f2e17a] Submitted process > corpusfrequency (1)
ERROR ~ Error executing process > 'corpusfrequency (1)'

Caused by:
  Missing output file(s) `corpus.wordfreqlist.tsv` expected by process `corpusfrequency (1)`

Command executed:

  set +u
  if [ ! -z "" ]; then
      source /bin/activate

  set -u

  FoLiA-stats --class "OCR" -s -t 1 -e folia.xml --lang=none --ngram 1 -o corpus .

Command exit status:

Command output:
  start processing of 1 files 
  done processsing directory '.'
  start calculating the results
  in total 0 n-grams were found.

Command error:

  XML-error: PCDATA invalid Char value 12

  FoLiA-stats: failed to load document './doc.folia.xml'
  FoLiA-stats: reason: XML error: No XML document read

Work dir:

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

JessedeDoes commented 5 years ago

We still have problems with plain text. error.nijmegen.log

In the Leiden instance, the small text file does give results, but other files do not.

The following input file, havelaar.txt, gives this log file:

[CLAM Dispatcher] Adding to PYTHONPATH: /vol1/lamachine/lib/python3.6/site-packages/PICCL-0.6.2-py3.6.egg/picclservice
[CLAM Dispatcher] Started CLAM Dispatcher v2.3.3 with picclservice.picclservice (2018-11-26 10:20:23)
[CLAM Dispatcher] Running /vol1/lamachine/bin/python3 "/vol1/lamachine/lib/python3.6/site-packages/PICCL-0.6.2-py3.6.egg/picclservice/" "/vol1/lamachine/piccl.clam/projects/" "/vol1/lamachine/piccl.clam/projects/" "/vol1/lamachine/piccl.clam/projects/" "/vol1/lamachine/piccl.clam/projects/" "/vol1/lamachine/piccldata" "/vol1/lamachine/src/PICCL"
[CLAM Dispatcher] Running with pid 35434 (2018-11-26 10:20:23)
Running PICCL from /vol1/lamachine/src/PICCL/
System default encoding:  utf-8
Forcing en_US.UTF-8 locale...
Tokeniser enabled (True)
Command: /vol1/lamachine/src/PICCL/ --inputdir . --inputtype text --outputdir "ticcl_out" --lexicon lexicon.lst --alphabet alphabet.lst --charconfus confusion.lst --clip 1 --distance 2 --clip 1 --pdfhandling single -with-trace >ticcl.nextflow.out.log 2>ticcl.nextflow.err.log
[ticcl] Nextflow standard error output

[ticcl] Nextflow standard output
N E X T F L O W  ~  version 0.30.0
Launching `/vol1/lamachine/src/PICCL/` [silly_swanson] - revision: 4d24a17dc3
TICCL Pipeline
[warm up] executor > local
[c0/f4c6c6] Submitted process > txt2folia (2)
[02/38ac7e] Submitted process > txt2folia (1)
[62/60b850] Submitted process > corpusfrequency (1)
[2e/93a473] Submitted process > ticclunk (1)
ERROR ~ Error executing process > 'txt2folia (1)'

Caused by:
  Missing output file(s) `havelaar.folia.xml` expected by process `txt2folia (1)`

Command executed:

  set +u
  if [ ! -z "" ]; then
      source /bin/activate
  set -u

  FoLiA-txt --class OCR -t 1 -O . "havelaar.txt"

Command exit status:

Command output:

Command error:
  nu useful data found in document:'havelaar'

Work dir:

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash`

 -- Check '.nextflow.log' file for details
WARN: Killing pending tasks (1)

[ticcl] Nextflow trace summary
task_id hash    native_id   name    status  exit    submit  duration    realtime    %cpu    rss vmem    rchar   wchar
2   c0/f4c6c6   35607   txt2folia (2)   COMPLETED   0   2018-11-26 10:20:27.490 301ms   42ms    0,0%    0   0   0   0
3   62/60b850   35777   corpusfrequency (1) COMPLETED   0   2018-11-26 10:20:27.835 257ms   37ms    0,0%    0   0   0   0
1   02/38ac7e   35620   txt2folia (1)   FAILED  0   2018-11-26 10:20:27.524 5.7s    3.7s    102,0%  37 MB   2208 MB 7878 KB 0
4   2e/93a473   35932   ticclunk (1)    ABORTED -   2018-11-26 10:20:28.111 -   -   -   -   -   -   -

[CLAM Dispatcher] Process ended (2018-11-26 10:20:33, 10.01889s) 
[CLAM Dispatcher] Removing temporary files
[CLAM Dispatcher] Status code out of range (256), setting to 127
[CLAM Dispatcher] Finished (2018-11-26 10:20:33), exit code 127, dispatcher wait time 10.0s, duration 10.019817s
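
The trace summary in this log shows `txt2folia (1)` with status FAILED but exit code 0, so the exit column alone is not enough to find the culprit. A minimal Python sketch that lists every task whose status is not COMPLETED, assuming the trace is available as tab-separated text with the column names shown above; `unsuccessful_tasks` is a hypothetical helper:

```python
import csv
import io


def unsuccessful_tasks(trace_text: str):
    """Return (name, status, exit) for each task that did not complete.

    Expects tab-separated trace text with a header row that contains at
    least the columns 'name', 'status' and 'exit'.
    """
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return [
        (row["name"], row["status"], row["exit"])
        for row in reader
        if row["status"] != "COMPLETED"
    ]
```

Filtering on the status column rather than the exit column is the point here: a task can be marked FAILED by Nextflow because of a missing output file even though the command itself returned 0.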

proycon commented 5 years ago

Confirmed, it seems FoLiA-txt chokes on that file for some reason. But with the latest development version of ticcltools it runs fine, so it seems @kosloot already fixed whatever the problem was. So the question becomes: when do @martinreynaert and @kosloot deem ticcltools ready for release?

proycon commented 5 years ago

(correction, it's part of foliautils and not ticcltools)

kosloot commented 5 years ago

A new release of foliautils has been published. It should be available in LaMachine very soon.

proycon commented 5 years ago

@JessedeDoes Should be solved by the foliautils release, also deployed in Nijmegen.

proycon commented 5 years ago

(A problem was still found when TICCL was disabled; a fix has been committed but not yet released or deployed.)

JessedeDoes commented 5 years ago

We are trying to finish the INT installation.

Can this be released?

proycon commented 5 years ago

Ah yes, this has been released since v0.7.2; I forgot to mention it here.

proycon commented 4 years ago

(Closing this issue; it should have been solved for a while already.)