LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

Plain text processing does not work as expected? #23

Closed (proycon closed 4 years ago)

proycon commented 6 years ago

Moved from proycon/LaMachine#37, by @mathias3

nextflow run LanguageMachines/PICCL/ticcl.nf --inputdir /home/projects/Kaggle_denoise/ICDAR-2017-Post-OCR-Correction/text/ --lexicon data/int/pol/pol.aspell.dict --alphabet data/int/pol/pol.aspell.dict.lc.chars --charconfus data/int/pol/pol.aspell.dict.c0.d2.confusion --inputtype 'text'

N E X T F L O W ~ version 0.27.4
Launching LanguageMachines/PICCL [mad_legentil] - revision: a006ed747c [master]
NOTE: Your local project version looks outdated - a different revision is available in the remote repository [e9754ef2e1]

TICCL Pipeline

[warm up] executor > local
[f4/ce2b3e] Submitted process > txt2folia (1)
[3f/e17ce8] Submitted process > corpusfrequency (1)
ERROR ~ Error executing process > 'corpusfrequency (1)'

Caused by:
Missing output file(s) corpus.wordfreqlist.tsv expected by process corpusfrequency (1)

Command executed:

set +u
if [ ! -z "" ]; then
source /bin/activate
fi
set -u

FoLiA-stats --class "OCR" -s -t 1 -e folia.xml --lang=none --ngram 1 -o corpus .

Command exit status:
0

Command output:
start processing of 1 files
done processsing directory '.'
start calculating the results
in total 0 n-grams were found.

Command error:

XML-error: PCDATA invalid Char value 12

FoLiA-stats: failed to load document './doc.folia.xml'
FoLiA-stats: reason: XML error: No XML document read

Work dir:
/usr/src/LaMachine/work/3f/e17ce8cb5ee9d497957609b34fdc29

Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out

-- Check '.nextflow.log' file for details

@martinreynaert Can you replicate this problem and suggest a remedy?
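The `XML-error: PCDATA invalid Char value 12` line in the log refers to character code 12, the form feed, which XML 1.0 does not allow in character data. A minimal sketch of a pre-processing step that removes such characters from plain-text input before it reaches the pipeline is shown below. It assumes the form feeds come from the OCR output (for example as page separators), which the log does not confirm. The function names are hypothetical and not part of PICCL. Only C0 control characters are handled here.

```python
import re
from pathlib import Path

# Control characters that XML 1.0 forbids in character data:
# every C0 control except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_xml_chars(text: str, replacement: str = "\n") -> str:
    """Replace control characters that are illegal in XML 1.0 (e.g. form feed, value 12)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


def clean_directory(input_dir: str, output_dir: str) -> int:
    """Write cleaned copies of all *.txt files; return how many files were changed.

    Hypothetical helper: the directory layout is an assumption, not a PICCL convention.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    changed = 0
    for path in sorted(Path(input_dir).glob("*.txt")):
        original = path.read_text(encoding="utf-8", errors="replace")
        cleaned = strip_invalid_xml_chars(original)
        if cleaned != original:
            changed += 1
        (out / path.name).write_text(cleaned, encoding="utf-8")
    return changed
```

The cleaned directory could then be passed as `--inputdir` instead of the original one. Replacing with a newline rather than deleting keeps text on either side of a page break from being glued together.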

proycon commented 6 years ago

@martinreynaert I see that TICCL-stats is not implemented in the pipeline yet. I suppose we want to use it instead of FoLiA-stats for plain text input, as you suggested?

proycon commented 6 years ago

@kosloot FoLiA-stats reports the wrong exit code here despite the error: 0 instead of non-zero. Is it possible that FoLiA-txt has a similar problem? The conversion might have failed for some reason and gone undetected because of a wrong exit code.
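Nextflow already notices the missing output file in the log above. For scripts that call such tools outside Nextflow, a rough sketch of a defensive wrapper that distrusts the exit code and also checks for the expected output file could look like this. `run_checked` is a hypothetical helper, not part of PICCL or foliautils.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def run_checked(command: Sequence[str], expected_output: str,
                cwd: str = ".") -> subprocess.CompletedProcess:
    """Run a command; fail on a non-zero exit status OR a missing output file."""
    result = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{command[0]} exited with status {result.returncode}:\n{result.stderr}"
        )
    # A tool may report success (status 0) although it wrote nothing useful,
    # so the presence of the expected file is verified separately.
    if not (Path(cwd) / expected_output).is_file():
        raise RuntimeError(
            f"{command[0]} exited with status 0 but did not write "
            f"{expected_output!r}:\n{result.stderr}"
        )
    return result
```

Including the captured standard error in the exception keeps messages such as the XML error visible even when the exit status claims success.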

mathias3 commented 6 years ago

Great, @proycon, now it works, at least in a pipeline: tesseract -> folia -> postprocessed folia -> txt. My additional question: for now, I cannot use my own .txt files, which were OCR-ed with Tesseract from PDFs outside LaMachine. I would like to know which version of Tesseract is included in the Docker image. I found Tesseract 4.0 and up to be a big improvement in quality thanks to LSTM networks. Are you using it in LaMachine already? Thanks

martinreynaert commented 6 years ago

Dear Mathias,

The Tesseract version is documented at least in the 'hocr' files:

I search for them in the work/ directory in the following way:

$ du -a /var/www/webservices-lst/live/writable/piccl/projects/mre/MorsePDF/work

I am not planning to move to Tesseract 4.0 at this stage. As far as I have read, it has not yet been trained much on Fraktur, and a lot of what I do involves this older print. But we might look into providing both. You should be able to insert your own OCR-ed texts into the pipeline. Please state why this does not work.

One thing that puzzled me earlier: you seemed to have the ICDAR test data as input. As far as I know, those were French and English texts. So why do you use a Polish alphabet and lexicon?

Martin
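The lookup Martin describes, finding the Tesseract version recorded in the hOCR files under work/, can be sketched as follows. This assumes the files end in `.hocr` and carry the `ocr-system` meta element that hOCR output normally includes in its header. Both function names are hypothetical.

```python
import re
from pathlib import Path
from typing import Dict, Optional

# Matches e.g. <meta name='ocr-system' content='tesseract 3.05.01' />
_OCR_SYSTEM = re.compile(
    r"<meta\s+name=[\"']ocr-system[\"']\s+content=[\"']([^\"']+)[\"']",
    re.IGNORECASE,
)


def extract_ocr_system(hocr_text: str) -> Optional[str]:
    """Return the content of the ocr-system meta element, or None if absent."""
    match = _OCR_SYSTEM.search(hocr_text)
    return match.group(1) if match else None


def ocr_systems(work_dir: str) -> Dict[str, str]:
    """Map every *.hocr file below work_dir to the OCR system it reports."""
    found: Dict[str, str] = {}
    for path in sorted(Path(work_dir).rglob("*.hocr")):
        # The meta element sits in the <head>, so reading the start of the file suffices.
        head = path.read_text(encoding="utf-8", errors="replace")[:4096]
        system = extract_ocr_system(head)
        if system is not None:
            found[str(path)] = system
    return found
```

Calling `ocr_systems` on the project's work/ directory would list each hOCR file together with the OCR engine string it declares.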

proycon commented 6 years ago

The tesseract version in the image is 3.05.01-3 with data files for 3.04 as Martin reported. The reason is simple; that's the version the underlying Arch Linux packages are built for, which LaMachine in turn uses. When those packages are updated eventually, LaMachine follows automatically. (see also https://www.archlinux.org/packages/?sort=&q=tesseract)

(The default LaMachine base distribution might change from Arch to Debian in the future though, which tends to be more conservative)

mathias3 commented 6 years ago

Dear @martinreynaert, the input folder did indeed contain some files from that competition, but I added a folder /text and filled it with Polish .txt files (Tesseract output) to keep all my experiments with LaMachine in one directory. As I stated earlier, I wanted to use ticcl.nf on files that were OCR-ed outside of LaMachine, but ticcl does not accept .txt files as input. Here are my input command and the output and standard error from LaMachine.

My Input cmd

[root@XXXXXXXX LaMachine]# nextflow run LanguageMachines/PICCL/ticcl.nf --inputdir /home/projects/Kaggle_denoise/out/ --outputdir /home/projects/ --inputtype text --language pol --lexicon data/int/pol/pol.aspell.dict --alphabet data/int/pol/pol.aspell.dict.lc.chars --charconfus data/int/pol/pol.aspell.dict.c0.d2.confusion

LaMachine output

N E X T F L O W  ~  version 0.27.4
Launching `LanguageMachines/PICCL` [awesome_chandrasekhar] - revision: a006ed747c [master]
NOTE: Your local project version looks outdated - a different revision is available in the remote repository [f6412c8f23]
--------------------------
TICCL Pipeline
--------------------------
[warm up] executor > local
[0e/e44c17] Submitted process > txt2folia (1)
[73/f2e17a] Submitted process > corpusfrequency (1)
ERROR ~ Error executing process > 'corpusfrequency (1)'

Caused by:
  Missing output file(s) `corpus.wordfreqlist.tsv` expected by process `corpusfrequency (1)`

Command executed:

  set +u
  if [ ! -z "" ]; then
      source /bin/activate
  fi

  set -u

  FoLiA-stats --class "OCR" -s -t 1 -e folia.xml --lang=none --ngram 1 -o corpus .

Command exit status:
  0

Command output:
  start processing of 1 files 
  done processsing directory '.'
  start calculating the results
  in total 0 n-grams were found.

Command error:

  XML-error: PCDATA invalid Char value 12

  FoLiA-stats: failed to load document './doc.folia.xml'
  FoLiA-stats: reason: XML error: No XML document read

Work dir:
  /usr/src/LaMachine/work/73/f2e17aec34212b43237385fb02c374

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

JessedeDoes commented 5 years ago

We still have problems with plain text.

https://webservices-lst.science.ru.nl/piccl/Eight/output/error.log: error.nijmegen.log

In the Leiden instance, the small text file does give results, but other files do not.

The following input file, havelaar.txt, gives this log file:

[CLAM Dispatcher] Adding to PYTHONPATH: /vol1/lamachine/lib/python3.6/site-packages/PICCL-0.6.2-py3.6.egg/picclservice
[CLAM Dispatcher] Started CLAM Dispatcher v2.3.3 with picclservice.picclservice (2018-11-26 10:20:23)
[CLAM Dispatcher] Running /vol1/lamachine/bin/python3 "/vol1/lamachine/lib/python3.6/site-packages/PICCL-0.6.2-py3.6.egg/picclservice/picclservice_wrapper.py" "/vol1/lamachine/piccl.clam/projects/j.de.does@umail.leidenuniv.nl/WeerEenHavelaar/clam.xml" "/vol1/lamachine/piccl.clam/projects/j.de.does@umail.leidenuniv.nl/WeerEenHavelaar/.status" "/vol1/lamachine/piccl.clam/projects/j.de.does@umail.leidenuniv.nl/WeerEenHavelaar/input/" "/vol1/lamachine/piccl.clam/projects/j.de.does@umail.leidenuniv.nl/WeerEenHavelaar/output/" "/vol1/lamachine/piccldata" "/vol1/lamachine/src/PICCL"
[CLAM Dispatcher] Running with pid 35434 (2018-11-26 10:20:23)
Running PICCL from /vol1/lamachine/src/PICCL/
System default encoding:  utf-8
Forcing en_US.UTF-8 locale...
Tokeniser enabled (True)
Command: /vol1/lamachine/src/PICCL/ticcl.nf --inputdir . --inputtype text --outputdir "ticcl_out" --lexicon lexicon.lst --alphabet alphabet.lst --charconfus confusion.lst --clip 1 --distance 2 --clip 1 --pdfhandling single -with-trace >ticcl.nextflow.out.log 2>ticcl.nextflow.err.log
[ticcl] Nextflow standard error output
-------------------------------------------------

[ticcl] Nextflow standard output
-------------------------------------------------
N E X T F L O W  ~  version 0.30.0
Launching `/vol1/lamachine/src/PICCL/ticcl.nf` [silly_swanson] - revision: 4d24a17dc3
--------------------------
TICCL Pipeline
--------------------------
[warm up] executor > local
[c0/f4c6c6] Submitted process > txt2folia (2)
[02/38ac7e] Submitted process > txt2folia (1)
[62/60b850] Submitted process > corpusfrequency (1)
[2e/93a473] Submitted process > ticclunk (1)
ERROR ~ Error executing process > 'txt2folia (1)'

Caused by:
  Missing output file(s) `havelaar.folia.xml` expected by process `txt2folia (1)`

Command executed:

  set +u
  if [ ! -z "" ]; then
      source /bin/activate
  fi
  set -u

  FoLiA-txt --class OCR -t 1 -O . "havelaar.txt"

Command exit status:
  0

Command output:
  (empty)

Command error:
  nu useful data found in document:'havelaar'
  skipped!

Work dir:
  /vol1/lamachine/piccl.clam/projects/j.de.does@umail.leidenuniv.nl/WeerEenHavelaar/work/02/38ac7e622706e0225f9b7c9c540a57

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
WARN: Killing pending tasks (1)

[ticcl] Nextflow trace summary
-------------------------------------------------
task_id hash    native_id   name    status  exit    submit  duration    realtime    %cpu    rss vmem    rchar   wchar
2   c0/f4c6c6   35607   txt2folia (2)   COMPLETED   0   2018-11-26 10:20:27.490 301ms   42ms    0,0%    0   0   0   0
3   62/60b850   35777   corpusfrequency (1) COMPLETED   0   2018-11-26 10:20:27.835 257ms   37ms    0,0%    0   0   0   0
1   02/38ac7e   35620   txt2folia (1)   FAILED  0   2018-11-26 10:20:27.524 5.7s    3.7s    102,0%  37 MB   2208 MB 7878 KB 0
4   2e/93a473   35932   ticclunk (1)    ABORTED -   2018-11-26 10:20:28.111 -   -   -   -   -   -   -

[CLAM Dispatcher] Process ended (2018-11-26 10:20:33, 10.01889s) 
[CLAM Dispatcher] Removing temporary files
[CLAM Dispatcher] Status code out of range (256), setting to 127
[CLAM Dispatcher] Finished (2018-11-26 10:20:33), exit code 127, dispatcher wait time 10.0s, duration 10.019817s

proycon commented 5 years ago

Confirmed, it seems FoLiA-txt chokes on that file for some reason. But on the latest development version of the ticcltools it does run fine, so it seems @kosloot already fixed whatever was the problem. So then the question becomes when @martinreynaert and @kosloot deem ticcltools ready for release?

proycon commented 5 years ago

(correction, it's part of foliautils and not ticcltools)

kosloot commented 5 years ago

A new release of foliautils has been published. It should be available in LaMachine very soon.

proycon commented 5 years ago

@JessedeDoes Should be solved by the foliautils release, also deployed in Nijmegen.

proycon commented 5 years ago

(a problem was still found when TICCL was disabled; a fix has been committed but not released/deployed yet)

JessedeDoes commented 5 years ago

We are trying to finish the INT installation.

Can this be released?

proycon commented 5 years ago

Ah yes, this has been released since v0.7.2; I forgot to mention it here.

proycon commented 4 years ago

(closing this issue; it should have been solved for a while already)