Closed by proycon 4 years ago
@martinreynaert I see that TICCL-stats is not implemented in the pipeline yet. I suppose we want to use it instead of FoLiA-stats for plain text input, as you suggested?
@kosloot FoLiA-stats reports a wrong exit code here despite the error: 0 instead of non-zero. Is it possible that FoLiA-txt has a similar problem? The conversion might have failed for some reason and gone undetected because of a wrong exit code.
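One way to guard against a tool that exits with status 0 after failing is to check for its expected output file as well as its exit status. The following is a minimal sketch of that idea, not part of PICCL or foliautils; the function name `run_and_verify` and its parameters are hypothetical.

```python
import subprocess
from pathlib import Path


def run_and_verify(cmd, expected_outputs, cwd="."):
    """Run cmd in cwd and raise if it did not really succeed.

    Hypothetical helper: a run counts as successful only when the exit
    status is 0 AND every expected output file exists and is non-empty.
    """
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    stderr = result.stderr.strip()
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with status {result.returncode}: {stderr}"
        )
    # Exit status 0 alone is not trusted: check the promised files too.
    missing = []
    for name in expected_outputs:
        path = Path(cwd) / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    if missing:
        raise RuntimeError(
            f"{cmd[0]} exited with status 0 but did not produce {missing}: {stderr}"
        )
    return result
```

For a conversion step, this could be called with the converter's command line as `cmd` and the file name the pipeline expects afterwards as `expected_outputs`. A silent failure then surfaces at the step that caused it, along with the tool's own stderr message.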
Great, @proycon, now it works at least in a pipeline: tesseract -> folia -> postprocessed folia -> txt. My additional question: as of now, I cannot use my own txt files, which were produced with Tesseract from PDFs outside LaMachine. I would like to know which version of Tesseract is included in the Docker image. I found Tesseract 4.0 and up to be a big improvement in quality thanks to LSTM networks. Are you using it in LaMachine already? Thanks
Dear Mathias,
The Tesseract version is documented at least in the 'hocr' files:
I search for them in the work/ directory in the following way:
$ du -a /var/www/webservices-lst/live/writable/piccl/projects/mre/MorsePDF/work
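As a rough sketch of how the version could be read out of those files: Tesseract's hOCR output normally carries a `<meta name='ocr-system' ...>` element in its header. The snippet below assumes the files end in `.hocr` and are UTF-8 encoded; the function name `find_ocr_systems` is hypothetical.

```python
import re
from pathlib import Path

# Matches e.g. <meta name='ocr-system' content='tesseract 3.05.01' />
# (accepts single or double quotes around the attribute values).
OCR_SYSTEM = re.compile(
    r"""<meta\s+name=["']ocr-system["']\s+content=["']([^"']*)["']""",
    re.IGNORECASE,
)


def find_ocr_systems(root):
    """Map each *.hocr file under root to its declared ocr-system string."""
    found = {}
    for path in sorted(Path(root).rglob("*.hocr")):
        text = path.read_text(encoding="utf-8", errors="replace")
        match = OCR_SYSTEM.search(text)
        if match:
            found[str(path)] = match.group(1)
    return found
```

Calling `find_ocr_systems` on the project's work/ directory returns a dictionary from file path to the declared OCR system string. Files without the meta element are left out.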
I am not planning to move to Tesseract 4.0 at this stage. As far as I have read, it has not yet been trained much on Fraktur, and much of what I do involves this older print. But we might look into providing both. You should be able to insert your own OCR-ed texts into the pipeline. Please state why this does not work.
One thing that puzzled me earlier: you seemed to have the ICDAR test data as input. Now, as far as I know, those were French and English texts. So why do you use a Polish alphabet and lexicon?
Martin
The Tesseract version in the image is 3.05.01-3 with data files for 3.04, as Martin reported. The reason is simple: that is the version the underlying Arch Linux packages provide, and LaMachine uses those packages. When they are eventually updated, LaMachine follows automatically. (See also https://www.archlinux.org/packages/?sort=&q=tesseract)
(The default LaMachine base distribution might change from Arch to Debian in the future, though, and Debian tends to be more conservative.)
Dear @martinreynaert, the input folder indeed had some files from that competition, but I added a folder /text and filled it with Polish .txt files (output of Tesseract) to keep all my experiments with LaMachine in one directory. As I stated earlier, I wanted to use ticcl.nf on my files OCR-ed outside of LaMachine, but ticcl does not accept .txt files as input. Here is my command with the stdout and stderr from LaMachine:
[root@XXXXXXXX LaMachine]# nextflow run LanguageMachines/PICCL/ticcl.nf --inputdir /home/projects/Kaggle_denoise/out/ --outputdir /home/projects/ --inputtype text --language pol --lexicon data/int/pol/pol.aspell.dict --alphabet data/int/pol/pol.aspell.dict.lc.chars --charconfus data/int/pol/pol.aspell.dict.c0.d2.confusion
N E X T F L O W ~ version 0.27.4
Launching `LanguageMachines/PICCL` [awesome_chandrasekhar] - revision: a006ed747c [master]
NOTE: Your local project version looks outdated - a different revision is available in the remote repository [f6412c8f23]
--------------------------
TICCL Pipeline
--------------------------
[warm up] executor > local
[0e/e44c17] Submitted process > txt2folia (1)
[73/f2e17a] Submitted process > corpusfrequency (1)
ERROR ~ Error executing process > 'corpusfrequency (1)'
Caused by:
Missing output file(s) `corpus.wordfreqlist.tsv` expected by process `corpusfrequency (1)`
Command executed:
set +u
if [ ! -z "" ]; then
source /bin/activate
fi
set -u
FoLiA-stats --class "OCR" -s -t 1 -e folia.xml --lang=none --ngram 1 -o corpus .
Command exit status:
0
Command output:
start processing of 1 files
done processsing directory '.'
start calculating the results
in total 0 n-grams were found.
Command error:
XML-error: PCDATA invalid Char value 12
FoLiA-stats: failed to load document './doc.folia.xml'
FoLiA-stats: reason: XML error: No XML document read
Work dir:
/usr/src/LaMachine/work/73/f2e17aec34212b43237385fb02c374
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
-- Check '.nextflow.log' file for details
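The `PCDATA invalid Char value 12` message in the log above refers to character code 12, the ASCII form feed (U+000C), which XML 1.0 does not allow in document content. A plausible source, stated here as an assumption, is that the OCR tool wrote a form feed between pages in its plain text output. A minimal sketch of cleaning such text before conversion could look like this; the name `clean_for_xml` is hypothetical and this is not what FoLiA-txt itself does.

```python
import re

# Any character outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_for_xml(text):
    """Return text with characters that XML 1.0 forbids taken out.

    Form feeds become newlines so that a page boundary still separates
    the surrounding text; all other forbidden characters are dropped.
    """
    text = text.replace("\f", "\n")
    return INVALID_XML_CHARS.sub("", text)
```

Applied to each .txt file before it is placed in the input directory, this leaves ordinary text, including non-ASCII letters, untouched and removes only the control characters that an XML parser would reject.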
We still have problems with plain text.
https://webservices-lst.science.ru.nl/piccl/Eight/output/error.log: error.nijmegen.log
In the Leiden instance, the small text file does give results, but other files do not.
The following input file, havelaar.txt, gives this log file:
[CLAM Dispatcher] Adding to PYTHONPATH: /vol1/lamachine/lib/python3.6/site-packages/PICCL-0.6.2-py3.6.egg/picclservice
[CLAM Dispatcher] Started CLAM Dispatcher v2.3.3 with picclservice.picclservice (2018-11-26 10:20:23)
[CLAM Dispatcher] Running /vol1/lamachine/bin/python3 "/vol1/lamachine/lib/python3.6/site-packages/PICCL-0.6.2-py3.6.egg/picclservice/picclservice_wrapper.py" "/vol1/lamachine/piccl.clam/projects/j.de.does@umail.leidenuniv.nl/WeerEenHavelaar/clam.xml" "/vol1/lamachine/piccl.clam/projects/j.de.does@umail.leidenuniv.nl/WeerEenHavelaar/.status" "/vol1/lamachine/piccl.clam/projects/j.de.does@umail.leidenuniv.nl/WeerEenHavelaar/input/" "/vol1/lamachine/piccl.clam/projects/j.de.does@umail.leidenuniv.nl/WeerEenHavelaar/output/" "/vol1/lamachine/piccldata" "/vol1/lamachine/src/PICCL"
[CLAM Dispatcher] Running with pid 35434 (2018-11-26 10:20:23)
Running PICCL from /vol1/lamachine/src/PICCL/
System default encoding: utf-8
Forcing en_US.UTF-8 locale...
Tokeniser enabled (True)
Command: /vol1/lamachine/src/PICCL/ticcl.nf --inputdir . --inputtype text --outputdir "ticcl_out" --lexicon lexicon.lst --alphabet alphabet.lst --charconfus confusion.lst --clip 1 --distance 2 --clip 1 --pdfhandling single -with-trace >ticcl.nextflow.out.log 2>ticcl.nextflow.err.log
[ticcl] Nextflow standard error output
-------------------------------------------------
[ticcl] Nextflow standard output
-------------------------------------------------
N E X T F L O W ~ version 0.30.0
Launching `/vol1/lamachine/src/PICCL/ticcl.nf` [silly_swanson] - revision: 4d24a17dc3
--------------------------
TICCL Pipeline
--------------------------
[warm up] executor > local
[c0/f4c6c6] Submitted process > txt2folia (2)
[02/38ac7e] Submitted process > txt2folia (1)
[62/60b850] Submitted process > corpusfrequency (1)
[2e/93a473] Submitted process > ticclunk (1)
ERROR ~ Error executing process > 'txt2folia (1)'
Caused by:
Missing output file(s) `havelaar.folia.xml` expected by process `txt2folia (1)`
Command executed:
set +u
if [ ! -z "" ]; then
source /bin/activate
fi
set -u
FoLiA-txt --class OCR -t 1 -O . "havelaar.txt"
Command exit status:
0
Command output:
(empty)
Command error:
nu useful data found in document:'havelaar'
skipped!
Work dir:
/vol1/lamachine/piccl.clam/projects/j.de.does@umail.leidenuniv.nl/WeerEenHavelaar/work/02/38ac7e622706e0225f9b7c9c540a57
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
-- Check '.nextflow.log' file for details
WARN: Killing pending tasks (1)
[ticcl] Nextflow trace summary
-------------------------------------------------
task_id hash native_id name status exit submit duration realtime %cpu rss vmem rchar wchar
2 c0/f4c6c6 35607 txt2folia (2) COMPLETED 0 2018-11-26 10:20:27.490 301ms 42ms 0,0% 0 0 0 0
3 62/60b850 35777 corpusfrequency (1) COMPLETED 0 2018-11-26 10:20:27.835 257ms 37ms 0,0% 0 0 0 0
1 02/38ac7e 35620 txt2folia (1) FAILED 0 2018-11-26 10:20:27.524 5.7s 3.7s 102,0% 37 MB 2208 MB 7878 KB 0
4 2e/93a473 35932 ticclunk (1) ABORTED - 2018-11-26 10:20:28.111 - - - - - - -
[CLAM Dispatcher] Process ended (2018-11-26 10:20:33, 10.01889s)
[CLAM Dispatcher] Removing temporary files
[CLAM Dispatcher] Status code out of range (256), setting to 127
[CLAM Dispatcher] Finished (2018-11-26 10:20:33), exit code 127, dispatcher wait time 10.0s, duration 10.019817s
Confirmed, it seems FoLiA-txt chokes on that file for some reason. But with the latest development version of ticcltools it runs fine, so it seems @kosloot already fixed whatever the problem was. So the question becomes: when do @martinreynaert and @kosloot deem ticcltools ready for release?
(correction, it's part of foliautils and not ticcltools)
A new release of foliautils has been published. It should be available in LaMachine very soon.
@JessedeDoes Should be solved by the foliautils release, also deployed in Nijmegen.
(A problem was still found when TICCL was disabled; a fix has been committed but not yet released or deployed.)
We are trying to finish the INT installation.
Can this be released?
Ah yes, this has been released since v0.7.2; I forgot to mention it here.
(Closing this issue; it should have been solved for a while already.)
Moved from proycon/LaMachine#37, by @mathias3
@martinreynaert Can you replicate this problem and suggest a remedy?