LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

Please make test book available in the PICCL workflow #15

Closed martinreynaert closed 6 years ago

martinreynaert commented 6 years ago

Please amek available the following test book version in the PICCL work flow:

[mreynaert@scootaloo:~]$ ls -l /vol/tensusers/mreynaert/DPO35tiff.tar.gz
-rw-rw-r-- 1 mreynaert mreynaert 1304172529 Feb 5 16:07 /vol/tensusers/mreynaert/DPO35tiff.tar.gz

martinreynaert commented 6 years ago

amek = make

proycon commented 6 years ago

I can add this as an input source, but I'm not entirely sure it won't trip over the file names. Can you try it outside the webservice first? I will plug it in as an input source once it behaves as expected.

proycon commented 6 years ago

(default changed and updated on ponyland, please test outside webservice first)

martinreynaert commented 6 years ago

Something did not work as planned, and I am left clueless.

(lamachine16)[mreynaert@scootaloo:/vol/tensusers/mreynaert/DPO35]$ nextflow run LanguageMachines/PICCL/ocr.nf --inputdir /vol/tensusers/mreynaert/DPO35/TIF/ --language nld DPO35tiff.OCR.20180205.stdout 2>DPO35tiff.OCR.20180205.stderr
N E X T F L O W  ~  version 0.26.4
Launching `LanguageMachines/PICCL` [peaceful_boyd] - revision: f1d6be93b1 [master]
WARN: The config file defines settings for an unknown process: indexer
WARN: The config file defines settings for an unknown process: resolver
WARN: The config file defines settings for an unknown process: rank
WARN: The config file defines settings for an unknown process: foliacorrect -- Did you mean: foliacat?
WARN: The config file defines settings for an unknown process: frog_original
WARN: The config file defines settings for an unknown process: modernize
WARN: The config file defines settings for an unknown process: frog_modernized

--------------------------
OCR Pipeline
--------------------------
[warm up] executor > local
WARN: The `into` operator should be used to connect two or more target channels -- consider to replace it with `.set { pageimages_bitmap }`
WARN: The `into` operator should be used to connect two or more target channels -- consider to replace it with `.set { groupfoliapages }`
(lamachine16)[mreynaert@scootaloo:/vol/tensusers/mreynaert/DPO35]$ ls -l
total 1273628
-rw-rw-r-- 1 mreynaert mreynaert          0 Feb  5 17:28 DPO35tiff.OCR.20180205.stderr
-rw-rw-r-- 1 mreynaert mreynaert 1304172529 Feb  5 16:07 DPO35tiff.tar.gz
drwxrwxr-x 2 mreynaert mreynaert      16384 Feb  5 15:57 TIF
drwxrwxr-x 2 mreynaert mreynaert         10 Feb  5 17:29 work
(lamachine16)[mreynaert@scootaloo:/vol/tensusers/mreynaert/DPO35]$ ls -l work/
total 0
(lamachine16)[mreynaert@scootaloo:/vol/tensusers/mreynaert/DPO35]$

martinreynaert commented 6 years ago

I should obviously have specified --inputtype tif.

martinreynaert commented 6 years ago

I now get an error:

[a6/a8a48d] Submitted process > foliacat (23)
[2b/e0e7cc] Submitted process > foliacat (25)
ERROR ~ Error executing process > 'foliacat (1)'

Caused by:
  Process `foliacat (1)` terminated with an error exit status (1)

Command executed:

  set +u
  if [ ! -z "/vol/customopt/lamachine16" ]; then
      source /vol/customopt/lamachine16/bin/activate
  fi
  set -u

  if [ -f .tif.folia.xml ]; then
      #only one file, nothing to cat
      cp $foliainput dpo_35_0120_master.folia.xml
  else
      foliainput=$(ls -1v *.tif.folia.xml)
      foliacat -i dpo_35_0120_master -o dpo_35_0120_master.folia.xml $foliainput
  fi

Command exit status:
  1

Command output:
  ==============================================================================
             ,              LaMachine - NLP Software distribution 
            ~)                     (https://proycon.github.io/LaMachine)
             (----í         Language Machines research group
              /| |\         & Centre for Language and Speech Technology
             / / /|         Radboud University Nijmegen 
  ==============================================================================

  Available software: CLAM (clamservice), Colibri Core (colibri-patternmodeller),
                      FoLiA Tools (foliavalidator, folia2txt, folia2html, foliaquery etc), 
                      foliadocserve, foliautils (folialint etc),
                      frog, gecco, mbt, mbtserver, ticcltools, timbl, toad (froggen),
                      ucto, wopr

  Python libraries:   pynlpl ucto frog timbl clam colibricore

  Run lamachine-test.sh to test your installation, run lamachine-update.sh to
  update everything (with sudo only if you use Vagrant or Docker).

      (Set LAMACHINE_QUIET=1 prior to activation to suppress this message)

Command error:
  .command.sh: line 10: foliainput: unbound variable

Work dir:
  /vol/tensusers/mreynaert/DPO35/work/03/842ed0d7835ae9ebe34f728b1909c1

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
[88/66c19c] Submitted process > foliacat (27)
[49/d8f766] Submitted process > foliacat (59)
WARN: Killing pending tasks (19)
(lamachine16)[mreynaert@scootaloo:/vol/tensusers/mreynaert/DPO35]$

martinreynaert commented 6 years ago

Here's where .nextflow.log is at:

(lamachine16)[mreynaert@scootaloo:/vol/tensusers/mreynaert/DPO35]$ cat .nextflow.log

I do not understand what it says.
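
The `foliainput: unbound variable` message in the error output above is what bash reports under `set -u` when a variable is expanded before it has been assigned. In the logged command, `$foliainput` is assigned only in the `else` branch, yet the first branch also expands it. The following is a minimal sketch of that failure and of one possible fix, assuming bash. The function names and the file name `page1.tif.folia.xml` are hypothetical, and this is not the actual PICCL code.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction; file and function names are made up for the demo.

workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch page1.tif.folia.xml            # a single input page

# Failing pattern: $foliainput is expanded although nothing assigned it.
# The function body is a subshell, so `set -u` stays local to it.
broken() (
    set -u
    unset foliainput
    cp $foliainput merged.folia.xml
)

# One possible fix: assign the variable first, then branch on the file count.
fixed() (
    set -u
    foliainput=( *.tif.folia.xml )
    if [ "${#foliainput[@]}" -eq 1 ]; then
        # only one file, nothing to concatenate
        cp "${foliainput[0]}" merged.folia.xml
    else
        printf 'would be passed to foliacat: %s\n' "${foliainput[@]}"
    fi
)

if broken 2>/dev/null; then broken_status=0; else broken_status=$?; fi
if fixed; then fixed_status=0; else fixed_status=$?; fi
echo "broken exit status: $broken_status, fixed exit status: $fixed_status"
```

The broken variant exits with a non-zero status before `cp` is ever run, which matches the `Command exit status: 1` reported by Nextflow.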

proycon commented 6 years ago

As mentioned before, in the LaMachine installation on ponyland, run the scripts directly instead of prefixing them with nextflow run. You're running an older version cached by nextflow; you should be able to just have ocr.nf etc. in your path.

martinreynaert commented 6 years ago

That does not seem to work:

(lamachine16)[mreynaert@scootaloo:/vol/tensusers/mreynaert/DPO35]$ ocr.nf --inputtype tif --inputdir /vol/tensusers/mreynaert/DPO35/TIF/ --language nld DPO35tiff.OCR.20180205.BIS.stdout 2>DPO35tiff.OCR.20180205.BIS.stderr
(lamachine16)[mreynaert@scootaloo:/vol/tensusers/mreynaert/DPO35]$ cat DPO35tiff.OCR.20180205.BIS.stderr
ocr.nf: command not found

(lamachine16)[mreynaert@scootaloo:/vol/tensusers/mreynaert/DPO35]$ LanguageMachines/PICCL/ocr.nf --inputtype tif --inputdir /vol/tensusers/mreynaert/DPO35/TIF/ --language nld DPO35tiff.OCR.20180205.stdout 2>DPO35tiff.OCR.20180205.stderr
(lamachine16)[mreynaert@scootaloo:/vol/tensusers/mreynaert/DPO35]$ cat DPO35tiff.OCR.20180205.BIS.stderr
ocr.nf: command not found

proycon commented 6 years ago

It probably got lost in the server upgrade; I have fixed it again now.

martinreynaert commented 6 years ago

Did you upgrade the system? I see no difference, so far:

(lamachine16)[mreynaert@scootaloo:/vol/tensusers/mreynaert/DPO35]$ LanguageMachines/PICCL/ocr.nf --inputtype tif --inputdir /vol/tensusers/mreynaert/DPO35/TIF/ --language nld DPO35tiff.OCR.20180205.stdout 2>DPO35tiff.OCR.20180205.stderr
(lamachine16)[mreynaert@scootaloo:/vol/tensusers/mreynaert/DPO35]$ cat DPO35tiff.OCR.20180205.stderr
-bash: LanguageMachines/PICCL/ocr.nf: No such file or directory

proycon commented 6 years ago

I should have been more explicit I guess, it's just ocr.nf :)
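
The two failures above differ: `command not found` means the shell could not find `ocr.nf` on `PATH`, while `No such file or directory` means the relative path `LanguageMachines/PICCL/ocr.nf` does not exist below the current directory. A small sketch for checking how a command name resolves before launching a long run, assuming bash; `resolve_script` is a hypothetical helper and not part of PICCL or LaMachine:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how the shell resolves a command name,
# or fail with a message if it cannot be found on PATH.
resolve_script() {
    local name=$1
    if command -v "$name" >/dev/null 2>&1; then
        command -v "$name"
    else
        echo "$name: not found on PATH" >&2
        return 1
    fi
}

# Example use (not executed here):
#   resolve_script ocr.nf && ocr.nf --inputtype tif --inputdir TIF/ --language nld
```

If the helper prints a path, the bare `ocr.nf` invocation should work from any directory inside the activated environment.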

martinreynaert commented 6 years ago

It worked, proycon! Thanks!

proycon commented 6 years ago

So that means you want the book included in the webservice, right?

martinreynaert commented 6 years ago

I got stuck in testing the web version due to the *master.tif extension of the test book. Can I access the available corpora on ponyland to rename these files or can you please do this for me?

proycon commented 6 years ago

Ok, so the conclusion is that we strip the suffixes and adhere to the simple naming convention?

I updated the corpus available for the webservice. All other data is in your download file (see download.nf) so within your control.
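
For a local copy of the book, stripping the suffix could be done along the following lines. This is only a sketch: it assumes the files are named like `dpo_35_0120_master.tif` (inferred from the `dpo_35_0120_master` identifier in the log above) and that the desired result is `dpo_35_0120.tif`. The function name `strip_master_suffix` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: drop a trailing "_master" from the base name of
# every .tif file in the given directory.
strip_master_suffix() {
    local dir=$1 f
    for f in "$dir"/*_master.tif; do
        [ -e "$f" ] || continue                 # the glob matched nothing
        mv -n -- "$f" "${f%_master.tif}.tif"    # -n: never overwrite a file
    done
}

# Demo on a throwaway directory with an assumed file name.
demo=$(mktemp -d)
touch "$demo/dpo_35_0120_master.tif"
strip_master_suffix "$demo"
ls "$demo"
```

Running it on a copy of the `TIF` directory first is advisable, since the rename is done in place.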

proycon commented 6 years ago

Closing this; the issues should be resolved. Reopen if the test fails.