LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

frog.nf cannot find frog xml output #29

Closed peterdekker closed 6 years ago

peterdekker commented 6 years ago

I am running Frog as part of the LaMachine distribution. When I run the following command:

$ nextflow run LanguageMachines/PICCL/frog.nf --inputdir ticcl_output/ --inputformat folia --extension folia.xml --skip=acmpn --outputdir frog_output

(The result is the same without --inputformat and --outputdir, or with --extension xml.)

I get the following error:

N E X T F L O W  ~  version 0.29.0
Launching `LanguageMachines/PICCL` [disturbed_watson] - revision: c12599e479 [master]
WARN: The config file defines settings for an unknown process: indexer
----------------------------------
Frog pipeline
----------------------------------
WARN: `params.inputformat` is defined multiple times -- Assignments following the first are ignored
[warm up] executor > local
[7b/3e1554] Submitted process > frog_folia2folia (1)
ERROR ~ Error executing process > 'frog_folia2folia (1)'

Caused by:
  Missing output file(s) `*.xml` expected by process `frog_folia2folia (1)`

Command executed:

  set +u
        if [ ! -z "/vol1/lamachine" ]; then
            source /vol1/lamachine/bin/activate
        fi
        set -u

        opts=""
        if [ ! -z "acmpn" ]; then
    opts="--skip=acmpn"
  fi

        #move input files to separate staging directory
        mkdir input
        mv *.xml input/

        #output will be in cwd
        frog $opts --inputclass "current" --outputclass "current" --xmldir "." --threads 1 --nostdout --testdir input/ -x

Command exit status:
  0

Command output:
  (empty)

Command error:
  frog-:Mon May  7 15:00:27 2018 done with sentence[6574]
  frog-:Mon May  7 15:00:27 2018 done with sentence[6575]
  frog-:Mon May  7 15:00:27 2018 done with sentence[6576]
  frog-:Mon May  7 15:00:27 2018 done with sentence[6577]
  frog-:Mon May  7 15:00:27 2018 done with sentence[6578]
  frog-:Mon May  7 15:00:27 2018 done with sentence[6579]
  frog-:Mon May  7 15:00:27 2018 done with sentence[6580]
  frog-:tokenisation took:  21 seconds, 78 milliseconds and 152 microseconds
  frog-:CGN tagging took:   300 seconds, 614 milliseconds and 635 microseconds
  frog-:Mblem took:         4 seconds, 876 milliseconds and 835 microseconds
  frog-:Frogging in total took: 308 seconds, 694 milliseconds and 783 microseconds
  frog-:resulting FoLiA doc saved in ./img.ticcl.folia.xml
  frog-:Mon May  7 15:00:37 2018 Frogging input/img_de_nederlander_1850_ddd_000013854.ticcl.folia.xml
  frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
  frog-:Mon May  7 15:00:37 2018 process 29 sentences
  frog-:Mon May  7 15:00:37 2018 done with sentence[1]
  frog-:Mon May  7 15:00:37 2018 done with sentence[2]
  frog-:Mon May  7 15:00:37 2018 done with sentence[3]
  frog-:Mon May  7 15:00:37 2018 done with sentence[4]
  frog-:Mon May  7 15:00:38 2018 done with sentence[5]
  frog-:Mon May  7 15:00:38 2018 done with sentence[6]
  frog-:Mon May  7 15:00:38 2018 done with sentence[7]
  frog-:Mon May  7 15:00:38 2018 done with sentence[8]
  frog-:Mon May  7 15:00:38 2018 done with sentence[9]
  frog-:Mon May  7 15:00:38 2018 done with sentence[10]
  frog-:Mon May  7 15:00:38 2018 done with sentence[11]
  frog-:Mon May  7 15:00:38 2018 done with sentence[12]
  frog-:Mon May  7 15:00:38 2018 done with sentence[13]
  frog-:Mon May  7 15:00:38 2018 done with sentence[14]
  frog-:Mon May  7 15:00:38 2018 done with sentence[15]
  frog-:Mon May  7 15:00:38 2018 done with sentence[16]
  frog-:Mon May  7 15:00:38 2018 done with sentence[17]
  frog-:Mon May  7 15:00:38 2018 done with sentence[18]
  frog-:Mon May  7 15:00:38 2018 done with sentence[19]
  frog-:Mon May  7 15:00:38 2018 done with sentence[20]
  frog-:Mon May  7 15:00:38 2018 done with sentence[21]
  frog-:Mon May  7 15:00:38 2018 done with sentence[22]
  frog-:Mon May  7 15:00:39 2018 done with sentence[23]
  frog-:Mon May  7 15:00:39 2018 done with sentence[24]
  frog-:Mon May  7 15:00:39 2018 done with sentence[25]
  frog-:Mon May  7 15:00:39 2018 done with sentence[26]
  frog-:Mon May  7 15:00:39 2018 done with sentence[27]
  frog-:Mon May  7 15:00:39 2018 done with sentence[28]
  frog-:Mon May  7 15:00:40 2018 done with sentence[29]
  frog-:tokenisation took:  0 seconds, 89 milliseconds and 363 microseconds
  frog-:CGN tagging took:   2 seconds, 537 milliseconds and 471 microseconds
  frog-:Mblem took:         0 seconds, 16 milliseconds and 267 microseconds
  frog-:Frogging in total took: 2 seconds, 562 milliseconds and 107 microseconds
  frog-:resulting FoLiA doc saved in ./img_de_nederlander_1850_ddd_000013854.ticcl.folia.xml
  frog-:Mon May  7 15:00:40 2018 Frog finished

Work dir:
  /home/piccl/work/7b/3e1554352860cce59cbca95673db69

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

It seems that the Nextflow script cannot find the XML output from Frog. The problem appears to be at lines 72 and 117 of frog.nf (https://github.com/LanguageMachines/PICCL/blob/master/frog.nf#L72), where the output is defined using a wildcard. An earlier version of frog.nf, where the output is defined more explicitly, runs without errors: https://github.com/LanguageMachines/PICCL/commit/b4e05a044d6ae4037c7e435fe26dbb5f6c700f72#diff-b1623eb35be7cba58a6c27b0a3e54453R57

peterdekker commented 6 years ago

Thanks for looking into this. Is there any news on a solution yet? Or are there possible solutions I could try out myself?

proycon commented 6 years ago

Sorry it took a while. I'm looking into this now and just managed to replicate the issue.

proycon commented 6 years ago

Okay, found it... Nextflow excludes from a process's output any files that are named exactly like its input files, so that's where things went wrong. The above commit should fix it; I will do a release right away.
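To see which files are affected by such a clash, a minimal shell sketch like the one below lists the output files whose names coincide with staged inputs. It assumes the layout from the command shown earlier in this thread (inputs moved to input/, Frog writing to the current directory). The function name `list_clashes` is hypothetical and is not part of PICCL or Nextflow.

```shell
#!/bin/sh
# Hypothetical helper (not part of PICCL or Nextflow):
# print the name of every *.xml file in the input directory ($1)
# that also exists, under exactly the same name, in the output directory ($2).
list_clashes() {
    in_dir=$1
    out_dir=$2
    for f in "$in_dir"/*.xml; do
        [ -e "$f" ] || continue      # the glob matched nothing
        name=$(basename "$f")
        if [ -e "$out_dir/$name" ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

Run from inside a process work directory, `list_clashes input .` would print one line per output file that shares its name with an input file.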

peterdekker commented 6 years ago

Great, thanks much!

peterdekker commented 6 years ago

With the new fix, the Nextflow script exits without errors, and says it has created files:

Frog output document written to frog_output/356417.ticcl.folia.xml

etc.

However, the frog_output directory is actually empty. I do see an output directory with FoLiA files in work/, so I guess that copying the directory to the right location goes wrong.

EDIT: The error is in the publishDir lines. It works when I change these to:

publishDir params.outputdir, pattern: "output/*.xml", mode: 'copy', overwrite: true

The only remaining problem is that a redundant subdirectory output/ is created inside frog_output.

Also, I was wondering, when invoking frog in the text2folia function, shouldn't there be a directory argument after --testdir? https://github.com/LanguageMachines/PICCL/blob/master/frog.nf#L93

peterdekker commented 6 years ago

@proycon Could this issue be re-opened, based on the new information in my last comment?

proycon commented 6 years ago

Ah right, I forgot to adapt publishDir after that last fix... Now I wonder if Nextflow has an option to prevent that redundant output/ dir, or if I need to solve that in yet another task...

proycon commented 6 years ago

> Also, I was wondering, when invoking frog in the text2folia function, shouldn't there be a directory argument after --testdir? https://github.com/LanguageMachines/PICCL/blob/master/frog.nf#L93

Yes? There is; the directory is input/

peterdekker commented 6 years ago

Oops, my bad: the end of the line was cut off in the GitHub view :/

Regarding the directory issue, would the following be possible? Inside the script, create a directory with the name params.outputdir and use that for frog output. Then, when invoking publishDir, match this directory name and move it to the current directory (instead of to params.outputdir).

proycon commented 6 years ago

I implemented a different solution: the output files now have "frogged" in their filename (*.frogged.folia.xml), so they don't clash with the input.
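The renaming idea can be sketched in shell roughly as follows. This is not the actual change made in frog.nf. It only illustrates how a `*.frogged.folia.xml` name could be derived from an input name using parameter expansion. The function name `frogged_name` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not taken from frog.nf):
# map an input filename to an output name that cannot clash with it.
frogged_name() {
    base=$(basename "$1")
    case $base in
        # strip the known extension, then append the new suffix
        *.folia.xml) printf '%s\n' "${base%.folia.xml}.frogged.folia.xml" ;;
        *.xml)       printf '%s\n' "${base%.xml}.frogged.folia.xml" ;;
        *)           printf '%s\n' "$base.frogged.folia.xml" ;;
    esac
}
```

For example, `frogged_name input/356417.ticcl.folia.xml` prints `356417.ticcl.frogged.folia.xml`, so an output pattern matching `*.frogged.folia.xml` never matches an input file.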