PICCL pipelines need to do better input validation and provide better error/warning messages to the user + general lack of documentation needs to improve

LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

PICCL pipelines need to do better input validation and provide better error/warning messages to the user + general lack of documentation needs to improve #37

Closed willstout closed 4 years ago

willstout commented 6 years ago

I'm trying to run ocr.nf with docker and I'm not sure how the parameters are meant to be used.

So for the --inputdir parameter, we're only supposed to give the folder that contains the images? Does this mean that whatever image files are within that folder are going to be run through the pipeline? And is this file system our normal file system or our docker file system?

So if I want to run a pdf, that's sitting in my desktop folder, through the pipeline, I would run "ocr.nf --inputdir C:\Users\willstout\Desktop --language eng"? Or would I first need to add it to a docker container?

ocr.nf is quite confusing to work with because there isn't a lot of documentation on the whole program. In fact running "ocr.nf --help" does the exact same thing as "ocr.nf". Additionally if I wanted to purposefully run something wrong just to see what error I would be given, the program will run the same as if nothing is wrong. For instance running "ocr.nf --inputdir" and not giving it a specified directory sends me back to the starting point of the OCR pipeline. Running with a specified directory just tells me

N E X T F L O W  ~  version 0.30.2
Launching `/usr/local/bin/ocr.nf` [desperate_newton] - revision: 76d7839f83
--------------------------
OCR Pipeline
--------------------------
[warm up] executor > local
lamachine@eab8a83a33ea:~$

And running this with a directory that doesn't exist gives that same output. So there's no way to tell if what I am doing is correct.

proycon commented 6 years ago

PICCL documentation is unfortunately indeed in a rather minimal state; all there is currently is the README in this repository, which should give some examples. The --help is also complete, but it may be a bit cryptic. Proper PICCL documentation is something for @martinreynaert (the project lead) to pick up when he has time, hopefully.

So if I want to run a pdf, that's sitting in my desktop folder..... Or would I first need to add it to a docker container?

Yes, it needs to be put in a spot that is shared between container and host system, you can't reference paths on the host system from within the container. Consult the docker documentation regarding volumes, at https://docs.docker.com/storage/volumes/ to learn how to deal with sharing data.
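As a rough illustration of the sharing described above, the sketch below only prints a `docker run` command that bind-mounts a host folder into the container with `-v`; it does not start anything. The image name `proycon/lamachine:latest`, the mount point `/data`, and the helper name `share_cmd` are assumptions made for this example and are not confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a docker command that bind-mounts
# a host directory into the container, so files in it are visible inside.
# Assumptions: image name proycon/lamachine:latest, mount point /data.
share_cmd() {
    host_dir=$1                 # directory on the host holding the scans
    container_dir=${2:-/data}   # where it should appear inside the container
    printf 'docker run -t -i -v "%s:%s" proycon/lamachine:latest\n' \
        "$host_dir" "$container_dir"
}

share_cmd "$HOME/scans"
```

Inside a container started that way, the pipeline would be pointed at the mounted path rather than the host path, for example `ocr.nf --inputdir /data --language eng`.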

For instance running "ocr.nf --inputdir" and not giving it a specified directory sends me back to the starting point of the OCR pipeline.

Good point, there should be more checks and more helpful error messages implemented.

For instance running "ocr.nf --inputdir" and not giving it a specified directory sends me back to the starting point of the OCR pipeline.

It probably didn't find the input directory, or any PDF files in it. A message would have been nice, yes; definite points for improvement. I'll make that the topic for this issue.
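A minimal sketch of the kind of checks this refers to, written as a shell wrapper that could be run before calling a pipeline. It is not PICCL's actual implementation; the function name `check_inputdir` and the message wording are hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-flight check, NOT PICCL's own code.
# Usage: check_inputdir DIRECTORY [EXTENSION]   (extension defaults to pdf)
check_inputdir() {
    dir=$1
    ext=${2:-pdf}
    if [ -z "$dir" ]; then
        echo "Error: no input directory given (--inputdir DIRECTORY)" >&2
        return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "Error: input directory does not exist or is not a directory: $dir" >&2
        return 1
    fi
    # Succeed as soon as one file with the expected extension is found.
    for f in "$dir"/*."$ext"; do
        [ -e "$f" ] && return 0
    done
    echo "Error: no *.$ext files found in $dir" >&2
    return 1
}

# Demo: a path that does not exist is reported instead of silently ignored.
check_inputdir "/no/such/directory" || echo "validation failed, as intended"
```

Each of the three silent failures reported in this issue (missing argument, missing directory, directory without matching files) gets its own message here, which is the behaviour being asked for.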

willstout commented 6 years ago

Also it's not really specified where the ocr_output directory is. Where is that?

proycon commented 6 years ago

Since it's a relative path, it will be created in your current working directory. You can set any other path, absolute or relative with the --outputdir parameter.

willstout commented 6 years ago

I'm having issues with this, I have my pdf that I first copy into my docker file system. I successfully do that. Then I run "ocr.nf --inputdir home/lamachine/test.pdf --language eng --outputdir home/lamachine". It appears that I'm successful, there are no error warnings. But things don't seem right.

Prior to me running ocr.nf I had two things in my lamachine directory, bin and test.pdf. Afterwards, I now have three things, bin, test.pdf, and work. This new work directory seems like a good place for my new test.folia.xml file. However, there is nothing within this directory. After some more looking around, I still can't find it. I have no idea where this file ended up, or if it was created at all. Do you have any idea what could be up?

Also I'm not seeing an ocr_output directory in my current working directory. And when providing two different output directories (I tried home/lamachine and home), only one gets that "work" directory, and no matter which output directory I put, the work directory always shows up in my current working directory. I have no idea what's going on.

A similar failure happens with tokenize.nf:

lamachine@latest:~$ tokenize.nf --inputdir home/lamachine/defs.txt --language eng --outputdir home/lamachine
N E X T F L O W  ~  version 0.30.2
Launching `/usr/local/bin/tokenize.nf` [friendly_davinci] - revision: 553cb21987
----------------------------------
Tokenisation Pipeline using ucto
----------------------------------
[warm up] executor > local
lamachine@latest:~$ dir
bin  defs.txt  work
lamachine@latest:~$ dir bin
lamachine-activate  lamachine-latest-activate  lamachine-latest-update  lamachine-update  lamachine-update.sh
lamachine@latest:~$ dir work
lamachine@latest:~$

I'm just not seeing where any of the files end up

willstout commented 6 years ago

Still having this issue

proycon commented 6 years ago

You specified home/lamachine (a relative path!!) instead of /home/lamachine (an absolute path), so the system doesn't find the input and doesn't do much (and the warning messages I implemented as per this issue are not in the stable release yet so you don't notice that is what went wrong here).
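The difference between the two forms can be sketched with a small shell function. The name `to_abs` is made up for this illustration; it only mimics how a path is interpreted and does not check whether anything exists.

```shell
#!/bin/sh
# Show how a path is interpreted: absolute paths start with "/", anything
# else is resolved against the current working directory ($PWD).
to_abs() {
    case $1 in
        /*) printf '%s\n' "$1" ;;                 # already absolute
        *)  printf '%s/%s\n' "${PWD%/}" "$1" ;;   # relative: prefix with $PWD
    esac
}

to_abs home/lamachine    # relative: the result depends on where you run it
to_abs /home/lamachine   # absolute: always /home/lamachine
```

Run from /home/lamachine (the `~` in the prompts above), the relative form resolves to /home/lamachine/home/lamachine, which does not exist in the directory listing shown.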

willstout commented 6 years ago

(Sorry to keep pestering you with this issue) I just can't get it to work, relative or absolute, output parameter or not.

lamachine@2d5a7c62974f:~$ dir
Fire-o.tiff  bin
lamachine@2d5a7c62974f:~$ ocr.nf --input /home/lamachine/Fire-o.tiff --language eng --outputdir /home/lamachine
N E X T F L O W  ~  version 0.30.2
Launching `/usr/local/bin/ocr.nf` [jovial_edison] - revision: 76d7839f83
--------------------------
OCR Pipeline
--------------------------
Usage:
  ocr.nf [PARAMETERS]

Mandatory parameters:
  --inputdir DIRECTORY     Input directory
  --language LANGUAGE      Language (iso-639-3)

Optional parameters:
  --inputtype STR          Specify input type, the following are supported:
          pdf (extension *.pdf)  - Scanned PDF documents (image content) [default]
          tif ($document_$sequencenumber.tif)  - Images per page (adhere to the naming convention!)
          jpg ($document_$sequencenumber.jpg)  - Images per page
          png ($document_$sequencenumber.png)  - Images per page
          gif ($document_$sequencenumber.gif)  - Images per page
          djvu (extension *.djvu)
          (The underscore delimiter may optionally be changed using --seqdelimiter)
  --outputdir DIRECTORY    Output directory (FoLiA documents) [default: /home/lamachine]
  --virtualenv PATH        Path to Python Virtual Environment to load (usually path to LaMachine)
  --pdfhandling reassemble Reassemble/merge all PDFs with the same base name and a number suffix; this can
                           for instance reassemble a book that has its chapters in different PDFs.
                           Input PDFs must adhere to a $document_$sequencenumber.pdf convention.
                           (The underscore delimiter may optionally be changed using --seqdelimiter)
  --seqdelimiter           Sequence delimiter in input files (defaults to: _)
  --seqstart               What input field is the sequence number (may be a negative number to count from the end), default: -2
lamachine@2d5a7c62974f:~$ dir
Fire-o.tiff  bin  work
lamachine@2d5a7c62974f:~$ dir work
lamachine@2d5a7c62974f:~$ dir bin
lamachine-activate  lamachine-latest-activate  lamachine-latest-update  lamachine-update  lamachine-update.sh
lamachine@2d5a7c62974f:~$

The work directory has nothing within it, and as far as I can tell it's the only thing changed or created from running ocr.nf

proycon commented 6 years ago

I just released the new PICCL that contains more input validation (as per this issue), though it is still not ideal. Considering that you keep running into problems related to input parameters/files, can you give it a try whether the new messages make it any clearer for you? (You'll need to update your LaMachine)

As to the above problem, you used a non-existing parameter (--input) with a filename, but (--inputdir, with a directory) is required. The new version should give a decent warning now and not leave you clueless.
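One way to work with a single file when a directory is required is to copy it into a directory of its own first. The sketch below assumes that approach; the helper name `stage_input` is hypothetical and not part of PICCL.

```shell
#!/bin/sh
# Hypothetical helper: copy one file into a fresh directory and print that
# directory, so it can be passed to --inputdir (which expects a directory).
stage_input() {
    file=$1
    [ -f "$file" ] || { echo "Error: not a file: $file" >&2; return 1; }
    dir=$(mktemp -d) || return 1
    cp "$file" "$dir"/ || return 1
    printf '%s\n' "$dir"
}

# Example (commented out because it needs a real file and PICCL installed):
# ocr.nf --inputdir "$(stage_input /home/lamachine/Fire-o.pdf)" --language eng
```

Using a dedicated directory also avoids picking up unrelated files that happen to sit next to the document.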

willstout commented 6 years ago

Whoops, --input was an accident, however that issue persists. I've looked over my code and made sure that I haven't made any of the mistakes you went over either. The notes were good though, I realized I can't use the path to the exact file as the input directory, I've probably been doing that wrong the whole time.

lamachine@de19b96ccfa7:~$ ocr.nf --inputdir /home/lamachine/fire-o.tif --language eng --outputdir /home/lamachine/work
N E X T F L O W  ~  version 0.30.2
Launching `/usr/local/bin/ocr.nf` [romantic_kalam] - revision: f1ea3c4d7b
--------------------------
OCR Pipeline
--------------------------
Error: Specified input directory does not exist
lamachine@de19b96ccfa7:~$ ocr.nf --inputdir /home/lamachine --language eng --outputdir /home/lamachine/work
N E X T F L O W  ~  version 0.30.2
Launching `/usr/local/bin/ocr.nf` [insane_goldstine] - revision: f1ea3c4d7b
--------------------------
OCR Pipeline
--------------------------
[warm up] executor > local
lamachine@de19b96ccfa7:~$ dir
Fire-o.tiff  bin  work
lamachine@de19b96ccfa7:~$ dir work
lamachine@de19b96ccfa7:~$ dir bin
lamachine-activate  lamachine-latest-activate  lamachine-latest-update  lamachine-update  lamachine-update.sh

willstout commented 6 years ago

Okay so good news and bad news. I got it working, but it only seems to work with pdf files, not tif files (I only tested those two file types). During the first run of ocr.nf I had only Fire-o.tiff in my /home/lamachine directory, and I also set the --inputtype parameter to tif. Nothing happened. But after the first run I copied a pdf version of that tiff file into /home/lamachine and removed the --inputtype parameter. Then it started working, weird but good.

lamachine@de19b96ccfa7:~$ ocr.nf --inputdir /home/lamachine --language eng --inputtype tif --outputdir /home/lamachine/
N E X T F L O W  ~  version 0.30.2
Launching `/usr/local/bin/ocr.nf` [adoring_franklin] - revision: f1ea3c4d7b
--------------------------
OCR Pipeline
--------------------------
[warm up] executor > local
lamachine@de19b96ccfa7:~$ dir
Fire-o.tiff  bin  work
lamachine@de19b96ccfa7:~$ dir work
lamachine@de19b96ccfa7:~$ dir
Fire-o.pdf  Fire-o.tiff  bin  work
lamachine@de19b96ccfa7:~$ ocr.nf --inputdir /home/lamachine --language eng --outputdir /home/lamachine/
N E X T F L O W  ~  version 0.30.2
Launching `/usr/local/bin/ocr.nf` [desperate_bernard] - revision: f1ea3c4d7b
--------------------------
OCR Pipeline
--------------------------
Input document (pdf): /home/lamachine/Fire-o.pdf
[warm up] executor > local
[59/73a75b] Submitted process > pdfimages (1)
[a8/9db76b] Submitted process > bitmap2tif (1)
[b1/ee47a5] Submitted process > tesseract (1)
[96/3b829c] Submitted process > ocrpages_to_foliapages (1)
[3e/f78f7d] Submitted process > foliacat (1)
OCR output document written to /home/lamachine//Fire-o.folia.xml
lamachine@de19b96ccfa7:~$

And for the sake of a little more testing and looking to see if something is up with ocr'ing a tiff file I tried this:

lamachine@f762a969c677:~$ dir
$Fire-o_$0.tiff  Fire-o.pdf  bin
lamachine@f762a969c677:~$ ocr.nf --inputdir /home/lamachine --language eng --inputtype tif --outputdir /home/lamachine
N E X T F L O W  ~  version 0.30.2
Launching `/usr/local/bin/ocr.nf` [happy_mclean] - revision: f1ea3c4d7b
--------------------------
OCR Pipeline
--------------------------
[warm up] executor > local
lamachine@f762a969c677:~$

Even with the correct tiff file notation (I'm pretty sure that's how that's meant to be), ocr.nf doesn't seem to recognize tiffs

proycon commented 6 years ago

(sorry for the delay, my holiday period is starting so I'm more absent the coming weeks)

For tiff the filename indeed needs to correspond to a particular pattern ($ denotes a variable, no need to include that literally in your filename!). Also, as the --help says in the example, it expects the extension tif instead of tiff, it's a bit picky currently.
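To make the naming pattern concrete, here is a small sketch that derives a conforming name from an existing file name. The helper name `page_name` is hypothetical, and numbering pages from 1 is an assumption; this thread does not say which start value the pipeline expects.

```shell
#!/bin/sh
# Hypothetical helper: build a page-image name of the form
# <document>_<sequencenumber>.tif from an existing file name.
# Assumption: numbering pages from 1 is acceptable; the thread does not say.
page_name() {
    base=${1##*/}        # strip any leading directory
    doc=${base%.*}       # strip the extension (.tiff, .tif, ...)
    printf '%s_%s.tif\n' "$doc" "$2"
}

page_name Fire-o.tiff 1    # prints: Fire-o_1.tif
```

A rename such as `mv Fire-o.tiff "$(page_name Fire-o.tiff 1)"` would then produce a file without literal `$` characters and with the `tif` extension.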

proycon commented 4 years ago

(closing this after long inactivity, the situation in the latest release today should be better at least, although it's still not ideal)
