LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

PICCL trips over some filenames that can't be converted to an XML NCName ID #61

Closed by proycon 2 years ago

proycon commented 3 years ago

Reported by @dietervu; this affects uploads from the CLARIN Switchboard:

Caused by:
  Process `txt2folia (1)` terminated with an error exit status (6)

Command executed:

  #!/bin/bash
  #set up the virtualenv (bit unelegant currently, but we have to do this for each process to ensure the LaMachine environment works)
  set +u
  if [ ! -z "" ]; then
      source /bin/activate
  fi
  set -u

  FoLiA-txt --class OCR -t 1 -O . "1-009be6ad3eb2c43f5c6d56a91076511816.txt" || exit 1

  if [ ! -s "1-009be6ad3eb2c43f5c6d56a91076511816.folia.xml" ]; then
      echo "ERROR: Expected output 1-009be6ad3eb2c43f5c6d56a91076511816.folia.xml does not exist or is empty">&2
      exit 6
  fi
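
For illustration, the file name in the log starts with a digit, whereas an XML NCName must begin with a letter or an underscore and may not contain a colon, so the name presumably cannot be used as a document ID unchanged. The following is a minimal sketch of one possible workaround, not the fix that went into foliautils: `is_ncname` and `to_ncname` are hypothetical helper names, and the check is deliberately simplified to ASCII (real NCNames also allow many non-ASCII characters).

```shell
# Hypothetical helpers; not part of PICCL or foliautils.

# Simplified, ASCII-only NCName check: the first character must be a
# letter or underscore, the rest letters, digits, '.', '-' or '_'.
is_ncname() {
    case "$1" in
        ''|[!A-Za-z_]*|*[!A-Za-z0-9._-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Derive an ID candidate from a file name: strip the last extension,
# replace disallowed characters with '_', and prefix '_' when the
# result does not start with a letter or underscore.
to_ncname() {
    stem="${1%.*}"
    stem=$(printf '%s' "$stem" | tr -c 'A-Za-z0-9._-' '_')
    case "$stem" in
        [A-Za-z_]*) ;;
        *) stem="_$stem" ;;
    esac
    printf '%s\n' "$stem"
}

to_ncname "1-009be6ad3eb2c43f5c6d56a91076511816.txt"
```

Prefixing rather than dropping the leading digit keeps the original name recoverable from the ID, which may matter when output files are matched back to uploads.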

kosloot commented 3 years ago

This will be fixed in the next release of foliautils.

proycon commented 2 years ago

(This was fixed long ago; I forgot to close the issue.)