TICCLtools is a collection of programs to process text data files towards fully-automatic lexical corpus post-correction. Together they constitute the bulk of TICCL: Text Induced Corpus-Cleanup. This software is usually invoked by the pipeline system PICCL (https://github.com/LanguageMachines/PICCL); consult that project for installation and usage instructions unless you really want to invoke the individual tools manually.
The workflows in PICCL, the Philosophical Integrator of Computational and Corpus Libraries, are schematically visualised here, TICCL being the one to the right:
Preparation for a specific language and its alphabet:
Note: A fairly wide range of language-specific alphabet and character confusion files is available online, so you rarely need to perform this preparatory step yourself.
We have prepared TICCL for work in many languages, mainly on the basis of available open source lexicons due to Aspell. The language-specific files are available here:
Unpack in your main TICCL directory. A subdirectory data/int/ will be created to house the required files for the specific language(s).
Should you want or need to build your own TICCL alphabet and character confusion files, the tool to do that is:
TICCL-lexstat
Note that each extra character allowed to be an actual character used in the language expands the search space for lexical variants. The tool therefore allows you to 'clip' or apply a frequency cut-off to the character frequency list for your particular language.
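To make the idea of a character frequency cut-off concrete, here is a minimal shell sketch. It is not how TICCL-lexstat is implemented and does not use its options; the function name `clip_chars` and the threshold argument are hypothetical, chosen only for illustration. It counts the letters in its input and keeps only those that occur at least a given number of times.

```shell
# Hypothetical illustration of a frequency cut-off on a character list.
# clip_chars THRESHOLD: read text on stdin, print "char count" for every
# letter that occurs at least THRESHOLD times.
clip_chars() {
    grep -o '[[:alpha:]]' |      # emit one letter per line
        sort | uniq -c |         # count occurrences of each letter
        awk -v clip="$1" '$1 >= clip { print $2, $1 }'
}

# 'd' occurs only once in the sample, so a cut-off of 2 drops it.
printf 'banana bandana\n' | clip_chars 2
# a 6
# b 2
# n 4
```

A higher threshold yields a smaller alphabet and therefore a smaller search space for lexical variants, at the cost of treating rare but legitimate characters as noise.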
The actual TICCL post-correction programs in this collection are:
TICCL-chainclean
Post-TICCLtools: actual text editing:
We currently only provide for post-editing of texts based on the list of correction candidates collected by TICCLtools for texts or corpora in FoLiA XML. Please see the FoLiA-utils (https://github.com/LanguageMachines/foliautils) collection for the tool: FoLiA-correct.
We provide containers for simple installation, see the next section. If you want to build and install manually on a Linux/BSD system instead, follow these instructions:
First ensure the following dependencies are installed on your system:
sudo apt install make gcc g++ autoconf automake autoconf-archive libtool autotools-dev libicu-dev libxml2-dev libbz2-dev zlib1g-dev
Then git clone this repository, enter its directory, and build as follows:
$ sudo ./build-deps.sh && ./bootstrap.sh && ./configure && make && sudo make install
If you have no root permissions, set the environment variable PREFIX to the target directory where you want to install (ensure it exists); the one in the following example is a sane default:
$ export PREFIX="$HOME/.local/"
$ ./build-deps.sh && ./bootstrap.sh && ./configure --prefix "$PREFIX" && make && make install
Adjust your environment accordingly so the binaries and libraries in $PREFIX can be found: on Linux, ensure that $PREFIX/lib is added to your $LD_LIBRARY_PATH and $PREFIX/bin to your $PATH.
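As a sketch of that environment adjustment, assuming a POSIX-style shell and the PREFIX value from the example above, the following lines could be added to your shell profile (which profile file to use depends on your setup):

```shell
# Fall back to the example prefix if PREFIX is not already set.
export PREFIX="${PREFIX:-$HOME/.local}"

# Make the installed binaries findable.
export PATH="$PREFIX/bin:$PATH"

# Make the installed shared libraries findable; only append the old value
# (with a separating colon) if LD_LIBRARY_PATH was already non-empty.
export LD_LIBRARY_PATH="$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```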
A pre-made container image can be obtained from Docker Hub as follows:
docker pull proycon/ticcltools
To build a Docker container yourself, make sure you are in the root of this repository and run:
docker build -t proycon/ticcltools .
This builds the latest stable release. If you want to use the latest development version from the git repository instead, run:
docker build -t proycon/ticcltools --build-arg VERSION=development .
Run the container interactively as follows:
docker run -t -i proycon/ticcltools
Or invoke the tool you want:
docker run proycon/ticcltools TICCL-rank
Add the -v /path/to/your/data:/data parameter (before -t) if you want to mount your data volume into the container at /data.
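Combining the volume mount with a tool invocation could look like the sketch below. The wrapper function `ticcl_docker` and the `DOCKER` override variable are hypothetical conveniences, not part of TICCLtools; the override exists only so the assembled command can be inspected without starting a container.

```shell
# Hypothetical wrapper: ticcl_docker DATA_DIR [TOOL [ARGS...]]
# Mounts DATA_DIR at /data inside the container and runs TOOL there.
ticcl_docker() {
    data_dir=$1
    shift
    # Set DOCKER=echo to print the command instead of executing it.
    ${DOCKER:-docker} run -v "$data_dir:/data" proycon/ticcltools "$@"
}

# Dry run: show the command that would be executed.
DOCKER=echo
ticcl_docker /path/to/your/data TICCL-rank
```

With `DOCKER` unset, the same call runs the tool in the container, and any files it needs can be referenced under /data.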