=pod
=encoding utf8
=head1 NAME
tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML
=head1 SYNOPSIS
  cat corpus.i5.xml | tei2korapxml - > corpus.korapxml.zip
=head1 DESCRIPTION
C
This program is usually called from inside another script.
=head1 FORMATS
=head2 Input restrictions
=over 2
=item
TEI P5 formatted input with certain restrictions:
=over 4
=item
B
=item
B
=back
=item
Tokens inside the primary text must not be
separated by newlines only, because newlines are removed
(see L
=item
Header types, like C<E
=back
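The effect of the newline restriction above can be illustrated with a minimal shell sketch. It is not the program's own code: C<tr> merely stands in for the newline removal, and the function name C<strip_newlines> is hypothetical.

```shell
# Illustration only: tr stands in for the newline removal described above.
# strip_newlines is a hypothetical helper, not part of tei2korapxml.
strip_newlines() {
  tr -d '\n'
}

# Two tokens separated only by a newline end up fused into one string.
printf 'first\nsecond\n' | strip_newlines
```

Tokens that should remain distinct therefore need a space (or other non-newline separator) between them in the input.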
=head2 Notes on the output
=over 2
=item
zip file output (default on C
=back
=head1 INSTALLATION
C
  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git
In case everything went well, the C
Minimum requirement for L
=head1 OPTIONS
=over 2
=item B<--input|-i>
The input file to process. If no specific input is defined and a single
dash C<-> is passed as an argument, data is read from C
=item B<--root|-r>
The root directory for output. Defaults to C<.>.
=item B<--help|-h>
Print help information.
=item B<--version|-v>
Print version information.
=item B<--tokenizer-korap|-tk>
Use the standard KorAP/DeReKo tokenizer.
=item B<--tokenizer-internal|-ti>
Tokenize the data using two embedded tokenizers,
that will take an I
=item B<--tokenizer-call|-tc>
Call an external tokenizer process that reads from STDIN and outputs the offsets of all tokens.
Texts are separated using C<\x04\n>. The external process should end its output for each text with a newline.
If the L</--use-tokenizer-sentence-splits> option is activated, sentence boundaries are also given as offsets, on separate lines.
To use L<Datok|https://github.com/KorAP/Datok> including sentence
splitting, call C
  $ cat corpus.i5.xml | tei2korapxml -s \
      -tc 'datok tokenize \
      -t ./tokenizer.matok \
      -p --newline-after-eot --no-sentences \
      --no-tokens --sentence-positions -' - \
      > corpus.korapxml.zip
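To make the protocol more concrete, the following is a rough sketch of a stand-in tokenizer written with C<awk>. Several points are assumptions and not confirmed by this documentation: that each text arrives on a single line ending in C<\x04>, that offsets are character positions printed as space-separated start/end pairs, and that splitting on whitespace is an acceptable tokenization. The function name C<whitespace_offsets> is hypothetical.

```shell
# Hypothetical stand-in for an external tokenizer (not part of tei2korapxml).
# Assumptions: one text per input line, terminated by \x04; output is one
# line per text with space-separated "start end" offsets of each token.
whitespace_offsets() {
  awk '
  {
    sub(/\004$/, "")                 # drop the end-of-text marker
    line = $0; pos = 0; out = ""
    while (match(line, /[^ \t]+/)) { # next run of non-whitespace
      start = pos + RSTART - 1       # zero-based start offset
      end = start + RLENGTH          # exclusive end offset
      out = out (out == "" ? "" : " ") start " " end
      pos = end
      line = substr(line, RSTART + RLENGTH)
    }
    print out                        # one output line per text
  }'
}

# Two texts, each followed by the \x04\n separator.
printf 'Der alte Mann\004\nEin Text\004\n' | whitespace_offsets
```

Under the same assumptions, a script wrapping such a function could be passed to C<--tokenizer-call> in place of the Datok command. Note that some C<awk> implementations count bytes instead of characters, so offsets for non-ASCII text may differ.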
=item B<--skip-inline-tokens>
Boolean flag indicating that inline tokens should not be processed. Defaults to false (meaning inline tokens will be processed).
=item B<--skip-inline-token-annotations>
Boolean flag indicating that inline token annotations should not be processed. Defaults to true (meaning inline token annotations won't be processed).
=item B<--skip-inline-tags>
Expects a comma-separated list of tags to be ignored when the structure is parsed. The content of these tags, however, will still be processed.
=item B<--xmlid-to-textsigle> E<lt>from-regexE<gt>@E<lt>to-c/to-d/to-tE<gt>
Expects a regular expression replacement rule, with the search pattern and the replacement separated by B<@>, that converts text id attributes to text sigles consisting of three parts (separated by B</>).
Example:
  tei2korapxml \
    --xmlid-to-textsigle 'ICC.German.([^.]+.[^.]+).(.+)@ICCGER/$1/$2' \
    -tk - < t/data/icc_german_sample.p5.xml
Converts text id C
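The kind of rewrite this option performs can be approximated with C<sed>, as in the sketch below. This is only an analogy: C<sed> uses C<\1> and C<\2> where the option uses C<$1> and C<$2>, its regular expression dialect differs from Perl's, and both the sample id and the function name C<to_sigle> are assumptions made for illustration.

```shell
# Approximation with sed; tei2korapxml itself applies the rule internally.
# to_sigle and the sample id below are hypothetical.
to_sigle() {
  printf '%s\n' "$1" |
    sed -E 's@ICC.German.([^.]+.[^.]+).(.+)@ICCGER/\1/\2@'
}

# The first capture group becomes the document part, the second the text part.
to_sigle 'ICC.German.DeReKo.WPD17.G11.00238'
```

With this input, the first group captures C<DeReKo.WPD17> and the second captures C<G11.00238>, so the three sigle parts are joined by B</> after the fixed corpus part C<ICCGER>.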
=item B<--inline-tokens>
Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C
The inline token data will also be stored in the inline structures file (see I<--inline-structures>), unless the inline token foundry is prefixed with an exclamation mark (B<!>), which indicates that inline tokens are stored exclusively in the inline tokens file.
Example:
  tei2korapxml --inline-tokens '!gingko#morpho' < data.i5.xml > korapxml.zip
=item B<--inline-structures>
Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C
=item B<--base-foundry>
Define the base foundry to store newly generated
token information in.
Defaults to C
=item B<--data-file>
Define the file (without extension) to store primary data information in. Defaults to C.
=item B<--header-file>
Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C
=item B<--use-tokenizer-sentence-splits|-s>
Use the sentence boundary information provided by the tokenizer, replacing existing sentence boundaries or adding them where none exist. Currently, the KorAP tokenizer and certain external tokenizers support these boundaries.
=item B<--tokens-file>
Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C
=item B<--log|-l>
Loglevel for I
=back
=head1 ENVIRONMENT VARIABLES
=over 2
=item B
Activate minimal debugging.
Defaults to C
=back
=head1 COPYRIGHT AND LICENSE
Copyright (C) 2021-2024, L<IDS Mannheim|https://www.ids-mannheim.de/>
Author: Peter Harders
Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober
L
This program is free software published under the L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.
=cut