=pod
=encoding utf8
=head1 NAME
korapxml2krill - Merge KorAP-XML data and create Krill documents
=head1 SYNOPSIS
  $ korapxml2krill [archive|extract] --input <directory|archive> [options]
=head1 DESCRIPTION
L
=head1 INSTALLATION
The preferred way to install L
  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git
In case everything went well, the C
=head1 ARGUMENTS
$ korapxml2krill -z --input
Without arguments, C
=over 2
=item B<archive>
  $ korapxml2krill archive -z --input <directory|archive> --output <directory|tar>
Converts an archive of KorAP-XML documents. It expects a directory (pointing to the corpus level folder) or one or more zip files as input.
=item B<extract>
  $ korapxml2krill extract --input
Extracts KorAP-XML documents from a zip file.
=item B<serial>
  $ korapxml2krill serial -i
Converts archives sequentially. The inputs are not merged but processed as given (so they may be premerged or globs). The C<--out> directory is treated as the base directory, in which subdirectories are created based on the archive name. If the C<--to-tar> flag is given, the output will be a tar file.
=item B<slimlog>
  $ korapxml2krill slimlog
Filters out all non-essential (i.e. successful) information from logs to simplify log checks. Expects no further options.
=back
=head1 OPTIONS
=over 2
=item B<--input|-i> <directory|zip file>
Directory or zip file(s) of documents to convert.
Without arguments, C
C
  -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"
Input may also be defined using BSD glob wildcards.
  -i 'file/news*.zip'
The extended input array will be sorted by length, so the shortest path needs to contain all primary data files and all metadata files.
(The directory structure follows the base directory format, which may include a C<.> root folder. In this case, further archives lacking a C<.> root folder need to be passed with a hash sign in front of the archive's name. This may require quoting the parameter.)
To support zip files, a version of C
B<The root folder switch using the hash sign is experimental and may vanish in future versions.>
=item B<--input-base|-ib>
The base directory for inputs.
=item B<--output|-o> <directory|file>
Output folder for archive processing or
document name for single output (optional),
writes to C
=item B<--overwrite|-w>
Overwrite files that already exist.
=item B<--token|-t>
Define the default tokenization by specifying the name of the foundry and optionally the name of the layer-file. Defaults to C<OpenNLP#tokens>. This will directly take the file instead of running the layer implementation!
=item B<--base-sentences|-bs>
Define the layer for base sentences. If given, this will be used instead of using C<Base#Sentences>. Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional layers supported.
Defaults to unset.
=item B<--base-paragraphs|-bp>
Define the layer for base paragraphs. If given, this will be used instead of using C<Base#Paragraphs>. Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional layers supported.
Defaults to unset.
=item B<--base-pagebreaks|-bpb>
Define the layer for base pagebreaks. Currently C<DeReKo#Structure> is the only layer supported.
Defaults to unset.
=item B<--skip|-s>
Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C
=item B<--anno|-a>
Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C
=item B<--non-word-tokens|-nwt>
Tokenize non-word tokens like word tokens (defined as matching C</[\d\w]/>). Useful to treat punctuation as tokens.
Defaults to unset.
=item B<--non-verbal-tokens|-nvt>
Tokenize non-verbal tokens, marked in the primary data by the Unicode symbol 'Black Vertical Rectangle' (U+25AE).
Defaults to unset.
=item B<--jobs|-j>
Define the number of spawned forks for concurrent jobs of archive processing. Defaults to C<0> (everything runs in a single process).
If C
Pass C<-1>, and the value will be set automatically to 5
times the number of available cores, in case L
This is I
=item B<--job-count|-jc>
Print job and core information that would be used if C<-1> was passed to C<--jobs>.
=item B<--koral|-k>
Version of the output format. Supported versions are: C<0> for legacy serialization, C<0.03> for serialization with metadata fields as key-values on the root object, C<0.4> for serialization with metadata fields as a list of C<"@type":"koral:field"> objects.
Currently defaults to C<0.03>.
=item B<--sequential-extraction|-se>
Flag to indicate, if the C
=item B<--meta|-m>
Define the metadata parser to use. Defaults to C
=item B<--gzip|-z>
Compress the output. Expects a defined C
=item B<--cache|-c>
File to mmap a cache (using L
=item B<--cache-size|-cs>
Size of the cache. Defaults to C<50m>.
=item B<--cache-init|-ci>
Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C
=item B<--cache-delete|-cd>
Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C
=item B<--config|-cfg>
Configure the parameters of your call in a file of key-value pairs separated by whitespace:
  overwrite 1
  token     DeReKo#Structure
  ...
Supported parameters are:
C
Configuration parameters will always be overwritten by passed parameters.
=item B<--temporary-extract|-te>
Only valid for the C
This first extracts all files into a directory and then runs the archive process on it. If the directory is given as C<:temp:>, a temporary directory is used. This is especially useful to avoid massive unzipping and potential network latency.
=item B<--to-tar>
Only valid for the C
Writes the output into a tar archive.
=item B<--sigle|-sg>
Extract the given texts.
Can be set multiple times.
I<Currently only supported on C
=item B<--lang>
Preferred language for metadata fields. In case multiple titles are
given (on any level) with different C
=item B<--log|-l>
The L
=item B<--quiet>
Silence all information (non-log) outputs.
=item B<--help|-h>
Print help information.
=item B<--version|-v>
Print version information.
=back
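The length ordering described for C<--input> can be previewed before a run. The following is a minimal sketch and not part of C<korapxml2krill>: it assumes that "length" refers to the string length of each path, and the helper name C<sort_by_length> is hypothetical. It reuses the archive names from the C<--input> example.

```shell
#!/bin/sh
# sort_by_length: read one path per line on stdin and print the
# shortest path first (assumption: ordering is by string length).
sort_by_length() {
  # Prefix each line with its length, sort numerically, drop the prefix.
  awk '{ print length($0) "\t" $0 }' | sort -n | cut -f2-
}

# The first line printed is the archive that would need to hold
# all primary data and metadata files.
printf '%s\n' file/news.malt.zip file/news.zip file/news.tt.zip | sort_by_length
# prints:
#   file/news.zip
#   file/news.tt.zip
#   file/news.malt.zip
```

If the first path printed is not the base archive, renaming the archives is probably simpler than working around the ordering.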
=head1 PERFORMANCE
There are some ways to improve performance for large tasks:
=over 2
=item First unpack
Using the C<archive> or C<serial> command on one or multiple zip files can be very slow, as it needs to unpack small portions every time. It is better to use C<--temporary-extract> to unpack the whole archive into a temporary directory first and then read the extracted files. This is especially important for remote archives.
=item Limit annotations
By default, all supported annotation layers are sought. This can be limited by adding C<--skip '#ALL'> and listing only the expected annotations with C<--anno>.
=item Checking the parallel job count
By providing the number of parallel jobs using C<--jobs>, the execution can be tailored to specific hardware environments.
=back
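The three hints above can be combined in a single call. The following dry-run sketch only prints such a call instead of executing it. The archive name C<corpus.zip>, the output directory C<out>, the annotation C<Mate#Morpho>, the core count and both helper names are hypothetical; the job count simply mirrors the "5 times the number of cores" rule described for C<--jobs -1>.

```shell
#!/bin/sh
# jobs_for_cores N: the value described for "--jobs -1",
# i.e. 5 times the number of available cores.
jobs_for_cores() {
  echo $(( $1 * 5 ))
}

# build_command INPUT OUTPUT JOBS: print (not run) an archive call that
# unpacks to a temporary directory and converts only one annotation layer.
build_command() {
  printf '%s\n' "korapxml2krill archive -z --input $1 --output $2 --temporary-extract :temp: --skip '#ALL' --anno 'Mate#Morpho' --jobs $3"
}

cores=4   # hypothetical; on Linux this could come from `nproc`
build_command corpus.zip out "$(jobs_for_cores "$cores")"
```

Printing the command first makes it easy to review the flags before starting a long-running conversion.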
=head1 ANNOTATION SUPPORT
L
  Base
    #Sentences
  Connexor
    #Phrase
    #Sentences
    #Syntax
  CoreNLP
    #Morpho
    #NamedEntities
    #Sentences
  CorpusExplorer
  CMC
  DeReKo
  DGD
    #Structure
  DRuKoLa
  Glemm
  Gingko
  HNC
  LWC
  Malt
  MarMoT
  Mate
    #Morpho
  MDParser
  NKJP
    #NamedEntities
  OpenNLP
    #Sentences
  RWK
    #Structure
  Sgbr
    #Morpho
  Spacy
  Talismane
    #Morpho
  TreeTagger
    #Sentences
  UDPipe
    #Morpho
  XIP
    #Morpho
    #Sentences
More importers are in preparation.
New annotation importers can be defined in the C
=head1 METADATA SUPPORT
L
=over 2
=item B<I5>
Meta data for all I5 files
=item B<Sgbr>
Meta data from the Schreibgebrauch project
=item B<Gingko>
Meta data from the Gingko project in addition to I5
=item B<ICC>
Meta data for the ICC in addition to I5
=item B<NKJP>
Meta data for the NKJP corpora
=back
New meta data importers can be defined in the C
The I5 metadata definition is based on TEI-P5 and supports C<E
...
that are directly translated to Krill objects. The supported values are:
=over 2
=item C
=over 4
=item C
String meta data value
=item C
String meta data value that can be given multiple times
=item C
String meta data value that is tokenized and can be searched as token sequences
=item C
Date meta data value (as "yyyy/mm/dd" with optional granularity)
=item C
Numerical meta data value
=item C
Non-indexed meta data value (only retrievable)
=item C
Non-indexed attached URI, takes the desc as the title for links
=back
=item C
The key of the meta object that may be prefixed by C
=item C
A prefixed namespace of the key
=item C
A description of the key
=item text content
The value of the meta object
=back
=head1 About KorAP-XML
KorAP-XML (Bański et al. 2012) is an implementation of the KorAP data model (Bański et al. 2013), where text data are stored physically separated from their interpretations (i.e. annotations). A text document in KorAP-XML therefore consists of several files containing primary data, metadata and annotations.
The structure of a single KorAP-XML document can be as follows:
The C
Metadata is available in the TEI-P5 variant I5
(Lüngen and Sperberg-McQueen 2012). See the documentation in
L
Annotations correspond to a variant of the TEI-P5 feature structures
(TEI Consortium; Lee et al. 2004).
Annotation feature structures refer to character sequences of the primary text
inside the C
The C
Multiple KorAP-XML documents are organized on three levels following
the "IDS Textmodell" (Lüngen and Sperberg-McQueen 2012):
corpus E
A single text can be identified by the concatenation of
the corpus identifier, the document identifier and the text identifier.
This identifier is called the text sigle
(e.g. a text with the identifier C<18486> in the document C<060> in the
corpus C
These corpora are often stored in zip files, with which C
Examples for KorAP-XML files are included in L
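The composition of a text sigle from its three identifiers can be sketched as follows. This is an illustration only: the helper name C<text_sigle> and the corpus identifier C<CORPUS> are hypothetical placeholders, and the slash separator is an assumption that should be checked against the actual corpus data.

```shell
#!/bin/sh
# text_sigle CORPUS DOCUMENT TEXT: join the three identifiers
# (assumption: the parts are separated by slashes).
text_sigle() {
  printf '%s/%s/%s\n' "$1" "$2" "$3"
}

# Text 18486 in document 060 of a placeholder corpus.
text_sigle CORPUS 060 18486
# prints: CORPUS/060/18486
```

A sigle built this way has the shape of the values passed to C<--sigle>, under the same separator assumption.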
=head2 References
Piotr Bański, Cyril Belica, Helge Krause, Marc Kupietz, Carsten Schnober, Oliver Schonefeld, and Andreas Witt (2011): KorAP data model: first approximation, December.
Piotr Bański, Peter M. Fischer, Elena Frick, Erik Ketzan, Marc Kupietz, Carsten Schnober, Oliver Schonefeld and Andreas Witt (2012): "The New IDS Corpus Analysis Platform: Challenges and Prospects", Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012). L<PDF|http://www.lrec-conf.org/proceedings/lrec2012/pdf/789_Paper.pdf>
Piotr Bański, Elena Frick, Michael Hanl, Marc Kupietz, Carsten Schnober and Andreas Witt (2013): "Robust corpus architecture: a new look at virtual collections and data access", Corpus Linguistics 2013. Abstract Book. Lancaster: UCREL, pp. 23-25. L<PDF|https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/4485/file/Ba%c5%84ski_Frick_Hanl_Robust_corpus_architecture_2013.pdf>
Kiyong Lee, Lou Burnard, Laurent Romary, Eric de la Clergerie, Thierry Declerck, Syd Bauman, Harry Bunt, Lionel Clément, Tomaz Erjavec, Azim Roussanaly and Claude Roux (2004): "Towards an international standard on feature structure representation", Proceedings of the fourth International Conference on Language Resources and Evaluation (LREC 2004), pp. 373-376. L<PDF|http://www.lrec-conf.org/proceedings/lrec2004/pdf/687.pdf>
Harald Lüngen and C. M. Sperberg-McQueen (2012): "A TEI P5 Document Grammar for the IDS Text Model", Journal of the Text Encoding Initiative, Issue 3 | November 2012. L<PDF|https://journals.openedition.org/jtei/pdf/508>
TEI Consortium, eds: "Feature Structures", Guidelines for Electronic Text Encoding and Interchange. L<html|https://www.tei-c.org/release/doc/tei-p5-doc/en/html/FS.html>
=head1 AVAILABILITY
https://github.com/KorAP/KorAP-XML-Krill
=head1 COPYRIGHT AND LICENSE
Copyright (C) 2015-2024, L<IDS Mannheim|https://www.ids-mannheim.de/>
Author: L<Nils Diewald|https://www.nils-diewald.de/>
Contributors: Eliza Margaretha, Marc Kupietz
L
This program is free software published under the L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.
=cut