WG2-Sample

Repository for the use of WG2 in preparing their white paper on "Annotating European Novels for Distant Reading".

It should contain a total of 100 samples from each of at least 7 different ELTeC repositories: 5 random passages of 400 whitespace-delimited tokens from each of 20 novels. Headings should be excluded, but poetry should not, and each sample should be a well-formed XML fragment.
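The sampling arithmetic above (5 passages x 20 novels = 100 samples per repository) can be illustrated with a minimal Python sketch. This is not the logic of selector.xsl: it assumes a novel is available as plain text, cuts it into consecutive 400-token windows, and draws 5 of them at random so that passages never overlap. The function name `sample_passages` and its parameters are hypothetical.

```python
import random


def sample_passages(text, n_passages=5, passage_len=400, seed=None):
    """Return n_passages random, non-overlapping passages of passage_len tokens.

    Assumption: tokens are whitespace-delimited, and passages are drawn from
    consecutive fixed-size windows (selector.xsl may choose them differently).
    """
    tokens = text.split()  # whitespace-delimited tokens
    n_windows = len(tokens) // passage_len
    if n_windows < n_passages:
        raise ValueError("text is too short for the requested number of passages")
    rng = random.Random(seed)
    # Pick distinct window indices, kept in document order.
    chosen = sorted(rng.sample(range(n_windows), n_passages))
    return [tokens[i * passage_len:(i + 1) * passage_len] for i in chosen]
```

Calling `sample_passages(novel_text, seed=42)` on each of 20 novels would give the 100 samples expected per repository. Passing a seed makes the selection reproducible.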

Samples were selected using the selector.xsl stylesheet, as follows:

All tagging is removed except for the <p> elements that delimit paragraphs. Each <p> carries an @n attribute that supplies a locator, formed by concatenating the text identifier (the value of TEI/@xml:id) and the paragraph sequence number.
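The flattening and locator scheme described above might look roughly like the following Python sketch. It is an illustration only, not a port of selector.xsl. It assumes the input is a TEI document in the standard TEI namespace, that paragraphs are numbered in document order starting at 1, and that the identifier and sequence number are joined with an underscore; the source does not specify a separator, so `sep` is a hypothetical parameter. Verse encoded outside <p> is not handled here.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def flatten_paragraphs(tei_xml, sep="_"):
    """Return each <p> as a tag-free XML fragment carrying an @n locator.

    Assumptions: numbering starts at 1 in document order; `sep` is a guess,
    since the source only says the two values are concatenated.
    """
    root = ET.fromstring(tei_xml)
    text_id = root.get(XML_ID)  # value of TEI/@xml:id
    fragments = []
    for seq, p in enumerate(root.iter(f"{{{TEI_NS}}}p"), start=1):
        flat = ET.Element("p", {"n": f"{text_id}{sep}{seq}"})
        # Drop all inner tagging, keep the text, normalise whitespace.
        flat.text = " ".join("".join(p.itertext()).split())
        fragments.append(ET.tostring(flat, encoding="unicode"))
    return fragments
```

For a document whose root carries the made-up identifier `ENG0001`, the first paragraph would come out as `<p n="ENG0001_1">...</p>`. Headings are skipped simply because they are not <p> elements.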

Each set of five samples is stored in a file named [text-identifier]_sample.xml. All the files for each language are stored in a directory named for the language.
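As a small illustration of this layout, the path of a sample file could be built as below. The helper name `sample_path` and the example values are hypothetical; in particular, the source does not say whether language directories use codes or full names.

```python
from pathlib import Path


def sample_path(text_id, language_dir):
    """Build [language]/[text-identifier]_sample.xml as a relative path."""
    return Path(language_dir) / f"{text_id}_sample.xml"
```

With the made-up values `sample_path("ENG0001", "eng")`, the result is the relative path `eng/ENG0001_sample.xml`.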

LB 2018-11-19