TEIC / TEI-Simple

Legacy Repository: TEI SimplePrint now merged into TEI Repository. Originally TEI Simple aimed to define a new highly-constrained and prescriptive subset of the Text Encoding Initiative (TEI) Guidelines suited to the representation of early modern and modern books, a formally-defined set of processing model rules that enable web applications to easily present and analyze the encoded texts, mapping to other ontologies, and processes to describe the encoding status and richness of a TEI digital text.

The behaviour attribute value doesn't specify its parsing. #8

Open buckett opened 9 years ago

buckett commented 9 years ago

There's no specification of how the behaviour attribute's value should be parsed. How should strings, URIs and XPath expressions be quoted?

buckett commented 9 years ago

When attempting to parse a behaviour="cit(.,'uri://something')" it would be good to know how I should parse the arguments.

sebastianrahtz commented 9 years ago

in tei-pm, there should be a datatype for each parameter of a function. That should deal with this? XPaths are not quoted, strings are.

buckett commented 9 years ago

For example how is a " escaped in a string? I'm guessing the existing implementation treats the function as an XSLT function and so the parsing rules are the same as XSLT function parsing rules.

sebastianrahtz commented 9 years ago

um. we have no idea! we don't know how we'd handle that in XSLT.

buckett commented 9 years ago

So are strings assumed to be XML encoded, so a string of "Hello" said the policeman should be written as &quot;Hello&quot; said the policeman ?

sebastianrahtz commented 9 years ago

That doesn't help you, because the XML parser expands the entities into Unicode anyway. I honestly don't know how to deal with this.
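
A minimal sketch of the point about entity expansion, using Python's standard library. The `model` element name, the sample string and the helper names `behaviour_value` and `xpath_string_literal` are illustrative assumptions, not part of TEI Simple. The second half assumes that string arguments follow XPath 2.0 string-literal syntax, where a delimiter inside a literal is escaped by doubling it; the thread does not confirm that the processing model adopts this rule.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def behaviour_value(xml_text):
    """Return the behaviour attribute as the XML parser hands it to the application."""
    return ET.fromstring(xml_text).get("behaviour")


def xpath_string_literal(text, delimiter='"'):
    """Quote text as an XPath 2.0 string literal, doubling any embedded delimiter."""
    if delimiter not in ('"', "'"):
        raise ValueError("delimiter must be a single or double quote")
    return delimiter + text.replace(delimiter, delimiter * 2) + delimiter


# 1. Entity references are expanded before any behaviour parser sees the value,
#    so &quot; cannot act as the escape mechanism at the function-call level.
fragment = '<model behaviour="cit(., &quot;Hello&quot; said the policeman)"/>'
expanded = behaviour_value(fragment)

# 2. Escaping at the XPath level (doubled quotes) survives the XML layer:
#    quoteattr adds whatever XML-level quoting the attribute needs, and the
#    parser removes exactly that layer again.
literal = xpath_string_literal('"Hello" said the policeman')
expression = "cit(., %s)" % literal
round_tripped = behaviour_value("<model behaviour=%s/>" % quoteattr(expression))

print(expanded)
print(round_tripped)
```

The two layers are independent: XML quoting protects the attribute value, while the doubled quotes protect the string inside the function call, so only the second is still visible to whatever parses the behaviour.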

buckett commented 9 years ago

This came about because an XPath expression may contain a comma (I think) so I was thinking about how to parse the function to extract out the 2 XPath expressions for alternate(xpath,xpath)

sebastianrahtz commented 9 years ago

ah. I see where you are going now. and I just met a similar problem. I just wrote behaviour="break('page',if (@n) then @n else @facs)" and it doesn't look right at all.

I am beginning to think we should change this spec to say that the XPath expression should be passed as a string, i.e. surrounded by quotes. Doesn't help with how to pass quotes, but does deal with the embedded comma.
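
The embedded-comma problem can be sketched as follows. This is a hypothetical helper (`split_call` is not part of any TEI tool) that assumes the value has the shape name(arg, arg, ...) and cuts only on commas that sit outside quotes, parentheses and square brackets. It is not an XPath parser: XPath comments are not handled, and an unparenthesised sequence expression such as `a, b` would still be read as two arguments.

```python
def split_call(expr):
    """Split 'name(arg1,arg2)' into (name, [args]), cutting only on top-level commas."""
    name, sep, rest = expr.strip().partition("(")
    if not sep or not rest.endswith(")"):
        raise ValueError("not a function call: %r" % expr)
    body = rest[:-1]

    args, current = [], []
    depth = 0      # nesting level of () and [] inside the argument list
    quote = None   # the quote character we are currently inside, if any
    for ch in body:
        if quote:
            # Inside a string literal: everything is literal text. A doubled
            # quote closes and immediately reopens the literal, which is harmless.
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        elif ch == "," and depth == 0:
            # A comma outside quotes and brackets separates two arguments.
            args.append("".join(current).strip())
            current = []
            continue
        current.append(ch)

    if quote or depth != 0:
        raise ValueError("unbalanced quotes or brackets: %r" % expr)
    tail = "".join(current).strip()
    if tail or args:
        args.append(tail)
    return name.strip(), args


print(split_call("break('page',if (@n) then @n else @facs)"))
print(split_call("alternate(concat(a,b),c)"))
```

Passing each XPath expression as a quoted string, as proposed above, would remove the need to track bracket depth at all, at the cost of having to escape any quotes inside the expression.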

Conal-Tuohy commented 9 years ago

I suggest elements should be used instead of attributes (for behaviour and predicate). Otherwise I think this is going to be a source of endless pain. EDIT: on the other hand, if this stuff will typically be implemented in XSLT etc then perhaps it makes sense to use attributes, so that encoders are forced to write XPath expressions in a way that will work in XSLT, however awkward it may make certain expressions.

Conal-Tuohy commented 9 years ago

Since this is XPath 2, we have the codepoints-to-string() function, but it's not pretty.

"concat(codepoints-to-string(34), 'Hello', codepoints-to-string(34), ' said the policeman')"

sebastianrahtz commented 9 years ago

It's a fair point, Conal. I don't want to change horses mid-race when the problem right now is checking functionality is there, but after we have a stable 1.0 using attributes, it would be a good idea to reconsider the choice of using attributes rather than element children.

martinmueller39 commented 9 years ago

I've compared the TEI Simple DTD with the DTA schema. Simple is more generous than DTA, but DTA has the following elements that Simple does not allow for:

addName, country, foreName, genName, nameLink, orgName, persName, roleName, surname

Should we include them? I can see three different arguments in favour of doing so. First, DTA has been adopted by CLARIN as its base format. Other things being equal, there is a benefit if a text in that format validates under Simple.

Second, and perhaps more substantively, named entity extraction seems to be the chief, and often the only, thing that people are interested in when they work with texts.

Third, when I showed Simple to the Perseus folks, they were very interested in the processing model but objected to the exclusion of the name elements.

On the minus side, you can just use type attributes for sub specification of names, and Simple may run the risk of no longer being simple. Do we want to slide down that slippery slope?

tuurma commented 9 years ago

I think we quite consciously decided to exclude 'syntactic sugar' options for types and subtypes of names, all for the sake of leaving the editor with precisely one way of encoding things. To accommodate DTA and other corpora we provided a conversion piece from 'general TEI' to 'Simple TEI' that converts all the specific name elements and the like into typed generic ones. Funnily enough I can't find the conversion stylesheet on GitHub now.

sebastianrahtz commented 9 years ago

the naming thing is hard. we can put back all the specific ones, but then we'd have to remove the generic @type version. would that actually be better, i.e. not to support the generic typed form at all?

the conversion stylesheet is now in the TEI Stylesheets