eduarddrenth closed 6 years ago
repo wasn't public, now it is...
Dear Eduard, Great, thanks for this. I'll send you an invite from the repository that you are welcome to use as you see fit ;-) Myself, I'm going to be away on vacation for a week, having deliberately and forcefully ;-) left my laptop behind.
Also, this https://proycon.github.io/folia/ may be interesting as input, since it focuses on linguistic encoding in XML. See opensonar and nederlab for large corpora using folia.
Thanks again for the link and for your participation in the discussion. My impression (and I admit to having looked primarily at the homepage rather than inside the specs; Susanne looked at the specs in more detail and shared her findings with me) is that our goals differ. It seems that you are attempting to create a robust system of inline annotation. For that, I (and I can only speak for myself) would resort to multi-layer annotation with a clear separation of, e.g., tokenization and morphological/morphosyntactic layers, plus a separate syntactic layer (or multiple separate syntactic layers).
By contrast, what we intentionally attempt here, in the feature request produced in the "wordAttributes" project, is to address relatively shallow annotation. Our goal is to be simple, bordering on simplistic. This is why the proposal concerns modifications directly in the TEI namespace: we want to introduce minimal modifications to off-the-shelf TEI that would cater to the crowd who simply want POS information and potentially information on morphosyntax (if it comes separately from POS). Since a large part of the resources that require such simple information containers are historical corpora, we also talk about @reg, to make the picture reasonably complete. The final attribute results from a technological feature that many annotators still don't realise nowadays, namely the import of invisible markup in the form of whitespace. That is all.
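To make the "shallow annotation" idea concrete, here is a minimal sketch of how word-level attributes could be read back out of a TEI fragment. It is only an illustration under stated assumptions: @reg is the one attribute named in this thread, while `pos` and `msd` are hypothetical names standing in for the POS and morphosyntax containers, and the sample sentence and the helper `word_annotations` are invented for the example. The whitespace-related attribute is left out because the thread does not name it.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment. Only @reg is named in the discussion;
# "pos" and "msd" are assumed attribute names used for illustration.
SAMPLE = """
<s xmlns="http://www.tei-c.org/ns/1.0">
  <w pos="DET" reg="the">ye</w>
  <w pos="ADJ" reg="old">olde</w>
  <w pos="NOUN" msd="Number=Sing" reg="shop">shoppe</w>
</s>
"""


def word_annotations(xml_text):
    """Return one dict per <w> element with its form and shallow annotations.

    Attributes that are absent on a word are reported as None.
    """
    root = ET.fromstring(xml_text.strip())
    rows = []
    for w in root.iter(f"{{{TEI_NS}}}w"):
        rows.append({
            "form": (w.text or "").strip(),
            "pos": w.get("pos"),
            "msd": w.get("msd"),
            "reg": w.get("reg"),
        })
    return rows


if __name__ == "__main__":
    for row in word_annotations(SAMPLE):
        print(row)
```

Everything a simple consumer needs sits on the word element itself, so no stand-off layers have to be resolved; that is the trade-off against the multi-layer approach mentioned above.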
We intentionally leave many issues aside. For example, we only brush against the issue of multi-token word forms: we don't want to solve it at this level. Rather, we want to say that this issue marks one of the borders of our approach.
Thanks for sharing! :-)
Thanks for the input, closing this issue after a merger with the TEI main branch.
Dear people,
In this repo: https://bitbucket.org/teibestpractices/linguistic-customization you will find related work.
Perhaps you are interested in merging / collaborating?