LingSIG / wordAttributes

work space for a coherent proposal for inline attributes of <w> in TEI XML

related work #6

eduarddrenth closed 6 years ago

eduarddrenth commented 7 years ago

Dear people,

In this repo: https://bitbucket.org/teibestpractices/linguistic-customization you will find related work.

Perhaps you are interested in merging / collaborating?

eduarddrenth commented 7 years ago

repo wasn't public, now it is...

bansp commented 7 years ago

Dear Eduard, Great, thanks for this. I'll send you an invite from the repository that you are welcome to use as you see fit ;-) Myself, I'm going to be away on vacation for a week, having deliberately and forcefully ;-) left my laptop behind.

eduarddrenth commented 7 years ago

Also, https://proycon.github.io/folia/ may be interesting as input, since it focuses on linguistic encoding in XML. See opensonar and nederlab for large corpora using folia.

bansp commented 7 years ago

Thanks again for the link and for your participation in the discussion. My impression (and I admit to having primarily looked at the homepage rather than inside the specs; Susanne looked at the specs in more detail and shared with me) is that our goals differ. It seems that you are attempting to create a robust system of inline annotation. This is something where I (I can only speak for myself) would resort to multi-layer annotation with clear separation of e.g. tokenization and morphological/morphosyntactic layers, and again a separate syntactic layer (or: multiple separate syntactic layers).

By contrast, what we intentionally attempt here, in the feature request produced in the "wordAttributes" project, is to address relatively shallow annotation. Our goal here is to be simple, bordering on simplistic. This is why the proposal concerns modifications directly in the TEI namespace: we really want to introduce minimal modifications to off-the-shelf TEI that would cater to the crowd who simply want to have POS information and potentially info on morphosyntax (if it comes separately from POS). Since a large part of the resources that require such simple information containers are historical corpora, we also talk about @reg, to make the picture reasonably complete. And the final attribute is a result of a certain technological feature that many annotators still don't realise nowadays, namely the significance of invisible markup in the form of whitespace. That is all.
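
As an illustration of what such shallow inline annotation might look like in practice, here is a minimal Python sketch that reads POS, morphosyntax, and regularized forms straight off `<w>` elements. The attribute names `pos` and `msd`, the sample sentence and its values, and the helper `read_tokens` are assumptions made for illustration; only `reg` is named above, and the whitespace-related attribute is left out.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample. The attribute names "pos" and "msd" are assumed
# for illustration; "reg" is the attribute mentioned in the discussion.
# The tag values are invented.
SAMPLE = f"""
<s xmlns="{TEI_NS}">
  <w pos="DET" reg="the">ye</w>
  <w pos="ADJ" reg="old">olde</w>
  <w pos="NOUN" msd="Number=Sing" reg="shop">shoppe</w>
</s>
"""


def read_tokens(xml_text):
    """Return one (form, pos, msd, reg) tuple per <w> element.

    Attributes that are absent on a token come back as None.
    """
    root = ET.fromstring(xml_text.strip())
    rows = []
    # Elements live in the TEI namespace; the attributes are unprefixed,
    # so they are looked up by their plain names.
    for w in root.iter(f"{{{TEI_NS}}}w"):
        rows.append((w.text, w.get("pos"), w.get("msd"), w.get("reg")))
    return rows


if __name__ == "__main__":
    for row in read_tokens(SAMPLE):
        print(row)
```

Everything a consumer needs sits on the token itself, so no second annotation layer has to be resolved. That is the trade-off under discussion: easy to read and write, but not designed for deeper or overlapping analyses.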

We intentionally leave many issues aside. For example, we only brush against the issue of multi-token word forms; we don't want to solve it at this level. Rather, we want to say that this issue marks one of the borders of our approach.

Thanks for sharing! :-)

bansp commented 6 years ago

Thanks for the input, closing this issue after a merger with the TEI main branch.