cldf / cldf

CLDF: Cross-Linguistic Data Formats - the specification
https://cldf.clld.org
Apache License 2.0

Annotation #66

Closed cysouw closed 6 years ago

cysouw commented 6 years ago

Please review this thoroughly! The idea is to allow annotations on the columns of an alignment. Such columns have no reference of their own (because they are secondary columns), so we need a special construct for this.

The problem is that this involves something like a foreign key that points not to rows of another table but to a column of another table.

In detail: instead of specifying the row_ID in another table (and having the ontology figure out which table it should be), we now have to specify the table name (and have the ontology figure out the right column, viz. one of the type #alignment).

This PR would close the discussion in #51

cysouw commented 6 years ago

I added an example to the readme. We might combine Annotation and AnnotationType by using a secondary separator. Would that be a good idea?

Edit: I just looked at it again, and that just makes much more sense. I just removed the AnnotationType column and added a secondary separator. Should have thought about that earlier :-)

cysouw commented 6 years ago

Also: this all becomes a bit problematic because I would also like to have such annotations for parallel texts. I envisioned these annotations being useful for different kinds of alignments, but then of course we have to state explicitly which file contains the alignment. Maybe there is a simpler way to solve this?

For example: when such an alignmentAnnotationTable is part of a module:Wordlist dataset, then the annotations refer to cognates.csv. When it is part of a module:ParallelText dataset, then they refer to functionalEquivalents.csv.
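
Read this way, the rule is a plain lookup from module to target table. A minimal sketch, assuming only the module and file names mentioned above; the names `ANNOTATION_TARGETS` and `annotation_target` are hypothetical and not part of CLDF:

```python
# Hypothetical mapping from a dataset's module to the table whose
# alignment columns the annotations refer to (names as used in this thread).
ANNOTATION_TARGETS = {
    "Wordlist": "cognates.csv",
    "ParallelText": "functionalEquivalents.csv",
}


def annotation_target(module: str) -> str:
    """Return the table that alignment annotations refer to for `module`."""
    try:
        return ANNOTATION_TARGETS[module]
    except KeyError:
        # Unknown modules are rejected rather than guessed at.
        raise ValueError(f"no alignment table known for module {module!r}") from None


print(annotation_target("Wordlist"))
```

Whether such a rule should live in the ontology or in each dataset's metadata is exactly the open question here.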

@xrotwang How would we specify that in the ontology?

LinguList commented 6 years ago

While this seemed like extreme overkill to me at first sight, I see how this flexibility could make many kinds of annotation that I have so far applied in a fixed way much more flexible. For example, what we call "motivation structure" is nothing other than an annotation on a Segments instance with secondary segmentation:

| Segments | Motivation_structure |
| --- | --- |
| h au s + m ei s t e r | house master |
| a ⁵⁵ + i ³⁵ | a-prefix aunt |

We could use the model by @cysouw to render this as:

| Segment_Slice | Annotation | AnnotationType |
| --- | --- | --- |
| 1 | house | motivation |
| 2 | master | motivation |

Accordingly, we could handle what I call prosodic structure (paradigmatic example: the Chinese syllable, consisting of initial, medial, nucleus, final, tone). Note that here the Segment_Slice would use the primary splitter, not the secondary one:

| Segments | Prosodic_Structure |
| --- | --- |
| k w a ŋ ⁵⁵ | i m n f t |
| k w o ³⁵ | i m n t |
| k u ŋ ⁵⁵ | i n t |

This could then be rendered as:

| Segment_Slice | Annotation | Annotation_Type |
| --- | --- | --- |
| 1 | i | prosody |
| 2 | m | prosody |
| 3 | n | prosody |
| 4 | f | prosody |
| 5 | t | prosody |
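
The conversion from the compact two-column form to one row per slice can be sketched as follows. This is only an illustration under my own assumptions: the function name `annotation_rows` is hypothetical, the annotation string is always split on whitespace, and the slices are obtained by splitting the Segments value on a configurable separator (" " for prosody, " + " for motivation).

```python
def annotation_rows(segments, structure, annotation_type, separator=" "):
    """Expand a compact structure string into (slice, annotation, type) rows.

    `separator` decides what counts as a slice of `segments`:
    " " yields single segments, " + " yields morpheme-sized chunks.
    """
    slices = segments.split(separator)
    annotations = structure.split()
    if len(slices) != len(annotations):
        raise ValueError(
            f"{len(slices)} slices but {len(annotations)} annotations"
        )
    # Slice numbers are 1-based, as in the tables above.
    return [
        (index, annotation, annotation_type)
        for index, annotation in enumerate(annotations, start=1)
    ]


for row in annotation_rows("k w a ŋ ⁵⁵", "i m n f t", "prosody"):
    print(row)

for row in annotation_rows(
    "h au s + m ei s t e r", "house master", "motivation", separator=" + "
):
    print(row)
```

The length check is the useful part: it catches rows where the structure string and the segmentation have drifted apart.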

It looks extremely redundant, but I see the value of being able to also switch between names, and we don't need to be too religious about "motivation structure" and "prosodic structure" so far. There are examples for this, but we're still in the stage of trying to convince people that they should use it.

But this would mean that we'll need a Segmentannotation module, right? And we'd need to specify to which level of segmentation (primary vs. secondary) it applies. And note that this also justifies my use of " + " as a secondary splitter, as it allows me to make both a secondary and a primary segmentation at the same time, while a "+" as separator will prevent this (I will need to segment first on high level, then on low level, even if I don't need it!).

cysouw commented 6 years ago

Currently, the default separator for the Segment_Slice is a space " ", but for your first example you simply specify that it should be " + " and you are good to go.

I think you should not try to do both types of annotation in one annotation-file: basically there are two different kinds of annotations you want to make, one splitting by " + " for motivation and one splitting by " " for prosody. You can even specify the column with the + symbol as a column to be ignored.

Note that in the last version of the PR I have merged the two columns Annotation and Annotation_Type into a single format with a ":" separator:

prosody:i
prosody:m
motivation:house
motivation:master
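
Reading such merged cells back into type and value is straightforward. A minimal sketch, not part of the PR: the function names are hypothetical, and stripping whitespace around the separator is my own choice rather than something the proposal specifies.

```python
from collections import defaultdict


def parse_annotation(cell, separator=":"):
    """Split a merged 'type:value' cell into (type, value)."""
    annotation_type, found, value = cell.partition(separator)
    if not found:
        raise ValueError(f"missing {separator!r} in annotation {cell!r}")
    # Only the first separator splits; later ones stay in the value.
    return annotation_type.strip(), value.strip()


def group_by_type(cells):
    """Collect annotation values per annotation type, keeping input order."""
    grouped = defaultdict(list)
    for cell in cells:
        annotation_type, value = parse_annotation(cell)
        grouped[annotation_type].append(value)
    return dict(grouped)


print(group_by_type(["prosody:i", "prosody:m", "motivation:house"]))
```

Grouping by type makes it easy to process motivation and prosody annotations separately even when they share one column.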

xrotwang commented 6 years ago

Yes, the EAV model is typically very flexible; but at the cost of losing typed values, and generally blurring the line between data and schema. But maybe we could use such "free form" annotations as the breeding ground for new properties, which upon reaching maturity can be turned into a "proper" property in the ontology.

Still, could we postpone this to CLDF 1.1?

cysouw commented 6 years ago

BTW: I also find this construct rather impractical, and in my own workflow I mostly do not use separate tables for this. The alternative would be multiple headers (which I use personally), but that is even more difficult to specify cleanly.

I think this kind of annotationTable is really something I would only have generated by software, converting it from the other formats I use personally.

cysouw commented 6 years ago

@xrotwang I'm fine with postponing it for now, as it is not the highest priority. However, I definitely need it for publishing all the data I am working on. What is the timeframe for CLDF 1.1 ?

SimonGreenhill commented 6 years ago

Yes -- I think this is a perfect case for v1.1. We're close to wrapping up 1.0 now, and rushing to add another module is non-optimal (better to work through some of these issues above first before standardising on something that isn't as good as it could be?)

xrotwang commented 6 years ago

@cysouw timeframe for .X releases - i.e. ones which do not break backward compatibility - is rather flexible: As soon as something seems mature enough, maybe only because there's one tool that is making use of it, we'll add it to the spec and release.

So I'd say in your case, you can already publish using additional tables as you describe them here. If the component is accepted, all you'll have to do is adapt the JSON descriptions of your datasets, adding the propertyUrls.

LinguList commented 6 years ago

> Still, could we postpone this to CLDF 1.1?

I could still define our "motivation_structure" column in forms.csv for the Hill+List, once I update, I presume. As long as I specify it manually, right? In that case, I'm fine with postponing, but if it's not possible, I'd need some solution, as we mention in the paper that we have this annotation.

As to "breeding ground for new ideas": I like this a lot. It will allow me to model many things which I so far handle in additional rather customized columns (often already supported in lingpy/edictor), but it means we'd have some more time to have these things mature and become also recognized by more people, with ideally more examples. So v1.1 is perfect for me.

LinguList commented 6 years ago

@cysouw, as you have now convinced me of the usefulness, and as we both seem to be working in these directions (even if not with the same data), we should probably communicate and exchange some ideas when drafting this up for v1.1.

xrotwang commented 6 years ago

@LinguList I wouldn't throw out any of the properties we have now in the ontology, even if this introduces a bit of legacy right from the start. In the beginning, when the ontology is empty, it just happens to be easier to get stuff in than later on, when it's starting to get crowded :)

xrotwang commented 6 years ago

At least, the current motivationStructure can serve in future as a bad example of rushing things into the ontology :)

LinguList commented 6 years ago

But the motivation structure is something I am very proud of, I don't want her to be a bad example ;)

cysouw commented 6 years ago

@LinguList your motivation is beautiful :-)

As I see it, there are two issues:

This is a list of things I'm sure we want to include as alignmentAnnotationType because they occur in actual data (see original post in #51):

@xrotwang how would we describe this in the ontology?

LinguList commented 6 years ago

I have one more thing, which we are currently testing. We call it "correspondence pattern" (I mentioned this quickly in Poznan, but I still need to write up the draft with the algorithm). The idea is to say that certain columns of different alignments are compatible with each other. In fact, this is where classical reconstruction starts, when you have:

| alignment | segment slice | l1 | l2 | l3 | l4 |
| --- | --- | --- | --- | --- | --- |
| alm-1 | 1 | p | Ø | p | f |
| alm-2 | 1 | p | p | p | Ø |
| alm-3 | 2 | p | p | Ø | f |

I can compute these now with a new algorithm that has so far only been shown in a few talks, so it is completely fresh. What we are already able to do is to say: these three columns behave similarly. This is straightforward to add as an annotation in the current scheme. EDICTOR already computes the patterns based on a simplified algorithm, and I am working on a better version in Python, so this will be ready for version 1.1.
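
To make "behave similarly" concrete, here is a naive sketch of what compatibility between alignment columns could mean. It is not the algorithm mentioned above nor the one in EDICTOR. It assumes that Ø marks missing data, that two columns are compatible when they never disagree in a language where both have a value, and it groups columns greedily in input order. All function names are hypothetical.

```python
MISSING = "Ø"  # assumption: Ø stands for "no data in this language"


def compatible(col_a, col_b, missing=MISSING):
    """True if the columns never conflict where both have a value."""
    return all(a == b or missing in (a, b) for a, b in zip(col_a, col_b))


def merge(col_a, col_b, missing=MISSING):
    """Fill missing cells of col_a with the values from col_b."""
    return tuple(b if a == missing else a for a, b in zip(col_a, col_b))


def greedy_patterns(columns):
    """Group columns into patterns; returns a list of (consensus, member ids)."""
    patterns = []
    for column_id, values in columns.items():
        for i, (consensus, members) in enumerate(patterns):
            if compatible(consensus, values):
                patterns[i] = (merge(consensus, values), members + [column_id])
                break
        else:
            # No existing pattern fits, so this column starts a new one.
            patterns.append((tuple(values), [column_id]))
    return patterns


columns = {
    ("alm-1", 1): ("p", "Ø", "p", "f"),
    ("alm-2", 1): ("p", "p", "p", "Ø"),
    ("alm-3", 2): ("p", "p", "Ø", "f"),
}
for consensus, members in greedy_patterns(columns):
    print(consensus, members)
```

Under these assumptions the three columns from the table fall into a single pattern, and the pattern identifier could then be stored as an annotation on each (alignment, segment slice) pair. A greedy pass like this depends on input order, which is one reason a more careful algorithm is needed.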