cldf / cldf

CLDF: Cross-Linguistic Data Formats - the specification
https://cldf.clld.org
Apache License 2.0

How to model Texts #151

Closed xrotwang closed 6 months ago

xrotwang commented 7 months ago

Texts, i.e. multiline IGTs, can only be modeled implicitly now, because there's no way to explicitly group and order the lines.

A lightweight solution could be adding a list-valued exampleReference property to ContributionTable. But maybe we'd want to add text-specific metadata, too, e.g. genre.
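
As a rough illustration of the list-valued idea, the sketch below reads a contributions table in which each text lists its example IDs in order. The column name `Example_IDs`, the `;` separator, and the row itself are assumptions made up for this sketch, not part of the CLDF specification.

```python
import csv
import io

# Hypothetical contributions.csv: one row per text, with a list-valued
# column holding the IDs of its lines (rows in ExampleTable) in order.
CONTRIBUTIONS = """\
ID,Name,Example_IDs
t1,Veni vidi vici,ex1;ex2;ex3
"""


def lines_per_text(csv_text, separator=";"):
    """Map each text ID to the ordered list of example IDs it references."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["ID"]: row["Example_IDs"].split(separator) for row in reader}


text_lines = lines_per_text(CONTRIBUTIONS)
```

With such a column, grouping and ordering are both carried by the list itself, so no extra column is needed in the example table.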

xrotwang commented 7 months ago

@Glottotopia thoughts? E.g. is there a somewhat controlled/comprehensive vocabulary of text genres?

xrotwang commented 7 months ago

https://tsezacp.clld.org/ is one relevant dataset.

xrotwang commented 7 months ago

@Glottotopia regarding multiple translations for an IGT line: This could also be modeled using a separate, custom table alternative_translations.csv, with columns languageReference, exampleReference and translatedText.
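
A minimal sketch of how such a custom table might be consumed, assuming the three columns named above. The rows, the example IDs, and the language identifiers `eng` and `fra` are invented for illustration; in a real dataset `languageReference` would point into the LanguageTable.

```python
import csv
import io

# Hypothetical contents of alternative_translations.csv.
ALTERNATIVE_TRANSLATIONS = """\
languageReference,exampleReference,translatedText
eng,ex1,I came
fra,ex1,je suis venu
eng,ex2,I saw
"""


def translations_by_example(csv_text):
    """Group translations by example ID, keyed by translation language."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        per_language = grouped.setdefault(row["exampleReference"], {})
        per_language[row["languageReference"]] = row["translatedText"]
    return grouped


translations = translations_by_example(ALTERNATIVE_TRANSLATIONS)
```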

fmatter commented 6 months ago

I've been modeling texts like this for now:

https://fl.mt/cldf-ldd/latest/components/texts/

It's based on your lapollaqiang dataset, I just put in a json column for metadata.

Glottotopia commented 6 months ago

> @Glottotopia thoughts? E.g. is there a somewhat controlled/comprehensive vocabulary of text genres?

We could have a look at language archives and see what is stored there.

There is always the risk of eurocentrism.

There are genres like "sonnet" which are clearly culture specific, but some things like "narrative", "recipe" or "song" might have a higher cross-cultural validity, to be determined.

I would rather go for a very small set of genres and only provide genres which are likely to be actually queried for.

Glottotopia commented 6 months ago

> I've been modeling texts like this for now:
>
> https://fl.mt/cldf-ldd/latest/components/texts/
>
> It's based on your lapollaqiang dataset, I just put in a json column for metadata.

So, currently examples.csv seems to be treated as an unordered set. This ignores the fact that CSV has an inherent linearity: row 4 follows row 3, and this order will be preserved on updates, changes of row content, etc.

There are at least three ways to model linear order:

1) implicit

ID  text       
1   ich kam 
2   ich sah 
3   ich siegte

2) links within table

ID text        next
1  ich kam     2
2  ich sah     3
3  ich siegte  NULL

3) external table

examples.csv

ID text       
1   ich kam 
2   ich sah 
3   ich siegte

order.csv

1 sentences.csv  1 
2 sentences.csv  2 
3 sentences.csv  3 
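
The three options can be compared with a small sketch that recovers the same line order from each representation. The data structures are simplified stand-ins for the tables above, and the linked variant assumes a single chain without cycles.

```python
# Option 1: implicit order is simply the row order of the file.
rows = [
    {"ID": "1", "text": "ich kam"},
    {"ID": "2", "text": "ich sah"},
    {"ID": "3", "text": "ich siegte"},
]
implicit = [row["text"] for row in rows]

# Option 2: rows may be stored in any order; "next" links give the sequence.
linked = {
    "2": {"text": "ich sah", "next": "3"},
    "3": {"text": "ich siegte", "next": None},
    "1": {"text": "ich kam", "next": "2"},
}


def follow_links(table):
    """Walk the chain, starting from the one row that no other row points to."""
    targets = {row["next"] for row in table.values()}
    (start,) = [row_id for row_id in table if row_id not in targets]
    ordered, current = [], start
    while current is not None:
        ordered.append(table[current]["text"])
        current = table[current]["next"]
    return ordered


# Option 3: an external table lists example IDs in presentation order.
order = ["1", "2", "3"]
texts = {row["ID"]: row["text"] for row in rows}
external = [texts[example_id] for example_id in order]
```

Only Option 1 needs no lookup step, which fits the argument that it is the simplest choice when storage order and presentation order coincide.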

Glottotopia commented 6 months ago

Option 2 would allow for an arrangement of texts where the order of presentation differs from the order of storage. Option 3 would allow modeling different orders concurrently.

I see no use cases for either of them, so for me, there is no need to go for anything more complex than Option 1.

LinguList commented 6 months ago

We have been using a column "Line_Number" in all those cases where we modeled Chinese data in CLDF, which are not glossed in the strict sense. Another possibility, which we are exploring now for texts, is to model them on a per-word basis, adding info on units (phrase, sentence) with order in extra columns.

LinguList commented 6 months ago

Here's an example on how I handled texts in Chinese rhymes (which is probably not Eurocentric ;-)

Glottotopia commented 6 months ago

I am afraid I cannot see the example :(

Glottotopia commented 6 months ago

Line_Number should probably be fine. Values must be unique, and should be consecutive integers.
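
The two constraints could be checked along these lines. This is only a sketch: it takes the Line_Number values of one text as a list of integers and does not require the numbering to start at 1, which is an assumption, since the thread does not settle that point.

```python
def check_line_numbers(line_numbers):
    """Return True if the values are unique and form a gap-free run of integers."""
    numbers = sorted(line_numbers)
    if not numbers:
        return True
    unique = len(set(numbers)) == len(numbers)
    # For unique integers, "no gaps" means the span equals the count minus one.
    gap_free = numbers[-1] - numbers[0] == len(numbers) - 1
    return unique and gap_free
```

In a table holding several texts, a check like this would presumably run once per text rather than over the whole column.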

LinguList commented 6 months ago

https://github.com/hanproj/baxterocrhymes

xrotwang commented 6 months ago

> I've been modeling texts like this for now:
>
> https://fl.mt/cldf-ldd/latest/components/texts/

Maybe TextCorpus could rather be a CLDF module, i.e. a collection of CLDF components with module-specific semantics. A ParameterTable in a StructureDataset is typically the list of features while it is the list of concepts in a Wordlist. So in a TextCorpus, the ExampleTable would be the list of sentences and ContributionTable could be the list of texts.

xrotwang commented 6 months ago

To elaborate on the "TextCorpus as module" idea: This model is almost followed for the texts of the Tsez Annotated Corpus. Almost, because what I actually did was extract a morpheme dictionary from the glossed sentences and then turn the result into a CLDF Dictionary. While this "works", I think it is not very transparent, and with a CLDF module TextCorpus, the texts would actually belong in a TextCorpus - and we may use a FormTable in a TextCorpus to provide a reverse lookup like the one used for Tsez: https://tsezacp.clld.org/units

xrotwang commented 6 months ago

And one more thought on "TextCorpus as module": I think modules add a useful abstraction layer that helps with re-use (e.g. the Wordlist / StructureDataset distinction at least adjusts expectations, but also helps with automated re-use: plotting dots on maps for wordlists is typically not a good idea, etc.).

There's a small price to pay for this: There are data collections which have more than one type of data. E.g. the Hindu Kush Areal Typology data contains typological data as well as wordlists. Thus, this data is partitioned into two CLDF datasets. While this makes CLDF creation (via cldfbench) a bit more complicated, it's pretty transparent for data consumers. There are two CLDF metadata files in https://github.com/cldf-datasets/liljegrenhindukush/tree/master/cldf but they can even share the LanguageTable. Thus, a StructureDataset could easily include the sentences of a TextCorpus and re-purpose them as examples for structural features.

fmatter commented 6 months ago

My Yawarana corpus dataset has an exampleparts.csv table that links wordforms.csv to examples.csv (with a positional index). Because morphs.csv are linked in the same way to wordforms.csv (and morphs.csv are linked via Morpheme_ID to morphemes.csv), a reverse lookup can easily be done from morpheme to example, and because examples.csv has a Text_ID (and a Sentence_Number) column, the example is also linked to the text.

You can see an example of said lookup here: https://yawarana-sketch.herokuapp.com/morphs/akere-with
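
The chain of links described above can be sketched with in-memory tables. Only `Morpheme_ID`, `Text_ID` and `Sentence_Number` are column names taken from the comment; every other name, ID and row here (including the `wordform_parts` linking table) is a hypothetical stand-in and may not match the actual dataset.

```python
# Hypothetical miniature versions of the tables described above.
morphs = [{"ID": "m1", "Morpheme_ID": "akere"}]
wordform_parts = [{"Wordform_ID": "w1", "Morph_ID": "m1", "Index": 0}]
example_parts = [{"Example_ID": "ex7", "Wordform_ID": "w1", "Index": 2}]
examples = {"ex7": {"Text_ID": "t1", "Sentence_Number": 4}}


def examples_for_morpheme(morpheme_id):
    """Reverse lookup: morpheme -> morphs -> wordforms -> examples (with text)."""
    morph_ids = {m["ID"] for m in morphs if m["Morpheme_ID"] == morpheme_id}
    wordform_ids = {
        part["Wordform_ID"]
        for part in wordform_parts
        if part["Morph_ID"] in morph_ids
    }
    hits = []
    for part in example_parts:
        if part["Wordform_ID"] in wordform_ids:
            example = examples[part["Example_ID"]]
            hits.append(
                (example["Text_ID"], example["Sentence_Number"], part["Example_ID"])
            )
    # Sorting by (text, sentence number) lists the hits in reading order.
    return sorted(hits)
```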

xrotwang commented 6 months ago

I see you are using a Generic dataset for the corpus. So I guess you would be ok with a more meaningful module for this, too?

Florian Matter @.***> wrote on Tue, 9 Jan 2024, 17:11:

> My Yawarana corpus dataset https://github.com/caribank/yawarana-corpus-cldf has an exampleparts.csv table that links wordforms.csv to examples.csv (with a positional index). Because morphs.csv are linked in the same way to wordforms.csv (and morphemes.csv to morphs.csv), a reverse lookup can easily be done from morpheme to example, and because examples.csv has a Text_ID (and a Sentence_Number) column, the example is also linked to the text.
>
> You can see an example of said lookup here: https://yawarana-sketch.herokuapp.com/morphs/akere-with


fmatter commented 6 months ago

Absolutely -- I only used Generic because there's nothing else (and I didn't want to go through the hassle of trying to implement my own module...). As long as I can have the functionalities I need, I'm good.

https://fl.mt/cldf-ldd/latest/components/ gives a good overview of what I currently have.

xrotwang commented 6 months ago

Multi-level ordering of lines, i.e. rows in ExampleTable, could be achieved with a list-valued ordinal column.
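
One way to read this, sketched under assumptions: each row carries a list such as paragraph number and line number, and rows are sorted by comparing those lists as integer tuples. The column name `Ordinal`, the `;` separator, and the rows are hypothetical.

```python
def ordinal_key(value, separator=";"):
    """Turn a list-valued ordinal such as '2;1' into a sortable tuple of ints."""
    return tuple(int(part) for part in value.split(separator))


# Hypothetical ExampleTable rows, deliberately stored out of order.
rows = [
    {"ID": "c", "Ordinal": "2;1"},
    {"ID": "a", "Ordinal": "1;1"},
    {"ID": "d", "Ordinal": "10;1"},
    {"ID": "b", "Ordinal": "1;2"},
]

ordered_ids = [
    row["ID"] for row in sorted(rows, key=lambda row: ordinal_key(row["Ordinal"]))
]
```

Comparing integer tuples rather than raw strings matters here: as strings, "10;1" would sort before "2;1".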