Closed xrotwang closed 6 months ago
@Glottotopia thoughts? E.g. is there a somewhat controlled/comprehensive vocabulary of text genres?
https://tsezacp.clld.org/ is one relevant dataset.
@Glottotopia regarding multiple translations for an IGT line: This could also be modeled using a separate, custom table alternative_translations.csv, with columns languageReference, exampleReference and translatedText.
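A minimal sketch of how such a custom table might be consumed. The column names Example_ID, Language_ID and Translated_Text are assumptions standing in for the three properties named above, and the sample rows are made up for illustration:

```python
import csv
import io
from collections import defaultdict

# Hypothetical content of alternative_translations.csv. Column names and rows
# are illustrative assumptions, not something defined by CLDF.
ALT_TRANSLATIONS_CSV = """\
Example_ID,Language_ID,Translated_Text
ex1,eng,I came
ex1,deu,ich kam
ex2,eng,I saw
"""


def load_alternative_translations(fileobj):
    """Map each example ID to a {language ID: translation} dict."""
    translations = defaultdict(dict)
    for row in csv.DictReader(fileobj):
        translations[row["Example_ID"]][row["Language_ID"]] = row["Translated_Text"]
    return dict(translations)


alt = load_alternative_translations(io.StringIO(ALT_TRANSLATIONS_CSV))
print(alt["ex1"])
```

Keeping the extra translations in their own table leaves the ExampleTable untouched, at the cost of one more join for consumers.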
I've been modeling texts like this for now:
https://fl.mt/cldf-ldd/latest/components/texts/
It's based on your lapollaqiang dataset, I just put in a json column for metadata.
@Glottotopia thoughts? E.g. is there a somewhat controlled/comprehensive vocabulary of text genres?
We could have a look at language archives and see what is stored there.
There is always the risk of eurocentrism.
There are genres like "sonnet" which are clearly culture specific, but some things like "narrative", "recipe" or "song" might have a higher cross-cultural validity, to be determined.
I would rather go for a very small set of genres and only provide genres which are likely to be actually queried for.
So, currently examples.csv seems to be an unordered set. This ignores the fact that CSV has an inherent linearity: row 4 follows row 3, and this order will be preserved on updates, changes of row content, etc.
There are at least three ways to model linear order:
1) implicit
ID  text
1   ich kam
2   ich sah
3   ich siegte
2) links within table
ID  text        next
1   ich kam     2
2   ich sah     3
3   ich siegte  NULL
3) external table
examples.csv
ID  text
1   ich kam
2   ich sah
3   ich siegte
order.csv
1   sentences.csv  1
2   sentences.csv  2
3   sentences.csv  3
Option 2 would allow for arrangement of texts where the order of presentation is different from the order of storage. Option 3 would allow modeling different orders concurrently.
I see no use cases for either of them, so for me, there is no need to go for anything more complex than Option 1.
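To make the difference between the three options concrete, here is a rough sketch of how a consumer would recover the line order in each case. The function names are hypothetical, and the external table is simplified to (position, example ID) pairs, which is an assumption about what order.csv is meant to hold:

```python
def order_implicit(rows):
    """Option 1: the row order of the CSV file is the text order."""
    return [row["ID"] for row in rows]


def order_linked(rows):
    """Option 2: follow the 'next' links, starting at the row nothing points to."""
    by_id = {row["ID"]: row for row in rows}
    targets = {row["next"] for row in rows if row["next"] is not None}
    starts = [row_id for row_id in by_id if row_id not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one first row")
    ordered, seen, current = [], set(), starts[0]
    while current is not None:
        if current in seen:
            raise ValueError("cycle in 'next' links")
        seen.add(current)
        ordered.append(current)
        current = by_id[current]["next"]
    return ordered


def order_external(order_rows):
    """Option 3: a separate table of (position, example ID) pairs."""
    return [example_id for _, example_id in sorted(order_rows)]


# Rows deliberately stored out of presentation order.
rows = [
    {"ID": "3", "text": "ich siegte", "next": None},
    {"ID": "1", "text": "ich kam", "next": "2"},
    {"ID": "2", "text": "ich sah", "next": "3"},
]
print(order_implicit(rows))
print(order_linked(rows))
print(order_external([(2, "2"), (1, "1"), (3, "3")]))
```

Option 1 needs no code beyond reading the file, which is part of the argument for it. Options 2 and 3 both need validation (exactly one start, no cycles, no missing positions) before the order can be trusted.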
We have been using a column "Line_Number" in all those cases where we modeled Chinese data in CLDF, which are not glossed in the strict sense. Another possibility, which we are exploring now for texts, is to model them on a per-word basis, adding info on units (phrase, sentence) with order in extra columns.
Here's an example of how I handled texts in Chinese rhymes (which is probably not Eurocentric ;-)
I am afraid I cannot see the example :(
Line_Number should probably be fine. Values must be unique, and should be consecutive integers.
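A small sketch of what checking these two constraints could look like. The helper name is hypothetical, and it assumes the rows passed in all belong to one text, so uniqueness is checked per text rather than across the whole table:

```python
def check_line_numbers(rows, column="Line_Number"):
    """Return a list of problems found in the line numbering of one text."""
    numbers = [int(row[column]) for row in rows]
    unique = sorted(set(numbers))
    problems = []
    if len(unique) != len(numbers):
        problems.append("duplicate values")
    # Consecutive means: no holes between the smallest and largest number.
    if unique and unique != list(range(unique[0], unique[0] + len(unique))):
        problems.append("gaps in numbering")
    return problems


good = [{"Line_Number": "1"}, {"Line_Number": "2"}, {"Line_Number": "3"}]
duplicated = [{"Line_Number": "1"}, {"Line_Number": "2"}, {"Line_Number": "2"}]
gappy = [{"Line_Number": "1"}, {"Line_Number": "3"}]

print(check_line_numbers(good))
print(check_line_numbers(duplicated))
print(check_line_numbers(gappy))
```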
Maybe TextCorpus could rather be a CLDF module, i.e. a collection of CLDF components with module-specific semantics. A ParameterTable in a StructureDataset is typically the list of features while it is the list of concepts in a Wordlist. So in a TextCorpus, the ExampleTable would be the list of sentences and ContributionTable could be the list of texts.
To elaborate on the "TextCorpus as module" idea: This model is almost followed for the texts of the Tsez Annotated Corpus. Almost, because what I actually did was extract a morpheme dictionary from the glossed sentences and then turn the result into a CLDF Dictionary. While this "works", I think it is not very transparent, and with a CLDF module TextCorpus, the texts would actually belong in a TextCorpus, and we may use a FormTable in a TextCorpus to provide a reverse lookup like the one used for Tsez: https://tsezacp.clld.org/units
And one more thought on "TextCorpus as module": I think modules add a useful abstraction layer that helps with re-use (e.g. the Wordlist / StructureDataset distinction at least adjusts expectations, but also helps with automated re-use: plotting dots on maps for wordlists is typically not a good idea, etc.).
There's a small price to pay for this: There are data collections which have more than one type of data. E.g. the Hindu Kush Areal Typology data contains typological data as well as wordlists. Thus, this data is partitioned into two CLDF datasets. While this makes CLDF creation (via cldfbench) a bit more complicated, it's pretty transparent for data consumers. There are two CLDF metadata files in https://github.com/cldf-datasets/liljegrenhindukush/tree/master/cldf but they can even share the LanguageTable. Thus, a StructureDataset could easily include the sentences of a TextCorpus and re-purpose them as examples for structural features.
My Yawarana corpus dataset (https://github.com/caribank/yawarana-corpus-cldf) has an exampleparts.csv table that links wordforms.csv to examples.csv (with a positional index). Because morphs.csv is linked in the same way to wordforms.csv (and morphs.csv is linked via Morpheme_ID to morphemes.csv), a reverse lookup can easily be done from morpheme to example, and because examples.csv has a Text_ID (and a Sentence_Number) column, the example is also linked to the text.
You can see an example of said lookup here: https://yawarana-sketch.herokuapp.com/morphs/akere-with
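The chain of links described above can be sketched roughly as follows. This is not the dataset's actual code: the table name wordformparts, the column names Morph_ID, Wordform_ID, Example_ID and Index, and all rows are assumptions made up for illustration. Only Morpheme_ID, Text_ID and Sentence_Number are taken from the description.

```python
# Toy stand-ins for the CSV tables, already parsed into dicts.
morphs = [
    {"ID": "m1", "Morpheme_ID": "morpheme-a"},
    {"ID": "m2", "Morpheme_ID": "morpheme-b"},
]
wordformparts = [  # hypothetical table linking morphs to wordforms
    {"Morph_ID": "m1", "Wordform_ID": "w1", "Index": 0},
    {"Morph_ID": "m2", "Wordform_ID": "w2", "Index": 0},
]
exampleparts = [  # links wordforms to examples, with a positional index
    {"Wordform_ID": "w1", "Example_ID": "ex2", "Index": 1},
    {"Wordform_ID": "w1", "Example_ID": "ex1", "Index": 0},
    {"Wordform_ID": "w2", "Example_ID": "ex1", "Index": 1},
]
examples = [
    {"ID": "ex1", "Text_ID": "t1", "Sentence_Number": 1},
    {"ID": "ex2", "Text_ID": "t1", "Sentence_Number": 2},
]


def examples_for_morpheme(morpheme_id):
    """Walk morpheme -> morphs -> wordforms -> examples, sorted by text position."""
    morph_ids = {m["ID"] for m in morphs if m["Morpheme_ID"] == morpheme_id}
    wordform_ids = {
        p["Wordform_ID"] for p in wordformparts if p["Morph_ID"] in morph_ids
    }
    example_ids = {
        p["Example_ID"] for p in exampleparts if p["Wordform_ID"] in wordform_ids
    }
    hits = [e for e in examples if e["ID"] in example_ids]
    return sorted((e["Text_ID"], e["Sentence_Number"], e["ID"]) for e in hits)


print(examples_for_morpheme("morpheme-a"))
```

Because each example carries its text ID and sentence number, the result can be presented in text order without any further lookup.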
I see you are using a Generic dataset for the corpus. So I guess you would be ok with a more meaningful module for this, too?
Absolutely -- I only used Generic because there's nothing else (and I didn't want to go through the hassle of trying to implement my own module...). As long as I can have the functionalities I need, I'm good.
https://fl.mt/cldf-ldd/latest/components/ gives a good overview of what I currently have.
Multi-level ordering of lines, i.e. rows in ExampleTable, could be achieved with a list-valued ordinal column.
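As a rough illustration of the idea, a list-valued ordinal can be turned into a tuple and used directly as a sort key. The column name Ordinal, the space separator and the sample rows are assumptions for this sketch:

```python
def ordinal_key(value, separator=" "):
    """Parse a list-valued ordinal such as '1 10' into a sortable tuple of ints."""
    return tuple(int(part) for part in value.split(separator))


lines = [
    {"ID": "c", "Ordinal": "2 1"},   # second paragraph, first line
    {"ID": "a", "Ordinal": "1 1"},
    {"ID": "d", "Ordinal": "1 10"},
    {"ID": "b", "Ordinal": "1 2"},
]

ordered = sorted(lines, key=lambda row: ordinal_key(row["Ordinal"]))
ids = [row["ID"] for row in ordered]
print(ids)
```

Comparing integer tuples rather than strings matters here: as strings, "1 10" would sort before "1 2".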
Texts, i.e. multiline IGTs, can only be modeled implicitly now, because there's no way to explicitly group and order the lines.
A lightweight solution could be adding a list-valued exampleReference property to ContributionTable. But maybe we'd want to add text-specific metadata, too, e.g. genre.
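A minimal sketch of how a consumer might resolve such a list-valued reference into the ordered lines of a text. The column names Example_IDs and Genre, the space separator and the sample data are assumptions for illustration only:

```python
# One row of a hypothetical ContributionTable with a list-valued example reference.
contribution = {
    "ID": "text1",
    "Name": "Sample text",
    "Genre": "narrative",            # text-specific metadata, as suggested above
    "Example_IDs": "ex2 ex1 ex3",    # order of the list is the order of the lines
}

# ExampleTable rows keyed by ID; the storage order does not matter here.
examples_by_id = {
    "ex1": "ich sah",
    "ex2": "ich kam",
    "ex3": "ich siegte",
}


def text_lines(contribution_row, examples, separator=" "):
    """Return the lines of a text in the order given by its example references."""
    return [examples[ex_id] for ex_id in contribution_row["Example_IDs"].split(separator)]


print(text_lines(contribution, examples_by_id))
```

With this layout, grouping and ordering both live on the text row, so the ExampleTable itself needs no additional columns.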