cldf-datasets / doreco

CLDF dataset derived from DoReCo's core corpus
https://doreco.info/

X-Sampa to CLTS #3

Closed FredericBlum closed 1 year ago

FredericBlum commented 2 years ago

For the transcription, all phones are currently in X-Sampa and need to be converted to CLTS.

FredericBlum commented 2 years ago

@Lingulist You mentioned something about extracting concordances for this conversion, but I am not sure what you are referring to. Could you elaborate briefly?

LinguList commented 2 years ago

I recommend reading our paper on IGT (List, Sims, Forkel 2020) for this purpose, where we mention this (Robert has developed the package further by now).

xrotwang commented 2 years ago

pyigt can be used to extract word/morpheme concordances, not phoneme concordances. So I don't think it's relevant for X-Sampa to CLTS conversion.

LinguList commented 2 years ago

It depends on the corpus structure. I thought we would first get a concordance of words and then convert those to CLTS/BIPA with the typical orthography profile procedure from pylexibank. Here, you would use a concordance to get those lexemes, right?
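To illustrate the word-level route, here is a minimal sketch of what applying an orthography profile amounts to: a greedy longest-match segmentation of each word form, followed by a lookup of each grapheme. The `PROFILE` table and the `segment` helper are hypothetical and only cover a handful of X-SAMPA symbols; this is not the pylexibank implementation and not the profile used in this repo.

```python
# Hypothetical mini-profile mapping X-SAMPA graphemes to IPA.
# A real profile would be a TSV file with many more rows.
PROFILE = {
    "tS": "tʃ",
    "t_h": "tʰ",
    "a:": "aː",
    "S": "ʃ",
    "N": "ŋ",
    "@": "ə",
    "t": "t",
    "a": "a",
    "i": "i",
    "k": "k",
}


def segment(word, profile):
    """Split `word` by greedy longest match against the profile keys
    and return the mapped segments. Unknown characters are wrapped in
    angle brackets so they are easy to spot and add to the profile."""
    longest = max(len(grapheme) for grapheme in profile)
    segments = []
    i = 0
    while i < len(word):
        # Try the longest possible chunk first, then shrink.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in profile:
                segments.append(profile[chunk])
                i += size
                break
        else:
            # No grapheme matched at this position.
            segments.append("<{}>".format(word[i]))
            i += 1
    return segments


print(" ".join(segment("tSa:N", PROFILE)))  # tʃ aː ŋ
```

The longest-match step matters because multi-character graphemes such as `tS` or `t_h` would otherwise be split into their single-character parts.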

LinguList commented 2 years ago

But if that is not the case, one needs to use segments directly, which changes the procedure of applying the profile.
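As a rough sketch of the segment-level variant: if the corpus already stores one phone per unit, no tokenization is needed and the profile reduces to a plain lookup per phone. The `SAMPA_TO_IPA` table and the `convert_phones` helper below are hypothetical names for illustration, with only a few X-SAMPA symbols filled in.

```python
# Hypothetical lookup table from X-SAMPA phones to IPA.
SAMPA_TO_IPA = {
    "S": "ʃ",
    "N": "ŋ",
    "@": "ə",
    "t_h": "tʰ",
    "a:": "aː",
    "k": "k",
}


def convert_phones(phones, mapping):
    """Convert a list of already segmented phones one by one.

    Returns the converted list and the set of phones that were not
    found in the mapping (these are passed through unchanged)."""
    converted = []
    unknown = set()
    for phone in phones:
        if phone in mapping:
            converted.append(mapping[phone])
        else:
            converted.append(phone)
            unknown.add(phone)
    return converted, unknown


ipa, missing = convert_phones(["t_h", "a:", "N", "X1"], SAMPA_TO_IPA)
print(ipa)      # ['tʰ', 'aː', 'ŋ', 'X1']
print(missing)  # {'X1'}
```

Collecting the unmatched phones in one pass gives a list of symbols that still need a row in the profile.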

xrotwang commented 2 years ago

Ah, ok. Yes, one could do that, although I wouldn't want to bring all the pylexibank machinery into this repo. So maybe we should

Then, copy the profiles back here and add the CLTS conversion to the makecldf command.

LinguList commented 2 years ago

Yes, sounds like a plan. We have a rather complete sampa profile. Need to look that up when I find time. It may be in the orthograpy repo...

LinguList commented 2 years ago

It is https://github.com/orthograpy/orthograpy, if I am not mistaken.

xrotwang commented 1 year ago

See https://github.com/cldf-datasets/doreco/blob/main/etc/orthography.tsv