bhallen / icelandic-transcriber

Simple tool for converting Icelandic orthography into broad phonetic transcriptions
BSD 3-Clause "New" or "Revised" License

Rule ordering decision #2

Open bhallen opened 10 years ago

bhallen commented 10 years ago

So far we've been assuming that we'll worry about the ordering of different rules later, but I think we should do a little planning ahead.

How about this: our first set of rules converts all digraphs and diphthongs into transcriptions, and then the second set converts all other individual segments into transcriptions, and only then do we have the third set which does various "phonological" transformations (voicing, epenthesis, etc.). We'll of course need some ordering within the third set, and hopefully no necessary orderings within the first or second.

This means, then, that things like our expansion of C/V should be into transcriptions, not orthography, since (I hope!) none of the basic conversion of digraphs/diphthongs/single-segments will need those particular contexts...

bhallen commented 10 years ago

Let me clarify that my proposal may require doing some things that would otherwise be superfluous, for example: initially all [g] in the orthography will get converted to IPA [ɡ], and only afterwards will intervocalic [ɡ] be converted into [ɣ]. So, essentially, I guess we'd be constructing "URs" before carrying out "phonology". This seems to me like the most straightforward way to get rid of the issue of whether rule environments should contain orthography or transcriptions.
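
Roughly, in code, this could look like the following (the rule contents below are just illustrative placeholders, not our real inventory):

    import re

    # Placeholder rule sets; the real ones would come from our mapping file.
    DIGRAPH_RULES = [("au", "øy")]              # set 1: digraphs and diphthongs
    SEGMENT_RULES = [("y", "i"), ("g", "ɡ")]    # set 2: remaining single segments (the "URs")
    PHONO_RULES = [
        # set 3: "phonology", e.g. lenition of the intervocalic ɡ produced by set 2
        (r"(?<=[aɛeiɪoɔuʏøœy])ɡ(?=[aɛeiɪoɔuʏøœy])", "ɣ"),
    ]

    def transcribe(word):
        """Apply the three rule sets strictly in this order."""
        for ruleset in (DIGRAPH_RULES, SEGMENT_RULES, PHONO_RULES):
            for pattern, replacement in ruleset:
                word = re.sub(pattern, replacement, word)
        return word

    print(transcribe("dagur"))   # these placeholder rules give "daɣur": g -> ɡ first, then intervocalic ɡ -> ɣ

The redundant g -> ɡ step is exactly the "UR" construction: it costs one extra rule, but it lets the phonological rules mention only transcription symbols.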

a-martin commented 10 years ago

In theory, this is the cleanest approach, and definitely how I would like to do it. It hinges, however, on the stability of the grapheme-to-phoneme correspondence, which I'm willing to bet won't get us all the way. There may be rules that depend on the underlying nature of a sound and therefore need access to the orthography (which is all we have for underlying representations in this program); those would have to be part of the first set of rules you describe.

If any graphemes become unrecoverable after the first set of rules (because two graphemes change to the same thing) but are implicated in different later rules, then those rules will need to apply before the change. I don't think I'm being clear, but I agree with your method as a starting point, and as we run into problems we can change the order. We should start with a basic conversion of as many sounds as possible.

I personally think that redundant rules are not a big deal at all: at worst they add a line of code, and at best they increase transparency, so the g-to-ɡ rule is not a problem IMO.

a-martin commented 10 years ago

In fact this might be more complicated than it seems. Take this example:

We want to resolve the digraph ‹au› as /øy/. We cannot do this until we have already converted the grapheme ‹y› to /i/, because the symbol y is ambiguous between the grapheme and the IPA. One thing we could maybe do to cheat is force the input to uppercase and then say that uppercase is orthography and lowercase is our output (sketched below). Otherwise we'll have to accept mixed-up rules such as:

  1. y --> i / _
  2. au --> øy / _

A similar problem comes up with ‹æ› to /ai/. What do you think? Would the simplest solution not be to play with the order (painstaking I know, but it seems to be the most obvious way to solve this issue)?
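
For what it's worth, the uppercase cheat could look roughly like this (plain regex substitutions, with only the two rules from this example):

    import re

    def transcribe(word):
        """Uppercase = orthography still to be processed, lowercase = IPA output."""
        s = word.upper()
        s = re.sub("AU", "øy", s)   # digraph rule can't accidentally match IPA output
        s = re.sub("Æ", "ai", s)
        s = re.sub("Y", "i", s)     # the lowercase y inside øy is safe from this rule
        # ... remaining single-segment and phonological rules ...
        return s

With this scheme the ‹y› -> /i/ rule could come before or after the ‹au› rule, since it only ever matches the uppercase Y of the orthography. Playing with the order instead would also work; it's just more bookkeeping.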

Another example is the special diphthong /ʏi/ which is formed as follows (consider the word "hugi"):

u --> ʏi / _j

This of course means that we must apply this rule after the transformation of the orthographic ‹g› to /j/ (triggered by the following ‹i›) and before we apply the ‹u› transformation rule. So the rules should be ordered as follows:

  1. g --> j / _i
  2. u --> ʏi / _j
  3. u --> ʏ / _{C, #}

Note of course that rule 3 will not apply to the "hugi" example.
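
For concreteness, here is how that ordering plays out on "hugi" if the rules are applied as plain ordered regex substitutions (the {C, #} context is crudely approximated as "not followed by a vowel"):

    import re

    # The three rules above, applied strictly in this order.
    ORDERED_RULES = [
        (r"g(?=i)", "j"),                     # 1. g -> j / _i
        (r"u(?=j)", "ʏi"),                    # 2. u -> ʏi / _j
        (r"u(?![aeiouáéíóúyýæö])", "ʏ"),      # 3. u -> ʏ / _{C, #} (rough approximation)
    ]

    def apply_ordered(word):
        for pattern, replacement in ORDERED_RULES:
            word = re.sub(pattern, replacement, word)
        return word

    print(apply_ordered("hugi"))   # hugi -> huji -> hʏiji; rule 3 finds no ‹u› left to rewrite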

bhallen commented 10 years ago

Drat, I really would like to avoid having to worry about non-phonological ordering issues, but you've made a good case.

What do you think about this alternative? We forgo the simplicity of just running a bunch of RE replaces on the string, and instead gradually build a transcription "tier" from the orthography while still maintaining access to the orthography. I've done something similar to this as a way to apply morphological rules in another project of mine ( https://github.com/kuzum99/sublexical ).

For example, from a word like ‹taut›, we would first initialize a length-four transcription, [None, None, None, None]. Then we would apply all digraph transcription rules, which would get us [None, ø, y, None]. We'd then apply other ("monograph"?) transcription rules, to end up at the final [tøyt].
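
Roughly, the tier-building could look like this (using ‹taut› as the word behind [tøyt]; the data structures and rule format are just one guess at how it could look):

    def init_tier(orthography):
        """One (initially empty) transcription slot per orthographic character."""
        return [None] * len(orthography)

    def apply_digraph_rules(orth, tier, rules):
        """rules maps an orthographic digraph to a pair of IPA symbols, e.g. {'au': ('ø', 'y')}."""
        for i in range(len(orth) - 1):
            if orth[i:i + 2] in rules and tier[i] is None and tier[i + 1] is None:
                tier[i], tier[i + 1] = rules[orth[i:i + 2]]
        return tier

    def apply_monograph_rules(orth, tier, rules):
        """Fill any remaining empty slots from single-character correspondences."""
        for i, ch in enumerate(orth):
            if tier[i] is None:
                tier[i] = rules.get(ch, ch)
        return tier

    orth = "taut"
    tier = init_tier(orth)                                       # [None, None, None, None]
    tier = apply_digraph_rules(orth, tier, {"au": ("ø", "y")})   # [None, 'ø', 'y', None]
    tier = apply_monograph_rules(orth, tier, {"t": "t"})         # ['t', 'ø', 'y', 't']
    print("".join(tier))                                         # tøyt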

Your "hugi" example really does seem to show that we need some rule ordering, though... good find! But I wonder if we might not be able to get the rules to "order themselves" for us. Here's what I'm imagining: we have no digraphs, so we do ortho->transcription conversion to get [hʏgi]. Then we run all the phonological rules on this intermediate transcription. Only your rule 1 activates, giving us [hʏji]. Then we run all the phonological rules on this, and it gives us the desired [hʏiji]. Is it true that this would work? The fact that you wrote u -> ʏ / _{C, #} rather than just u -> ʏ worries me somewhat...
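
Here is a minimal sketch of the "order themselves" idea: keep re-running the whole phonological rule set until a pass changes nothing. Note that for this to yield [hʏiji], the second rule has to be restated over the transcription (ʏ rather than u); the rules are deliberately listed in the "wrong" order to show that the iteration sorts it out:

    import re

    # The two rules from the "hugi" example, restated over the transcription
    # and deliberately listed in the "wrong" order.
    PHONO_RULES = [
        (r"ʏ(?=j)", "ʏi"),   # ʏ -> ʏi / _j  (can't apply until a j exists)
        (r"g(?=i)", "j"),    # g -> j / _i
    ]

    def run_to_fixed_point(form, rules, max_passes=10):
        """Re-apply the whole rule set until a full pass changes nothing."""
        for _ in range(max_passes):
            previous = form
            for pattern, replacement in rules:
                form = re.sub(pattern, replacement, form)
            if form == previous:
                break
        return form

    print(run_to_fixed_point("hʏgi", PHONO_RULES))
    # pass 1: hʏgi -> hʏji    pass 2: hʏji -> hʏiji    pass 3: no change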

I guess one really clever way we could do this would be as a sort of a machine learning exercise. We start by doing plain "monograph" to segment conversions, and then compare this against some tagged examples of good (whole word) orthography to transcription conversions. The algorithm would learn that some 2-segment and 3-segment sequences need to be converted in a special way, such that (hopefully) "hugi" would be converted using just 2 rules: h -> h and ugi -> ʏiji. But I guess that approach is beyond the scope of what we're trying to do here, and anyway there's no evidence that it would generalize to unseen sequences as well as the phonological rule-based approach.

a-martin commented 10 years ago

Re: machine learning. I had a similar thought last week but figured it was far beyond the scope of this project (and also unnecessary for a language like Icelandic, which has a relatively shallow orthography). I think this would be an interesting solution for languages like English (and certainly French), but I think we can get our result in a simpler way for the present project. Funny that you came up with that solution too!

An issue I see with your tiered idea is when digraphs correspond to single transcriptions (like ‹hr› becoming /r̥/). If we initialize the transcription tier based on the length of the orthography, the vector won't necessarily be the correct length. A word like ‹hraun› would be initialized as [ None, None, None, None, None ] but would need to end up as [ r̥ , ø , y , n ]. I guess this could be fixed by deleting an element in the vector, but I've never worked this way. Typically if I'm initializing a vector of a certain length, it's for memory reasons, and changing its length later on defeats the purpose of initializing it. Would that be a viable solution for us nonetheless?
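
One possible way around the length problem without ever deleting elements: keep one slot per orthographic character, let a slot hold zero, one, or several symbols, and only join at the end (a sketch with made-up helpers):

    def init_tier(orthography):
        return [None] * len(orthography)

    def apply_digraph_rule(orth, tier, digraph, output):
        """Write the whole output into the first slot and leave the second slot empty."""
        i = orth.find(digraph)
        if i != -1 and tier[i] is None and tier[i + 1] is None:
            tier[i], tier[i + 1] = output, ""
        return tier

    orth = "hraun"
    tier = init_tier(orth)                              # [None, None, None, None, None]
    tier = apply_digraph_rule(orth, tier, "hr", "r̥")    # ['r̥', '', None, None, None]
    tier = apply_digraph_rule(orth, tier, "au", "øy")   # ['r̥', '', 'øy', '', None]
    # fill the rest from single-character correspondences (here trivially n -> n)
    tier = [orth[i] if slot is None else slot for i, slot in enumerate(tier)]
    print("".join(tier))                                # r̥øyn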

If that's not an issue, then I do think the transcription tier is an interesting idea. I don't know how much time it will actually allow us to save in the end (I'm still not totally sure we'll be able to get out of fiddling with the rules), but I'm definitely willing to give it a try!

Your solution of running the rules multiple times should indeed work (at least in this case). As far as the ʏ rule is concerned, I wrote it that way because that's how it is in the Google Doc. It seems that ‹u› always transcribes to /ʏ/ unless it's in a digraph situation, and that /u/ is derived from ‹ú›, so I'm not sure why there's a restriction on the right-hand environment. Your solution thus seems like it should work.

If we take this tiered approach, I imagine we'll have some rules that apply from the orthography tier to the transcription tier and others that will apply directly to the transcription tier. We should probably think about trying to set those apart, right? How would this change the mapping file?
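
Purely as a strawman for the mapping-file question (the layout and labels below are invented, not something we've settled on): each rule could carry a field saying which tier its environment reads from, e.g.

    # source   target   environment   tier
    au         øy       _             orthography
    g          j        _i            orthography
    ʏ          ʏi       _j            transcription

Rules marked "orthography" would be your first kind (orthography tier to transcription tier), and rules marked "transcription" would be the purely phonological ones.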