digling / tukano-project

Repository for the Tukano project (discussions and automatic data analyses)
GNU General Public License v3.0

treatment of wordlist #27

Open nataliacp opened 8 years ago

nataliacp commented 8 years ago

Seb and I discussed recently how we are going to treat the 740 wordlist in reflex. Coupled with this are issues of "fake polysemy" that came up during a recent conversation I had with Amalia and Lev. (by fake polysemy, I mean that while a word is not polysemous from an emic perspective, it seems polysemous because it is present in two places in the wordlist).

So, the solution is the following: You can now give the unified translation you want based on an emic perspective, i.e. it is not necessary to give identical unified translations to all words linked to the same wordlist item. The link to the wordlist item will be done through numbers in the id_word field, a new field added to the reflex importation template. One entry can be attached to multiple wordlist items, by adding the corresponding numbers to the id_word field.

If you look at the 740 file, you will see that we added a column at the very left with an id_word number for each row. Use this number to attach words to the wordlist items (of course this applies only to strict rows, please ignore numbers corresponding to lax rows).
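The attachment mechanism described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual reflex import format: the field names (`form`, `translation`, `id_word`) and the convention of separating multiple ids with spaces are assumptions.

```python
# Hypothetical sketch: attaching dictionary entries to wordlist items
# via an id_word field that may hold several numbers.
# Field names and the space-separated id convention are illustrative,
# not the actual reflex importation template.

def parse_id_word(field):
    """Split an id_word field like '12 45' into a list of wordlist ids."""
    return [int(tok) for tok in field.split()]

# Invented example entries: one attached to two wordlist items, one to a single item.
entries = [
    {"form": "baba", "translation": "upper limb", "id_word": "99 384"},
    {"form": "kumu", "translation": "shaman", "id_word": "210"},
]

for entry in entries:
    for wid in parse_id_word(entry["id_word"]):
        print(f"{entry['form']} -> wordlist item {wid}")
```

With this scheme, an entry's unified translation stays emic while its wordlist links are carried entirely by the numbers in `id_word`.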

As for the alignment, the wordlist items are going to be used as the translation field there, so Mattis's algorithm can match them up based on the translation.

@amaliaskilton, you can now continue the corrections on Maihiki by copying the id_word number when the attachment to the wordlist is ambiguous.

amaliaskilton commented 8 years ago

Thanks Natalia, I will implement this for the fake-polysemous words in the Mai source.


thiagochacon commented 8 years ago

I was already giving the polysemous words the treatment you first suggested. I did so for Karapana and for Kubeo. Do I have to change things?

As for Kubeo, I am waiting to send the list until we have an agreement on morpheme boundaries and tones. Currently this is the situation regarding each issue:

levmichael commented 8 years ago

Thanks very much to you and Seb for getting this figured out, @nataliacp.

nataliacp commented 8 years ago

@thiagochacon, this arrangement is for cases of fake polysemy, not real polysemy. Amalia and I discussed that there are cases in Maihiki where words are not polysemous under an emic analysis, but there are still two rows in the 740 list where the word would go. The solution above is for such cases, to avoid imposing polysemy on the emic analysis. So, if there are cases in Karapana and Kubeo where you need to change things, feel free to do so.

thiagochacon commented 8 years ago

Looking at the Kubeo and Karapana wordlists, most cases of polysemy seem like what you called "fake polysemy". So they are not really special in the sources I have worked on, but actually the default case. I think you mean things like baba "hand, arm"?

nataliacp commented 8 years ago

From what I understand, distinguishing real polysemy from "fake" is not always very easy. There are semantic tests for that. I think this is ultimately up to you (or to the author of the source, if it is not your own fieldnotes). If you think that most cases are of the fake kind, then you can collapse those "polysemous" entries into one row, give them a hand or arm (or maybe upper limb) unified translation, and then put the two numbers of the wordlist rows (for hand and arm) in the id_word field (you can create a new column for that). Does this make sense?
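The collapsing step described above can be sketched as follows. This is a minimal illustration under assumed field names; the data, the unified translation "upper limb", and the wordlist ids 101/102 are invented for the example.

```python
# Hypothetical sketch of collapsing two "fake polysemous" rows
# (same form, one emic meaning) into a single entry whose id_word
# field lists both wordlist items. Field names and data are invented.

from collections import defaultdict

rows = [
    {"form": "baba", "gloss": "hand", "id_word": "101"},
    {"form": "baba", "gloss": "arm",  "id_word": "102"},
]

# Group rows by form, accumulating their wordlist ids.
merged = defaultdict(list)
for row in rows:
    merged[row["form"]].append(row["id_word"])

# One collapsed row per form, with a single emic unified translation
# and both wordlist ids in the id_word field.
collapsed = [
    {"form": form, "translation": "upper limb", "id_word": " ".join(ids)}
    for form, ids in merged.items()
]
print(collapsed)
```

The key point is that the emic analysis lives in the single `translation` value, while the "hand" and "arm" wordlist rows remain reachable through the two numbers in `id_word`.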

gomezimb commented 8 years ago

For Karapana, the absence of tone marking can suggest polysemy for tonal minimal pairs: N°99 $rui$ ('sit down') and N°384 $rui$ ('go down') are $rùí$ and $rúì$ in TAT; KAR and TAT are very close.

LinguList commented 8 years ago

Just quickly, regarding what @nataliacp said:

from what I understand, distinguishing real polysemy from "fake" is not always very easy.

I wouldn't trust any of these judgments, since the boundary is fluid here and you can never really tell. What you can tell, however, is whether a word is identical in sound to another one; this is what François calls "colexification", and checking this cross-linguistically can help identify good polysemy candidates, as in our database of cross-linguistic colexifications. And once the data is assembled, I suggest simply having the software run over it and look for colexifications, also to identify first hints of semantic shift (but that is a task for the future).
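The colexification check described here reduces to a simple grouping operation: within one language, group entries by identical form and flag forms covering more than one concept. A minimal sketch with invented data (not the actual software or database mentioned above):

```python
# Minimal sketch of detecting colexification candidates: group entries
# of one language by identical form and flag forms that are attached to
# more than one wordlist concept. Data are invented for illustration.

from collections import defaultdict

entries = [
    ("rui", "sit down"),
    ("rui", "go down"),
    ("kumu", "shaman"),
]

by_form = defaultdict(set)
for form, concept in entries:
    by_form[form].add(concept)

# Forms linked to more than one concept are colexification candidates.
candidates = {f: sorted(c) for f, c in by_form.items() if len(c) > 1}
print(candidates)  # {'rui': ['go down', 'sit down']}
```

Note that, as the Karapana example above shows, such a purely form-based check can also surface false candidates when tone is not marked in the source.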