Strict-Lax row assignment - Treatment of synonyms

nataliacp commented 8 years ago

The automatic alignment algorithm cannot tolerate tons of synonyms for one concept (this is a relative, not an absolute, issue, so this doesn't mean we have to have one entry per concept). However, in some cases (e.g. Tukano for bird) there are tons of entries in the strict row, when it is obvious that most of them are species-specific names. I think the easiest way to deal with this is to sort the importation template files by the FUN field, look at all the items with identical unified translations (TUE field) and sort them into strict and lax rows accordingly.
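A minimal sketch of that check in Python, in case it helps. It assumes the importation template can be exported as CSV with `FUN` and `TUE` columns; the sample rows, the `crowded_concepts` helper and the `max_items` cutoff are invented for illustration and are not part of the project tooling.

```python
import csv
import io
from collections import defaultdict


def crowded_concepts(rows, max_items):
    """Map each TUE value to its FUN values when it has more than max_items entries."""
    by_tue = defaultdict(list)
    for row in rows:
        by_tue[row["TUE"]].append(row["FUN"])
    # Sort by FUN so that entries sharing a unified translation can be read side by side.
    return {
        tue: sorted(funs)
        for tue, funs in by_tue.items()
        if len(funs) > max_items
    }


# Invented placeholder rows, not project data.
SAMPLE = """FUN,TUE
f01,BIRD
f02,BIRD
f03,BIRD
f04,BIRD
f05,FISH
f06,FISH
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
flagged = crowded_concepts(rows, max_items=3)
for tue, funs in flagged.items():
    print(tue, len(funs), funs)
```

The output is only a list of cells to look at by hand; deciding which items are generic and which are species-specific still needs someone who knows the language.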

LinguList commented 8 years ago

As a rule of thumb, I would ask all collaborators working on language data to restrict the use of synonyms in strict rows to 3.

thiagochacon commented 8 years ago

If that works, great! Or I can simply point out to you the form with the generic meaning for bird.

levmichael commented 8 years ago

Why would there be lots of specific species names in the BIRD row in the first place? Should they all be in the LAX row?

nataliacp commented 8 years ago

I think they should be in the lax row, but we found at least one case where they are all in strict. Apart from such cases, though, there are many meanings with more than 3 items in the strict row, many of which seem to be variants of each other. During the double-checking of the importation files it would be good to see if some of the entries could be consolidated (if the difference truly is only a variation in pronunciation).

amaliaskilton commented 8 years ago

I agree with this for the generic vs. specific biological terms and phonological variants issues, but I think a restriction to 3 synonyms would potentially cause either loss of data or stipulative judgments about semantic identity for some of the verb root rows. For example, in the Colombian Siona column of the BREAK VI row, I entered 4 items that mean something like "break" but have fairly generic glosses in the data source. I happen to know that all 4 items have cognates in the other WT languages, and I know what those cognates mean in Mai, but the data source gloss did not make me confident that the meanings were the same in C. Siona; thus they went in the most generic row just so they would be represented. What do you propose to do about cases like this?

nataliacp commented 8 years ago

Actually, I think that the big problem is multiple items that are potentially cognate to each other (i.e. they look very similar), so they align to themselves. @Mattis, can you confirm this? So, my understanding is that the 3 per meaning is a rule of thumb; it is not absolute. If you have to include more because there is no way to find the most basic, most common item, then we will have to work with that.

LinguList commented 8 years ago

If you insist on keeping more than two words per base meaning, I'll simply toss a coin and only consider two words for the first automatic cognate detection. There were just so many minimal variants in one of the files that it was impossible to find any signal in the data. Given that there is the possibility to define the "lax" rows, I'd prefer to make use of that feature and break it down to two words per meaning, since this will be optimal for the initial calculation of cognate sets within the same meaning. Later on, these words won't be lost, if not annotated, and can be assigned to the cross-semantic cognate sets.

If concepts turn out to be fuzzy, like, e.g., "break", I'd advise trying to specify them ("break in two"), since it is dangerous to be too generous about fuzzy semantic matchings. Not for the cognate assignment task, but for later use. Especially when we want to make use of semantic data, like clics, it is important to have concepts defined in a clean way, so that we can link them to the concepticon and from there to clics, where you'll find all good initial ideas regarding concepts that may contain cross-semantic concepts.

So, be strict, reduce, but when reducing, keep the things clear, so that they are not lost to us.
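To make the "toss a coin" fallback concrete, here is a rough sketch of what capping each cell at two words could look like. It is an illustration under assumptions, not the actual cognate detection code: the `cap_synonyms` helper, the `(language, concept, form)` tuple layout and the placeholder forms are all invented here.

```python
import random
from collections import defaultdict


def cap_synonyms(entries, limit=2, seed=0):
    """Split each (language, concept) cell into kept forms and forms set aside."""
    rng = random.Random(seed)  # fixed seed so the "coin toss" is repeatable
    cells = defaultdict(list)
    for language, concept, form in entries:
        cells[(language, concept)].append(form)

    kept, set_aside = {}, {}
    for cell, forms in cells.items():
        if len(forms) <= limit:
            chosen = set(range(len(forms)))
        else:
            # Pick `limit` positions at random; everything else is set aside.
            chosen = set(rng.sample(range(len(forms)), limit))
        kept[cell] = [f for i, f in enumerate(forms) if i in chosen]
        set_aside[cell] = [f for i, f in enumerate(forms) if i not in chosen]
    return kept, set_aside


# Invented placeholder entries, not project data.
ENTRIES = [
    ("LangA", "BREAK", "form1"),
    ("LangA", "BREAK", "form2"),
    ("LangA", "BREAK", "form3"),
    ("LangA", "BREAK", "form4"),
    ("LangB", "BREAK", "form5"),
]

kept, set_aside = cap_synonyms(ENTRIES)
print(kept)
print(set_aside)
```

The forms that are set aside stay in a separate dictionary rather than being deleted, so they can be brought back in after the first pass.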

levmichael commented 8 years ago

Conceptual fuzziness on our end is not in general the problem, as you'll see, for example, if you look at the set of BREAK-related meanings in our list (we have 9 BREAK-related concepts). Some of the sources we are working with are vague about certain meanings, however, and this is probably one of the major sources of putative synonymy. That's the situation that Amalia is describing. I'm not sure that there's anything principled that can be done about that kind of case.

But Mattis, you seem to be describing a different problem with the "numerous minimal variants" case. Could you say more about that one?

LinguList commented 8 years ago

We need this strictness for the workflow. This workflow differs a bit from classical cognate-coding: we want to align the data and we want to use automatic cognate detection to preparse the data, later to be corrected by the experts. Alignments help us to make very, very strict assessments regarding sound correspondences and their regularity. This is helpful, since experts are all humans, so they cannot count all instances of pairings in their head, and tools like reflex and edictor allow us to look at exactly those counts.

For alignments, and for automatic cognate judgments, I need a clean dataset with a limited set of comparanda in the beginning, since my experience shows that otherwise my algorithm has great difficulty identifying patterns.

Once this first analysis is done, we can go wild, but for this first analysis, it is important to have a very clean dataset, and I suppose that it will also be better for the initial correction phase of cognate assignments, since one reduces confusing information.

Afterwards, and that's why I said nothing is lost, one can include the remaining synonyms step by step, in case they are useful.

levmichael commented 8 years ago

Sure, I understand. I guess my question wasn't clear -- I was asking about what the case was of the "minimal variants" that you alluded to above. What meanings, what languages? (Really, I'm trying to understand what you mean by 'minimal variant'.)

LinguList commented 8 years ago

Oh, blame my mother tongue and the fact that it's late in the evening and I was on other things with my thoughts:

there were variants of obviously the same word which minimally varied from each other.

I know this problem from Chinese dialects, where datasets show some 5 words for "sun", most of them related, sometimes with slight morphological differences, etc. (e.g., "tai-jaŋ", "tai-jaŋ-fu", "tai-jaŋ-ba-ba", etc.) In these cases of obvious cognacy of variants of the same word in the same language, algorithms get hopelessly lost, since the variants have strong similarity within their language, but all of them have further shades of similarity to words in other languages. In the end, it gets a messy blur, at least for my poor algorithm, which then looks as if it is a complete idiot...
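One quick way to spot such cells before running anything serious is a plain string-similarity pass over the forms in a single cell. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in for a proper phonetic alignment; the `near_variant_pairs` helper, the 0.6 threshold and the unrelated placeholder form "mon" are assumptions made for this example only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_variant_pairs(forms, threshold=0.6):
    """Return pairs of forms whose character-level similarity is at least threshold."""
    pairs = []
    for a, b in combinations(forms, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((a, b))
    return pairs


# The three "sun" variants quoted above, plus one invented unrelated form.
cell = ["tai-jaŋ", "tai-jaŋ-fu", "tai-jaŋ-ba-ba", "mon"]

pairs = near_variant_pairs(cell)
for a, b in pairs:
    print(a, "~", b)
```

A character-level ratio knows nothing about sound classes or morphology, so this is only useful for flagging cells worth a manual look, not for deciding cognacy.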

levmichael commented 8 years ago

OK, I see what you mean. Well, for what it is worth, I think that the vast majority of cases where we have multiple items in a cell are not of the Chinese-sun type you sketch out above. In the Tukanoan case the roots are generally quite different in shape, and it's lack of semantic detail in the sources that is largely responsible for the multiplicity of items. We'll have a look, though, and make sure that there isn't anything that we can clean up at this stage (e.g. specific species names in the strict row for a concept that is actually more generic). We'll have to see, though.

nataliacp commented 8 years ago

Keep in mind that these first observations were made with 3 languages: Maihiki, Tukano and Karapana. In those, and especially in Tukano, there were tons of these close variants in the same cell. Maybe they represent different sources? I can't remember if that was the case. But even then it shouldn't be so, since we would expect the quasi-phonemic forms to be the same even from different sources. I think it is easier if you or Thiago have a look at Tukano and see what it looks like within each cell.

levmichael commented 8 years ago

Ah, OK, I'll have a look at Tukano, and see what the issues are, just so that I'm better informed. (I imagine that Thiago will want to as well.) Could I ask what Tukano list I should look at? The one on the Tukanoan 740 comparative list Google spreadsheet? (We're making progress on data centralization, but we're not there yet...)

nataliacp commented 8 years ago

Yes, that's the only Tukano list we have (it's one of the languages that we have only partial sources for).

gomezimb commented 8 years ago

Sorry I'm arriving late to the discussion. I agree with Lev that we must represent morae /ee/ etc. Barred u has to become barred i. I can have a look at the TUK & KAR data. KAR has no tones, but I can help with representing TUK tones. Elsa

levmichael commented 8 years ago

Yes, I was wondering that about the barred-u myself. So, adjustments like these will be handled through the symbol equivalence tables that Mattis has asked for. We've begun on those, and will share them with the Tukanoanists for collaborative work once we've made a little bit more progress.

nataliacp commented 8 years ago

Just a reminder to please keep discussions distinct. This issue is about synonyms; there is a separate one for the representation of vowel sequences. This just makes it easier to keep all relevant information in one place and, once we have reached a decision, to close the issue without having forgotten stuff elsewhere.

levmichael commented 8 years ago

Ah, very good!

levmichael commented 8 years ago

OK, I've just spent a while looking at the Tukano list, and I agree that there is a lot that should be moved from strict rows to lax rows. Most of these are not phonological variants or cases in which different morphology attaches to a single relevant root (although there are cases like this), but cases in which forms with meanings close-ish to the target concept are included in the cell along with a form that is an exact hit on the target concept. The close-ish forms should, of course, be moved to the lax rows.

My question is how we should proceed. First, there is the question of who should do this. Thiago, what do you think? I'm concerned that your Australian trip will make it difficult for you to attend to this right now. I'm happy to do everything I can and discuss the difficult cases with you, if that would work better. Let us know what you think.

Second, there is the question of where this should be done. I infer that it should be on the 740 list itself -- is that correct, Natalia?

nataliacp commented 8 years ago

It could be done in either place, the 740 spreadsheet or the importation template. In any case all the unified translations for the lax items will have to be changed in the importation template, so maybe it is easier to do it there. As long as Tukano is otherwise ready, I think it may be better to deal with other languages and return to this after Seb has parsed it.

levmichael commented 8 years ago

OK, well, let's see what Thiago says, but I think it might be easiest to work in the 740 spreadsheet, since it's easy for us both to access that. I agree with what you suggest, though, of working on other languages and then coming back to Tukano.

thiagochacon commented 8 years ago

Lev, I arrived last night and will have more time to work now. Since you have inspected the data already, you might want to start on it already. After working on Karapana and Kubeo I could join you and we could split the work. How about that? I also agree we should make the changes in the importation template.

levmichael commented 8 years ago

OK, I'll get started on the Tukano strict-lax sorting, and once you're done with Karapana and Kubeo, we can see where things stand.

nataliacp commented 8 years ago

Just a clarification on LAX rows. We need those rows to be noted in a uniform manner so we can treat them automatically. Seb already corrected some of them, where -LAX was not at the very end of the gloss, or it was in the wrong column. Please, if you add new lax rows, be careful to add the -LAX tag

1. in the EN_GLOSS column only, and
2. at the very end of the gloss.
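If it is useful, both conditions can be checked mechanically before a file is handed over. This is a small sketch only: it assumes each row is available as a dictionary from column name to text, the `lax_tag_problems` helper is invented here, and `ES_GLOSS` is a made-up second column used purely to show the wrong-column case.

```python
TAG = "-LAX"


def lax_tag_problems(row):
    """Return a list of problems with -LAX tagging in one row (column name -> text)."""
    problems = []
    for column, value in row.items():
        if TAG not in value:
            continue
        if column != "EN_GLOSS":
            problems.append(f"{TAG} found in column {column}")
        elif not value.endswith(TAG):
            # Strict check: even trailing spaces after the tag are reported.
            problems.append(f"{TAG} is not at the very end of EN_GLOSS")
    return problems


# Invented example rows; ES_GLOSS is a hypothetical column name.
good_row = {"EN_GLOSS": "BIRD-LAX", "ES_GLOSS": "pájaro"}
misplaced_row = {"EN_GLOSS": "BIRD-LAX (generic)", "ES_GLOSS": "pájaro"}
wrong_column_row = {"EN_GLOSS": "BIRD", "ES_GLOSS": "pájaro-LAX"}

for row in (good_row, misplaced_row, wrong_column_row):
    print(row["EN_GLOSS"], lax_tag_problems(row))
```

Rows that come back with an empty list satisfy both conditions; anything else needs a manual fix.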
