Morpheme boundaries - Githubissues

thiagochacon commented 8 years ago

This is to summarize the morpheme boundary symbols we are using in the data and how we should use them in terms of field in the database. Currently I guess most of the language that have morpheme boundaries have it in PHM phonological representation and most languages do not have it in FUN. Is this correct? I wonder if having in both fields or just one is an issue for database consistency and if we should make a decision about that.

The symbols I think we are using are:

affix = clitic
compound (some languages currently have "blank space" for compounds, other have both roots together)

thiagochacon commented 8 years ago

anything decided on this issue after I left?

nataliacp commented 8 years ago

no, nothing was decided on this issue. From what I understand, Mattis's algorithm doesn't tolerate (at least as is) any dashes and things like that. @LinguList can you confirm this? We could have dashes and other symbols in various fields in Reflex, and we could remove them automatically from the FUN field before it is treated by the automated alignment algorithm. (here, I would like to clarify, that although this kind of treatment is possible, it puts a burden on Seb to make custom scripts to export while removing symbols and then add the symbols back to import. So, I think we should minimize this kind of stuff to the minimum necessary for the project). I think that there are two main questions at the heart of this:

What do we want the boundaries noted for? (is it just information we want to have somewhere, or we think that they have to be represented in every representation? Are they central for the comparative aspect of the project or not?)
How many different kinds of boundaries do we want to differentiate? (from the list above, it is not clear to me if affixes and clitics have been noted the same way or not). then, we can talk about the practical issues: i.e. what are the symbols for each kind of boundary, which fields they should be present in, and if we are going to include them in FUN or not.

thiagochacon commented 8 years ago

Hi natalia I think my last message got my symbols converted by github I meant Hyphen for affixesEqual sign for cliticsAnd plus sign for compounds I think we should keep the FUN forms with the minimal information necessary. So agree on not having boundary symbols The same would go with tones for me

Enviado do meu smartphone Samsung Galaxy.

-------- Mensagem original --------

De: Natalia Chousou-Polydouri notifications@github.com

Data: 25/02/2016 12:20 (GMT-05:00)

Para: digling/tukano-project tukano-project@noreply.github.com

Cc: thiagochacon thiago_chacon@hotmail.com

Assunto: Re: [tukano-project] morpheme boundaries (#23)

no, nothing was decided on this issue. From what I understand, Mattis's algorithm doesn't tolerate (at least as is) any dashes and things like that. @LinguList can you confirm this? We could have dashes and other symbols in various fields in Reflex, and we could remove them automatically from the FUN field before it is treated by the automated alignment algorithm. (here, I would like to clarify, that although this kind of treatment is possible, it puts a burden on Seb to make custom scripts to export while removing symbols and then add the symbols back to import. So, I think we should minimize this kind of stuff to the minimum necessary for the project). I think that there are two main questions at the heart of this:

What do we want the boundaries noted for? (is it just information we want to have somewhere, or we think that they have to be represented in every representation? Are they central for the comparative aspect of the project or not?)
How many different kinds of boundaries do we want to differentiate? (from the list above, it is not clear to me if affixes and clitics have been noted the same way or not). then, we can talk about the practical issues: i.e. what are the symbols for each kind of boundary, which fields they should be present in, and if we are going to include them in FUN or not.

Reply to this email directly or view it on GitHub: https://github.com/digling/tukano-project/issues/23#issuecomment-188887474

LinguList commented 8 years ago

One more thing on this: one of the major improvements of my code is that it explicitely handles morpheme boundaries (even partial cognate detection is possible right now). But it is important to keep in mind that missing morpheme boundaries will prevent the algorithm from matching elements with boundaries with elements without boundaries.

So if you code "h a n d s u" one time, and "h a n d + s u" another time, the algorithm will align 'h a n d s u' only with either 'h a n d' or 's u', which is consequent and important for the whole philosophy behind it. So you'll have to correct the missing morpheme boundaries in the alignments manually later.

LinguList commented 8 years ago

Update here, @thiagochacon, please submit concrete examples for all symbols you use for distinction of morphemes, like affixes, and the like. I'm actually willing to adapt LingPy's code, since I understand that a simple segmentation of a word into morphemes is not enough, and for future work also on CLLD, it is important that we discuss these things.

If dashes are used consistently, it is furthermore no problem for me to adapt the code and to switch to new symbols for gaps in alignments. But what I need for this is to have clear-cut examples, so that I can draw a consistent "typology" out of it...

LinguList commented 8 years ago

As a quick online specification, where the current practice of LingPy is discussed, those who are interested can have a look here.

amaliaskilton commented 8 years ago

What do we want the boundaries noted for? (is it just information we want to have somewhere, or we think that they have to be represented in every representation? Are they central for the comparative aspect of the project or not?)

Boundaries are needed for three reasons.

(a) Several Tuk languages have phonological processes which are conditioned by morph boundaries. For example, in Western Tukanoan, allophony between creaky d, modal d, and r is in part conditioned by whether the segment is next to a morph boundary. Likewise, in Eastern Tukanoan, morph boundaries sometimes stop nasal harmony from applying when it would apply in the same (tautomorphemic) segmental environment.

(b) Due to (a), there are sound changes in the lgs (at least in Western Tukanoan) that prima facie appear to be conditioned by the presence/absence of morph boundaries.

(c) Tuk languages have rampant compounding and reanalysis of morphologically complex forms as monomorphemic. Representing morpheme boundaries will make it much easier to see these processes at work.

These do not mean that we need to have morph boundaries at every level of representation, just that they need to be visible somehow as we construct cognate sets. My personal inclination would be that the boundary symbols should be present in the phonemic form and not present in FUN (and this is OK in large part because the phonological processes that the boundaries induce/block will be visible in FUN).

How many different kinds of boundaries do we want to differentiate? (from the list above, it is not clear to me if affixes and clitics have been noted the same way or not). then, we can talk about the practical issues: i.e. what are the symbols for each kind of boundary, which fields they should be present in, and if we are going to include them in FUN or not.

I think the boundaries we need to distinguish are: prefix-root, root-suffix, and root-root in compounds. I am not aware of any Tuk language where it is crucial to distinguish affixes and clitics, but if there is one @gomezimb will know what it is.

LinguList commented 8 years ago

Okay, thanks, so have you already done the annotation in the data, regarding the boundaries? I don't remember to have seen them in what I looked at so far.

amaliaskilton commented 8 years ago

The data that I entered for Mai, Tatuyo, and Barasana has the boundaries. The other languages in which I have a hand (Koreguaje, Colombian Siona, Tuyuca) do not. This is in part bc the sources sometimes do not have enough info about morphology to segment the forms they give.

On Thu, Feb 25, 2016 at 2:09 PM, Johann-Mattis List < notifications@github.com> wrote:

Okay, thanks, so have you already done the annotation in the data, regarding the boundaries? I don't remember to have seen them in what I looked at so far.

— Reply to this email directly or view it on GitHub https://github.com/digling/tukano-project/issues/23#issuecomment-189007375 .

LinguList commented 8 years ago

Sorry again for bothering, @amaliaskilton , do you have the boundaries in different forms, or do you use the same symbol regardless of the relation of morphemes (root, suffix, etc.)? And if you use different symbols, which ones do you use?

amaliaskilton commented 8 years ago

In those lgs I used a dash for all of root-prefix, root-suffix, and root-root. In addition, in the comments field for each morph complex word, I wrote 'complex' or 'compound' and then the morphological analysis of the word.

On Thu, Feb 25, 2016 at 11:53 PM, Johann-Mattis List < notifications@github.com> wrote:

Sorry again for bothering, @amaliaskilton https://github.com/amaliaskilton , do you have the boundaries in different forms, or do you use the same symbol regardless of the relation of morphemes (root, suffix, etc.)? And if you use different symbols, which ones do you use?

— Reply to this email directly or view it on GitHub https://github.com/digling/tukano-project/issues/23#issuecomment-189152149 .

thiagochacon commented 8 years ago

I think we must have boundary symbols listed for every language and try to normalize things when importing it.

I have seena distinction between affixes and clitics in the morphophonology of Kubeo and Kotiria. Indeed, very few people explore this boundary issue in the Tukanoan languages, but at least for Kubeo it is the distinction between affix - clitic - compound is quite important for explaining many phonological rules.

Affix nɨ̃-wɨ̃ go-non.3rd.animate 'I/we/you went'
clitic hẽmẽ=bo paca=cl.round.big 'a paca (mammal sp.)
compound ɨ̃rẽ+koro pupunha+liquid 'pupunha juice'

nasality tells the difference between an affix and a clitic: in the first example the verb stem nasalized the suffix; in the second example the noun stem fails to do so to the clitic. The difference between a clitic and a root (as in a compound structure) usually can be seen in some segmental rules (e.g. d > r is obligatory in clitica, but optional in compounds) or suprasegmentals (a root in a compound can keep its original tones under special circumstances, as well a secondary stress).

LinguList commented 8 years ago

Thanks for the examples, @thiagochacon. Compunds are pretty simple, I guess, and a "+" seems to be in order as a symbol. With the other two, I would be concerned if we encounter infixes in addition to affixes (surely not for TK, but since the representation should ideally be generalizable within CLPA, we need to think about this).

Here, we face ambiguities with prefixe vs. suffix vs. infix. Already in a string like

nɨ-wɨ

I can't tell whether nɨ is prefix or wɨ is affix.

For the computer, it would be easiest to use starting and end markers, like:

n ɨ → w ɨ

meaning that n ɨ is the prefix, and

n ɨ ← w ɨ

the opposite. But this is just an example, no proposed solution of how to handle it in the end. I just hope you see the problem of ambiguity here, and it applies to both affixes and clitics. I can't think of alternative solutions that would avoid given the "direction" (read, e.g., the → as "attaches to"). And it would also allow us to handle infixes in other languages:

n ɨ ← b a → w ɨ

I know that the people from the GLD project have their own system of distinguishing affixes and the like, and it involves dashes and the "=" character. I could ask them what they do in these cases.

levmichael commented 8 years ago

@thiagochacon, I like your boundary symbols. I'm very interested to hear what Amalia thinks, though, since any decisions we make will have significant implications for the data she has and will be processing.

levmichael commented 8 years ago

@LinguList, I appreciate the point you're making re: CLPA, but I must admit that what I want is an economical solution for Tukanoan, and I want it soon, so that the work can resume. I'm not trying to be obnoxious, and I don't mean to suggest in any way that CLPA-oriented issues aren't important, but I think that we're in danger of getting very bogged down in issues that have very little to do directly with our (Tukanoan) project goals.

amaliaskilton commented 8 years ago

@thiagochacon that's very interesting data. @gomezimb and I talked about a similar phenomenon in Tatuyo which Elsa attributes to the prosodic size of the elements rather than their position on the prosodic hierarchy.

I approve of these boundary symbols too, with three caveats -

1) We should distinguish different boundary symbols for prefixes and suffixes. This is necessary because there are a handful of languages that have both prefixes and suffixes for the same word class (minimally, Mai and Tatuyo). If we have the same boundary symbol for prefix and suffix, a string XX-YY-ZZ will be ambiguous between a parse prefix-root-suffix and a parse root-suffix-suffix (although providing good info in the comments field about the morphological composition will go a long way toward preventing ambiguity of that type).

2) In the lgs I am familiar with, there are many, many stems which (considering only phonological evidence) could be reasonably analyzed as either root-affix complex stems or root+root compounds. For languages we are personally familiar with, we can use complex phonological evidence, syntactic evidence, etc to pick between the candidate analyses. But for the languages which we don't know a lot about - Koreguaje is one such case for me - it is going to be tough to make informed decisions on these cases.

For cases of unclarity, then, I think we should adopt a global policy about which way to err w/r/t annotating the unclear items as root+root compounds or as root-affix complex. My inclination would be to make the default analysis root+root.

3) Complex forms of all kinds should be annotated with a morphological analysis in the comments field, when data permits. It looks like @thiagochacon has done this for Kubeo already, but I think it bears making official.

@levmichael For the data I have entered, it should be fairly simple to update boundary symbols to these conventions - compounds are already tagged with "compound" in the comments field, and there are not very many forms with prefixes.

On Sun, Feb 28, 2016 at 3:57 PM, levmichael notifications@github.com wrote:

@thiagochacon https://github.com/thiagochacon, I like your boundary symbols. I'm very interested to hear what Amalia thinks, though, since any decisions we make will have significant implications for the data she has and will be processing.

— Reply to this email directly or view it on GitHub https://github.com/digling/tukano-project/issues/23#issuecomment-189972755 .

LinguList commented 8 years ago

@levmichael, I understand your point and that you think more with respect to TK, but the help that my algorithms provide and the work I invest in pretty dumb tasks like parsing through files, are also things to speed up your project. So if I spend my time in discussions, coding, and thinking, I hope you understand that I need some payoff for my own work (and I have my own official research project to pursue at the moment). And what I insist upon is that the things are on-line with other tools and formats, and that they are also generalizable and abstractable to other cases. We don't need yet another idiosyncratic database that comes along with inconsistencies and ad-hoc decisions.

And please don't forget the payoff of tools like CLPA, Glottolog, and Concepticon: If your data conform to their standards, you can profit from the additional information they provide, and there are many of them, don't underestimate this aspect.

nataliacp commented 8 years ago

Hello everybody. Sorry I am getting back to this only now. I think we are running into the danger of over-complexifying our lives with so many conventions. From my TG experience, as Amalia also said, it may be often difficult to tell if it is a root-root or a root-affix complex form. Also, what Amalia said about bimorphemic words being reanalyzed as monomorphemic in some languages makes me very worried about putting all sorts of dashes in the FUN forms (because it will create forms with dashes cognate at their whole length with forms with no dashes which cannot be handled automatically). I think interapplicability of tools means to be simple and adaptable, not a priori trying to make tons of distinctions that may not be relevant or desirable for all applications. If we want to use morpheme boundaries then a dash is enough for all cases in my opinion. The various distinctions in affixes, clitics and roots are good for synchony but much of that breaks down in diachrony. What in one language is a clitic it may be an affix in another. So, my proposal is the following: all kinds of boundaries (if they are important for the language in question and distinguishable for the language in question) can go in say the PHM form or another form field, but not in the FUN. In the FUN, if any boundaries are put they should be only dashes. Then, in Reflex, one can manually change some of them to clitics iwth a = or to some other symbol if necessary.

If we decide to add dashes to the FUN, there is one remaining issue which is that we should put the boundary in the FUN even in the cases where the form has been reanalyzed as monomorphemic. By the way, @LinguList what happens if a morpheme boundary is not recognized say by mistake? Would the words be partially aligned with the rest hanging or not? If yes, then it's no big deal, we can correct the situation manually afterwards by adding a dash and aligning the rest.

LinguList commented 8 years ago

LingPy aligns all things you give it. If you add morpheme boundaries, however, lingpy has the practice of NEVER aligning one morpheme with two morphemes in another language. This will lead to things spreading then, if morpheme boundaries are inconsistently marked. It may be an advantage, since in many cases, morpheme boundaries can only be detected in historical comparison, so they will need to be re-introduced later, when scholars check alignments and see that these cases happen.

I also agree that overcomplicating things is not needed at this stage, especially since knowledge reg. the languages seems to be diverse, which complicates things even more.

BTW, all interested in the specifics of LingPy's alignment, may want to have a look at this short paper, where the issue of morpheme-sensitive alignment is explained:

https://zenodo.org/record/12242

thiagochacon commented 8 years ago

Things are already too complex, but morpheme boundaries at least in the PHM field, would not overcomplicate things. Here is a second proposal Prefix -> Suffix <- Circumfix <-> Proclitic => Enclitic <= Compound +

I also agree with @amaliaskilton that when in doubt about the phonological status of a morpheme (free or bound form), we should go with the least bound one. hence, if we think it could be a root, just go with it. this makes sense in the phonological dependency cline.

thiagochacon commented 8 years ago

As for the CLPA issue raised above, I think we should keep up with our collaborations here, both among Tukanologists (despite how much we disagree), Reflex and CLPA and things alike.

We are already way behind any schedule, so I think nothing can harm us anymore on this. Lets proceede with calm now, have a nice dataset and since we are benefiting from existent computational resources, nothing more justifiable then help improving them too.

LinguList commented 8 years ago

Hey @thiagochacon, I actually think that this is a nice proposal with the ->, <-, => etc., and I think this could be included in CLPA. We should discuss it also over there (once we have time).

levmichael commented 8 years ago

Let's see if we can reach a conclusion here so that wordlist work can resume. @amaliaskilton and @nataliacp, could you weigh in on @thiagochacon's boundary symbol suggestions?

amaliaskilton commented 8 years ago

I am happy with the new boundary symbols proposed by @thiagochacon - these distinctions shd be enough to represent all of the morpheme categories relevant for Tukanoan languages.

That said, @LinguList, I would have great reservations about trying to define a list of morph categories intended to be applicable for all languages or even all S American languages. That is a deeply theoretical task - not a descriptive one - and one where phonologists/morphologists with different theoretical perspectives would probably disagree enormously about what categories to include. For example, in generative phon/morph there is a debate about whether affixes are actually specified in the UR for their position relative to the morphological base; this debate is due to the existence in some lgs of variable affix order and mobile affixes, but it bears on the way we construct URs for complex forms in all lgs. For that reason I would hesitate to try to construct a general vocabulary of morph types based on the vocabulary needed for a lg family specific project like this one.

On Saturday, March 5, 2016, levmichael notifications@github.com wrote:

Let's see if we can reach a conclusion here so that wordlist work can resume. @amaliaskilton https://github.com/amaliaskilton and @nataliacp https://github.com/nataliacp, could you weigh in on @thiagochacon https://github.com/thiagochacon's boundary symbol suggestions?

— Reply to this email directly or view it on GitHub https://github.com/digling/tukano-project/issues/23#issuecomment-192699170 .

LinguList commented 8 years ago

I agree, @amaliaskilton, that it is difficult to define things cross-linguistically, but this is not a reason not to try it! You see, if we want to go ahead with our science, we need to think of cross-linguistic representations, albeit at a minimal level which may disappoint people working on specific languages. But our experience with IPA and also with concept lists shows that we cannot advance by just keeping to point that things are just language-specific. Phoible has almost 2000 distinct sounds apparently being relevant for the languages in the world, and in the concepticon, we find similar tendencies for concept lists being "relevant" for the languages under investigation, and I'm damn sure that this is not the reality, but just the whims of people who deem their respective languages more special than the rest of the languages in the world. We need to try and standardize things and make sure that we find some way to represent data of one particular language family in a way that typologists and linguists from other language families have a chance of grasping the core problems without digging too deep into the literature. Of course, there will always be language-specific issues, but what @thiagochacon proposed for Tukano in this case seems also to make sense for much of the other data I was dealing with so far. Classical linguistics has a huge problem with not trying to standardize due to language-specific issues, and on the longer run, to not ask certain questions. But what will happen on the longer run, now, with all the digital data out there, is that non-linguists (especially biologists, not you @nataliacp, since you know much more about linguistics than many linguists ) who are not afraid of asking questions and who just want to standardize things will take the lead and produce papers, standards, and things about which the linguistic world is then complaining. It would be better if the classical linguists from the field tried to make an effort on making things comparable cross-linguistically instead of complaining later about the bad quality of superimposed standards and the like. This is really one thing we should all learn from the biologists, namely, that we should keep asking questions and trying to find solutions that apply to different domains, and that we should not ignore certain questions or not make certain attempts, since we think it's impossible from the start. We need more optimism in linguistics, and I think distinguishing prefixes and suffixes with -> and <-, etc. is already a first step, and if it turns out that we cannot explain certain languages truthfully with this, we will work on it.

amaliaskilton commented 8 years ago

The IPA is a highly theoretical device too. But crucially, it is governed by phoneticians and based on findings in phonetics (and therefore it changes with new research findings - Ladefoged famously added dozens of new symbols to the IPA). Linguists sometimes disagree about the utility of IPA symbols, but they accept the alphabet because it represents a consensus in the field of phonetics. This would not be the case for any task specific list of linguistic categories.

On Saturday, March 5, 2016, Johann-Mattis List notifications@github.com wrote:

I agree, @amaliaskilton https://github.com/amaliaskilton, that it is difficult to define things cross-linguistically, but this is not a reason not to try it! You see, if we want to go ahead with our science, we need to think of cross-linguistic representations, albeit at a minimal level which may disappoint people working on specific languages. But our experience with IPA and also with concept lists shows that we cannot advance by just keeping to point that things are just language-specific. Phoible http://phoible.org has almost 2000 distinct sounds apparently being relevant for the languages in the world, and in the concepticon https://github.com/clld/concepticon-data, we find similar tendencies for concept lists being "relevant" for the languages under investigation, and I'm damn sure that this is not the reality, but just the whims of people who deem their respective languages more special than the rest of ! the langu ages in the world. We need to try and standardize things and make sure that we find some way to represent data of one particular language family in a way that typologists and linguists from other language families have a chance of grasping the core problems without digging too deep into the literature. Of course, there will always be language-specific issues, but what @thiagochacon https://github.com/thiagochacon proposed for Tukano in this case seems also to make sense for much of the other data I was dealing with so far. Classical linguistics has a huge problem with not trying to standardize due to language-specific issues, and on the longer run, to not ask certain questions. But what will happen on the longer run, now, with all the digital data out there, is that non-linguists (especially biologists, not you @nataliacp https://github.com/nataliacp, since you know much! more abo ut linguistics than many linguists ) who are not afraid of asking questions and who just want to standardize things will take the lead and produce papers, standards, and things about which the linguistic world is then complaining. It would be better if the classical linguists from the field tried to make an effort on making things comparable cross-linguistically instead of complaining later about the bad quality of superimposed standards and the like. This is really one thing we should all learn from the biologists, namely, that we should keep asking questions and trying to find solutions that apply to different domains, and that we should not ignore certain questions or not make certain attempts, since we think it's impossible from the start. We need more optimism in linguistics, and I think distinguishing prefixes and suffixes with -> and <-, etc. is already a first step, and if it turns out that we cannot explain certain languages truthfully with this, we will work on it.

— Reply to this email directly or view it on GitHub https://github.com/digling/tukano-project/issues/23#issuecomment-192733924 .

LinguList commented 8 years ago

well, in fact it is not based on phoneticians, but in the way it is produced, it is based on people that try to make clear why they need to deviate from the standard, so it is in the end by no means different from linguistic "categories". I don't see any true acceptance of IPA among linguists, given the data I have so far been working with, and I can partially understand the problems, but nevertheless, as the popularity of WALS shows, despite its well-known problems, there is a need on trying to just get agreement on standardizing the way we describe things. And morphology is an example for one thing that has been famously ignored during all the time. The fact that the little ambiguous dash ("-") is still so popular to annotate morphemes all over the world just underlines it, and it is clear that even in language-specific terms, without having any larger picture in mind, we can't go on with this...

levmichael commented 8 years ago

There are many interesting observations in the upstream thread, but I'd like to keep our eye on the (shorter term) ball: so, @amaliaskilton and I are on board @thiagochacon's boundary symbols; it would be nice to hear if @gomezimb is similarly OK with them, and especially what @nataliacp thinks from the RefLex side. It would be really nice to have this wrapped up in the next day or two, before I drop out of internet contact until May, so that we can be sure to be in agreement for the next phase of data processing and FLEx importation.

gomezimb commented 8 years ago

It's OK with me.

nataliacp commented 8 years ago

I have no problem with any symbols being in the non-aligned fields. I would like to reiterate though my reservations of putting all sorts of dash-like things in FUN. Would they be considered in automatic alignment or not? and if yes, would it matter if they are a clitic or an affix symbol? As I said before, I think these things diacrhronically break down and we shouldn't overcomplexify the quasi-phonemic representations. So, my vote is either nothing in the FUN or just one kind of dash, which afterwards can be changed manually in Reflex if necessary. Unless the automatic alignment algorithm will actually consider all kinds of boundaries as boundaries and be able to align across them even if they are different. In that case, effectively for lingpy we have one kind of boundary, which is a different shape depending on the synchronic situation and everything is ok. @LinguList can you clarify what is happening with all these boundaries in the automatic alignment algorithm?

gomezimb commented 8 years ago

When I wrote that it was OK with me, I meant that it is not a problem if someone enters those signs in my list, not me. Same for the tones. My data were completed thanks to Amalia because I'm handicapped, I cannot properly use my wright hand. And I've never used Reflex, Flew or whatever.

krstenzel commented 8 years ago

Hi all,

I’ve been following the discussion as best I could and have found it interesting, but I’m now completely confused as to what needs to be done with my data. Both KOT and WAI data were entered by my student according to the original guidelines. What am I supposed to do now? If everything now needs to entered differently, I will need a very clear new guide with examples of how things should look. I have a lot on my plate and no idea when I will be able to reenter data for two languages . . . .

Best,

Kris

From: gomezimb [mailto:notifications@github.com] Sent: Monday, March 7, 2016 5:58 AM To: digling/tukano-project Subject: Re: [tukano-project] Morpheme boundaries (#23)

When I wrote that it was OK with me, I meant that it is not a problem if someone enters those signs in my list, not me. Same for the tones. My data were completed thanks to Amalia because I'm handicapped, I cannot properly use my wright hand. And I've never used Reflex, Flew or whatever.

— Reply to this email directly or view it on GitHub https://github.com/digling/tukano-project/issues/23#issuecomment-193186696 .

thiagochacon commented 8 years ago

I also think FUN should be stripped off of morpheme boundaries, but perhaps in that case we will loose information that is important for alignment and analysis of sound change. But I would prefer a easy solution now, and just go without morpheme boundaries in FUN.

@nataliacp has raised an interesting empirical question, whether sound changes are sensitive to different kinds of morpheme boundaries. Based on lexical phonological rules, and how sound patterns differ amogn different kinds of morpehemes, I would think yes. Hence if we can have more informartion, the better it would be for rendering more details in the comparative analysis.

Regarding the cross-linguistic validity of these notations, I think notions such as affixes, clitics, etc., are cross-linguistic valid concepts, just like causatives, passives, absolutive, etc., but whose proper definition depends on language specific issues. For instance, if (some kinds of) clitics are phonologically bound to a host but with free syntax, each language will define what means to be "phonologically bound" and how free that morpheme is in syntax.

However, while concepts such as "causative", "passive" etc., are defined in functional terms, morpheme types must be defined in formal terms. Languages are more different in their forms then in their functions, structuralists would say. So a proper definition of morpheme types require a "relative" definition of formal properties (e.g. affix are phonologically "more" bound than clitics). If a language does not show formal differences between bound morphemes, then I think one should use only -> and <- affixal symbols. New symbols could be used as formal differences between morpheme types get more complex.

amaliaskilton commented 8 years ago

We would expect sound change to be surface sensitive to morph boundaries in two cases - where a sound change targets a phone that synchronically appears only in certain morphological contexts (eg in Mai, [d] only appears morph initially) and due to analogy. We are already representing morphologically conditioned phonology in the FUN form by altering the segments/tones to reflect basic phon processes, so I don't think we need to worry about preserving the morph boundary info by other means in FUN. I think I made this point previously some days ago in this thread.

I'd like to ask as well: What do we hope to gain by adding more detail to the FUN representations? Improved automatic alignments, or a more complete/more accurate set of cognate sets and sound changes at the end of the project? I get the impression it is the second, and if so, what matters is the total representation of the data in reflex (including fields such as comments and grammatical notes) and the phonological/morphological and language family knowledge of the people who make the cognate sets. For the languages that I know, it doesn't matter to me for cognate set purposes how the morphology is represented in reflex: I already know it and will refer to it in making/correcting alignments. This does not mean that the reflex reps are not important, only that they are not the only source of information that we as language family specialists will have at our disposal in making cognate sets.

On Monday, March 7, 2016, thiagochacon notifications@github.com wrote:

I also think FUN should be stripped off of morpheme boundaries, but perhaps in that case we will loose information that is important for alignment and analysis of sound change. But I would prefer a easy solution now, and just go without morpheme boundaries in FUN.

@nataliacp https://github.com/nataliacp has raised an interesting empirical question, whether sound changes are sensitive to different kinds of morpheme boundaries. Based on lexical phonological rules, and how sound patterns differ amogn different kinds of morpehemes, I would think yes. Hence if we can have more informartion, the better it would be for rendering more details in the comparative analysis.

Regarding the cross-linguistic validity of these notations, I think notions such as affixes, clitics, etc., are cross-linguistic valid concepts, just like causatives, passives, absolutive, etc., but whose proper definition depends on language specific issues. For instance, if (some kinds of) clitics are phonologically bound to a host but with free syntax, each language will define what means to be "phonologically bound" and how free that morpheme is in syntax.

However, while concepts such as "causative", "passive" etc., are defined in functional terms, morpheme types must be defined in formal terms. Languages are more different in their forms then in their functions, structuralists would say. So a proper definition of morpheme types require a "relative" definition of formal properties (e.g. affix are phonologically "more" bound than clitics). If a language does not show formal differences between bound morphemes, then I think one should use only -> and <- affixal symbols. New symbols could be used as formal differences between morpheme types get more complex.

— Reply to this email directly or view it on GitHub https://github.com/digling/tukano-project/issues/23#issuecomment-193272722 .

levmichael commented 8 years ago

1) So it seems that we are in agreement that there will be no morpheme boundaries in the quasiphonemic (FUN) representation, which means that none of us (including @krstenzel and @gomezimb) need to worry about altering anything in the quasiphonemic representation, since our original convention was not to include morpheme boundaries there.

2) It looks like we're also in agreement that we will enrich the phonemic representation with the boundary symbol set originally proposed by @thiagochacon. For people who are entering/curating data for languages that they are familiar with, this task should only take a couple of hours (unless they are unsure about things like affixal vs. clitic status of particular morphemes). I agree with @krstenzel that we will want a very explicit set of instructions/conventions regarding this process. And I assume, @nataliacp, that this enrichment needs to be carried out prior to RefLex importation, right?

thiagochacon commented 8 years ago

great!

digling / tukano-project

Morpheme boundaries #23