digling / burmish

LingPy plugin for handling a specific dataset
GNU General Public License v2.0

Data by Nishi1999 #77

Closed LinguList closed 7 years ago

LinguList commented 7 years ago

Nishi is another dataset, just like Mann1998, consisting of morphemes. We need to do the following:

The data is pages 100-107 of Nishi 1999 (Four Papers on Burmese: Toward the history of Burmese (the Myanmar language). Tokyo: Institute for the Study of Languages and Cultures of Asia and Africa (ILCAA), Tokyo University of Foreign Studies). These data amount to 359 proposed cognates among eight languages, viz. Written Burmese, Spoken Burmese, Achang, Xiandao, Zaiwa, Leqi, Langsu, and Bola. The non-Burmese data are cited from "ZYC, except for a few Achang and Zaiwa forms which are supplied from (Xu and Xu 1984) and (Dai and Cai 1985). Note that entries of all the four Burmish languages, Burmese (and Mod. WB. transliterated by the Beijing method), Achang, Zaiwa and Langsu contained in ZYHC are supplied by the same authors as those in ZYC" (Nishi 1999: 96). (It looks like he also cites from Dai et al. 1991.)

The phoneme inventories are from the same work, pp. 90-94. Nishi uses ñ in place of a sign the Chinese use for the palatal n (not the usual one), and he uses ï for the apical vowel. He marks irregular cognates in bold, and he adds notes using the abbreviations x/x, d/c, and d.
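Before building an orthography profile, Nishi's substitute characters could be normalized in one pass. A minimal sketch, assuming the IPA targets (the Sinological alveolo-palatal nasal ȵ for the palatal n, and ɿ for the apical vowel) — both target symbols are assumptions, not taken from Nishi himself:

```python
# Hypothetical normalization table for Nishi's idiosyncratic characters.
# The IPA values on the right are assumptions (Sinological conventions),
# not confirmed readings from the source.
NISHI_TO_IPA = {
    "ñ": "ȵ",  # assumed: the non-standard palatal nasal of Chinese sources
    "ï": "ɿ",  # assumed: the apical vowel
}

def normalize(form: str) -> str:
    """Replace Nishi's substitute characters with their assumed IPA values."""
    for src, tgt in NISHI_TO_IPA.items():
        form = form.replace(src, tgt)
    return form

print(normalize("ñaï"))  # → ȵaɿ
```

Bold marking of irregular cognates would still need separate handling (e.g. the search-and-replace by * discussed below in this thread).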

LinguList commented 7 years ago

This is a first automatically created profile of sequences that need to be explained.

Nishi1999.xlsx

The data looks rather messy, unfortunately, as there seem to be many idiosyncratic characters.

nh36 commented 7 years ago

Nathan has made a new version of Nishi. The data are now cleaner. Mattis must recompile the orthography profile for Nathan to check.

nh36 commented 7 years ago

Nishi.ods.zip

LinguList commented 7 years ago

Short question: the bold things in Nishi, are they meaningful? If so, I'd try to search-and-replace them by * and code them differently, maybe adding a note for those words...

LinguList commented 7 years ago

Looks much nicer; I am just preparing the concept list. There was one hidden row, though, which you should have a look at (but I resolved it): line 367 was hidden, and so it was exported, but with many blank lines. I have now moved it up to where it belongs, and also corrected one Nishi gloss: "to dye" instead of "dye (cloth)", line 100. BTW: it's good having the original source noted there, as in this way we can trace back to Sun 1991 (that is ZMYYC, right?).

LinguList commented 7 years ago

So here is the currently mapped data for Nishi (automatic mapping, percentage: 0.79), which is not bad, actually:

Nishi-1999-mapped.ods.zip

I leave it to @nh36 to have a closer look at it, but I will later double-check your cleaned version. The algorithm is better now, but it also yields a lot of possibilities; still, I consider this important, as we should be as strict as possible with those mappings.

LinguList commented 7 years ago

And here's the new test for the profile. Not much changed, to be honest, but it looks clearer now. I suppose it's time to just work with the data as is; there are some five exceptions with tones, but I will handle them explicitly once I run the profile to re-create the data.

Nishi1999-prf.xlsx

nh36 commented 7 years ago

The things in bold are those that Nishi himself identified as irregular. ZMYYC is Sun1991; it is his usual source. But note that he uses several other sources when he thought he could get better data there.

Dr Nathan W. Hill Reader in Tibetan and Historical Linguistics Department of China & Inner Asia and Department of Linguistics SOAS, University of London Thornhaugh Street, Russell Square, London WC1H 0XG, UK Tel: +44 (0)20 7898 4512

Profile -- http://www.soas.ac.uk/staff/staff46254.php

Tibetan Studies at SOAS -- http://www.soas.ac.uk/cia/tibetanstudies/


LinguList commented 7 years ago

Just saw: even better, as you halved the number of rows, so this is really working nicely now!

LinguList commented 7 years ago

Once we have linked Sun1991 to concepticon, we can directly compare across the sources (also provided we have orthography profiles for Sun1991).

BTW: the workflow we are following now could definitely be optimized. I think it is time I start thinking about a script one could run to create an initial orthography profile for a given dataset. I'll make that an issue, and I'll probably handle it by writing a new function for either LingPy or the original orthography profile code, as it is of general interest to users, I'd say.
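A minimal sketch of what such a script could do: treat a first-pass orthography profile as a frequency list of grapheme clusters (a base character plus any following combining marks). The function names and output shape here are assumptions for illustration, not the eventual LingPy API:

```python
# Sketch of an initial orthography profile: count grapheme clusters
# (base character + trailing combining marks) over all word forms.
import unicodedata
from collections import Counter

def graphemes(form: str):
    """Yield clusters of one base character plus following combining marks."""
    cluster = ""
    for ch in unicodedata.normalize("NFD", form):
        if unicodedata.combining(ch) and cluster:
            cluster += ch  # attach diacritic to the preceding base char
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

def initial_profile(forms):
    """Return (grapheme, frequency) pairs, most frequent first."""
    counts = Counter(g for form in forms for g in graphemes(form))
    return counts.most_common()

# Each grapheme would then get a manual column mapping it to IPA.
for grapheme, freq in initial_profile(["ba", "bà", "ad"]):
    print(grapheme, freq)
```

The human step remains the same as in this thread: checking the generated rows and filling in the target segments by hand.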

nh36 commented 7 years ago

So, I guess I will hold off on the Concepticon mapping and orthography profile for Nishi1999, since they can instead be done as part of Sun1991 or using a (semi-)automatic system that you will develop. Right?

LinguList commented 7 years ago

Every source has its own right, and as far as I can see, we don't know whether Sun1991 uses the same concept labels and the same orthography. Nishi may well have adjusted those. And since Sun1991 is also originally Chinese, there may be some divergences in translation. So I prefer to do the work twice, one time for Nishi and one time for Sun, and then check the overlap, which will also be interesting as a scientific study on the sociology of research, as I guess we may find some coding errors, and it is interesting to see how they could influence an analysis.

nh36 commented 7 years ago

Ok, in that case I will do the Nishi. You can work on automating things, but at least the Nishi will be done already.


LinguList commented 7 years ago

Yep, exactly what I was thinking. There is a possibility that we are doing more than necessary here, but I prefer taking that risk over risking further changes to the data. Sun1991 is extremely interesting for us, but Nishi is a lower-hanging fruit and also important for the QPA, as with this source, and with Mann, we then have concrete tests where we can compare with your analysis of Huang1992. Already that comparison would be research that has not been carried out yet, I'd say.

nh36 commented 7 years ago

Here is the Nishi Concepticon mapping. In many cases I have left some ambiguities; generally this is where Nishi seems to want to combine two Concepticon concepts. In some cases I changed the automatic mapping to '???' because none of the Concepticon concepts seems to work (e.g. 'dream v.i.', which is certainly not the same as 'dream (something)').

I also attach the Nishi orthography profile, but I am not sure it was done correctly. I have fixed mistakes where I have seen them (mostly changing 't s' into 'ts' and things like that), but I find it odd that whole words come up unsegmented into initial and final.


LinguList commented 7 years ago

Alright, thanks! As to the point about whole words being unsegmented: this means that the algorithm did not find a vowel. This is easy to explain: as I said, LingPy doesn't know that an a with a dot under it is a vowel, as this is the first time LingPy is confronted with it. I should add the dotted characters to LingPy, but it is a bit tedious and not fun, so I know I need to do it, but I don't want to do it now. And I keep being annoyed by the Zipfian distributions: LingPy recognizes an enormous number of sounds correctly now, but each dataset keeps having just one more sound I have not seen before. The rule is: if the algorithm does not find a vowel, it just shows the full word form, as in cases of syllabic nasals, for example, where we need to re-map anyway. I'll work from there and prepare an updated version of the profile, so you can see what I would do in those cases.
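The fallback rule described here can be illustrated with a toy sketch (this is not LingPy's actual code, and the vowel set is a deliberate assumption): if no known vowel is found in a form, the whole form is kept unsegmented and flagged for manual re-mapping.

```python
# Toy illustration of the "no vowel found" fallback, not LingPy's real code.
# The vowel inventory is deliberately incomplete: it lacks "ạ" (a with
# a dot below), mimicking LingPy not yet knowing that character.
KNOWN_VOWELS = set("aeiou")

def segment(form: str):
    """Split into initial and final at the first known vowel, else keep whole."""
    for i, ch in enumerate(form):
        if ch in KNOWN_VOWELS:
            return [form[:i], form[i:]]
    return [form]  # no vowel found: full form, as with syllabic nasals

print(segment("ka"))   # → ['k', 'a']
print(segment("kạ"))   # the dotted vowel is unknown, so the form stays whole
```

Adding the dotted vowels to the known inventory would make such forms segment like any other.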

LinguList commented 7 years ago

Ah: could you upload the profile and the mapping on the website? If you attach them to an email, they do not get submitted...

nh36 commented 7 years ago

Nishi1999-prf corrected by NH.xlsx

nh36 commented 7 years ago

Nishi-1999-mapped corrected by NH.ods.zip

nh36 commented 7 years ago

Nathan still needs to:

- check phoneme inventories in the original source and, if there are any, type them up
- provide a small description of the dataset

nh36 commented 7 years ago

Here is the phoneme inventory for Nishi 1999. I am not sure it is the format you will want, but it should work one way or another.

- x means 'doesn't have'
- check means 'has'
- airplane means 'only in loans'
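Decoding those cell marks into a per-language phoneme status table is straightforward. A hedged sketch, where the concrete symbols ('x', '✓', '✈') and the function name are assumptions about what the spreadsheet actually contains:

```python
# Hypothetical decoder for the inventory spreadsheet's cell marks.
# The symbols below are assumed renderings of "x", "check", "airplane".
CELL_MARKS = {
    "x": "absent",      # language doesn't have the phoneme
    "✓": "present",     # language has it
    "✈": "loans-only",  # only in loanwords
}

def decode_inventory(rows, languages):
    """rows: {phoneme: [mark, ...]} with marks aligned to the language list."""
    return {
        phoneme: {lang: CELL_MARKS.get(mark, "unknown")
                  for lang, mark in zip(languages, marks)}
        for phoneme, marks in rows.items()
    }

table = decode_inventory({"ts": ["✓", "x", "✈"]},
                         ["Written Burmese", "Achang", "Zaiwa"])
print(table["ts"]["Achang"])  # → absent
```

A structure like this would also be easy to flatten back into orthography-profile rows if the inventory file doubles as a profile, as suggested below.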

LinguList commented 7 years ago

Nishi phoneme inventory.xlsx

Excellent, I just uploaded it here, but I have it locally as well. I'll change the tone letters to upper case, but otherwise the format is very convenient, and it probably directly qualifies as an orthography profile (but I will need to test this).

nh36 commented 7 years ago

So, this issue can be closed, right? Although my data description, at the top, probably needs to be moved somewhere else.

LinguList commented 7 years ago

Don't close it right now, as I'll need to add the profile to the repository. I just assigned myself to get this finalized.

nh36 commented 7 years ago

Please send an update on this thread.

LinguList commented 7 years ago

Okay, Nishi1999 is the next target, as Mann1998, Nishi1999, and Huang1992 seem to be central (and the other Chinese source, whose name I keep forgetting...).

nh36 commented 7 years ago

This issue may now be superseded. Please review and confirm.

LinguList commented 7 years ago

Yes, the issue which follows on this one is the wrong concept list in the CSV file (#90); all relevant data should be there. Please look in the folder called "raw" in Nishi for the CSV file I have been using (downloading and opening it in OpenOffice should be straightforward, I hope).