cldf-clts / clts

Cross-Linguistic Transcription Systems
https://clts.clld.org

Ruhlen's database: an example for NAPA #8

Open LinguList opened 7 years ago

LinguList commented 7 years ago

If you look at this example and compare it with the NAPA description, it seems that by adding NAPA we could link a lot of the data in Ruhlen's database, similar to the way we use IPA to link to Phoible...

LinguList commented 7 years ago

Very nice description provided by Ruhlen: http://starling.rinet.ru/typology-descr.pdf He seems to be aware of IPA, but it is clearly NAPA that he ends up with.

LinguList commented 6 years ago

I started to work on this database, but we get few matches, as the inventories are just idiosyncratic and messed up.

tresoldi commented 6 years ago

I also spent quite some time trying to parse Starling's data. I can't really understand why they don't provide the sources, considering they seem quite open in terms of access and collaboration (I was actually going to ask you about it with regard to Lexibank, as I believe you told me you were in contact with Starostin, but I wouldn't have had the time to do something like a Lexibank version of Pokorny...).

In any case, it is something one can do with some manual tweaking, such as selecting the fields that include segments and removing descriptions/comments. I think I could help with that, but are you aware of the supplementary material of Creanza et al. (2015) at http://www.pnas.org/content/112/5/1265.full?tab=ds ? She used Ruhlen's data and, while we cannot reproduce the steps towards her TSV files, parsing them would be really easy; maybe this data could be used as a "proxy" for Ruhlen?
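
A minimal sketch of what parsing such a TSV could look like (the file name and column names here are assumptions, not the actual layout of the supplementary data):

```python
import csv

# Hypothetical file and column names: the actual supplementary TSV
# from Creanza et al. (2015) may be laid out differently.
with open("creanza_phonemes.tsv", encoding="utf-8") as handle:
    reader = csv.DictReader(handle, delimiter="\t")
    inventories = {}
    for row in reader:
        # Collect one segment inventory per language.
        inventories.setdefault(row["language"], set()).add(row["phoneme"])

print(len(inventories), "inventories parsed")
```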

LinguList commented 6 years ago

My parser for Ruhlen is pretty bad and involves quite some work, as I had to replace the HTML superscripts, which are used inconsistently (also in the Excel file they offer), with regular Unicode. The latest state is in the repo, and I find some 300-odd sounds, but not all that they have in Fonetikode (although they have only spurious mappings between Phoible and GLD). My code dates back to Python 2 times, and I only slightly revised it today, so it is still messy and not really working (or let's say: we have not cleaned it up).
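
For illustration, the core of that cleanup could look like this (a sketch, not the actual code in the repo; which superscript letters occur in the data is an assumption):

```python
import re

# Map HTML superscript content to Unicode modifier letters.
# Which letters actually occur in the export is an assumption here.
SUPERSCRIPTS = {"h": "ʰ", "w": "ʷ", "j": "ʲ", "y": "ʲ", "n": "ⁿ"}

def replace_superscripts(text):
    """Replace <sup>x</sup> sequences by Unicode modifier letters."""
    return re.sub(
        r"<sup>(.*?)</sup>",
        lambda match: "".join(SUPERSCRIPTS.get(c, c) for c in match.group(1)),
        text,
    )

assert replace_superscripts("t<sup>h</sup>") == "tʰ"
```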

I don't think George Starostin can help much here: they probably have the data in Starling, where the errors are already present, and they offer it in Excel, from where Creanza took the data, I assume.

But you are actually right: we can just use Creanza's current database, and that is it. No need to do more. The phonetic system is a nice mix of NAPA and other traditions, so we'll need to figure out some mappings, but we might manage to at least link some of the notorious cases, now that I have introduced some new aliases in bipa. So it's settled: we use Creanza's version of Ruhlen. Perfect!

tresoldi commented 6 years ago

Great! I'll write a scrap_creanza script and open a PR. This one should be really, really easy.

We can go back to Ruhlen and Starling later. The latter, I was just checking, seems to be "manually" extracted (I'd guess with Perl) from a Clipper executable, i.e., an xBase database (something I haven't seen since a programming gig when I was a teenager back in the '90s). I would be tempted to ask for the full .DBF files (not the ones on the download page) and write them a Python extractor (there are actually, and luckily, Python libraries for dealing with .DBF)...
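
The core of such an extractor could be quite short, e.g. with the dbfread library (a sketch; the file name and encoding are assumptions, since we don't have the actual .DBF files):

```python
from dbfread import DBF  # pip install dbfread

# Hypothetical file name and encoding: the actual Starling .DBF
# files are not the ones offered on the download page.
table = DBF("ruhlen.dbf", encoding="cp866")

for record in table:
    # Each record is an ordered mapping of field names to values.
    print(dict(record))
```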

LinguList commented 6 years ago

It is even easier with Starling, where you can export to nice CSV from within Starling itself. The problem is that Starling is like Word, allowing superscripts and other font information, so you'll have to clean the data anyway.

But the scrape-creanza is most welcome. For automatic linking, we can't use plain BIPA, as I have been doing, so some fiddling with the data will be needed (I suppose they have different length markers, e.g., ā is a long vowel, not a vowel with a high tone, etc.), and I think that "/" is the same click as "|", but they use "/" for some unknown reason. Anyway, we're advancing...
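
To make those assumptions explicit, the fiddling could start from a small replacement table like this (a sketch; the mappings are our hypotheses, not confirmed by Ruhlen's sources):

```python
import unicodedata

# Working assumptions: the macron marks length, not high tone,
# and "/" stands for the same dental click as "|".
REPLACEMENTS = {
    "\u0304": "ː",  # combining macron -> length mark
    "/": "ǀ",       # slash -> dental click
}

def normalize_segment(segment):
    """Decompose a segment and rewrite length and click marks."""
    decomposed = unicodedata.normalize("NFD", segment)
    return "".join(REPLACEMENTS.get(char, char) for char in decomposed)

assert normalize_segment("ā") == "aː"
assert normalize_segment("/") == "ǀ"
```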

LinguList commented 6 years ago

We'll need a basic set of rules to clean up the data, before we can close this.

tresoldi commented 6 years ago

We do, but I can't proceed without Ruhlen's source. It is quite clear, for example, that the slash stands for the click pipe, but do we have a source for that? An alternative would be to go back to Starling's Ruhlen data, parse it, and look for the cases we have assumptions about, effectively testing our hypotheses.

LinguList commented 6 years ago

Well, I'd be more concerned with the errors introduced by some obviously wrongly encoded diacritics. But all the sources for the data are there; you just need to go through the web interface:

Each language also lists a source, so one could trace back the ugly "/" characters by reading the sources, but I'd even be inclined to skip most parts and just take care of the obvious ones, like "š", etc., which ARE in NAPA and also in the Tower of Babel's Global Lexicostatistical Database, where they use their own alphabet...

Surprisingly, Ruhlen says in the description of the database that he's using IPA: http://starling.rinet.ru/typology.pdf

LinguList commented 6 years ago

In fact, I should've re-read this, as he offers a complete chart:

[Screenshot: Ruhlen's complete transcription chart]

This shows even more clearly that we're dealing with NAPA-like alphabets, given also the treatment of retroflex sounds.

tresoldi commented 6 years ago

Interesting, I will try to start on it later. As for Ruhlen, maybe at first he used IPA in the manual material and it was converted to a digital format by someone else? Maybe by more than one person? It is a hypothesis to keep in mind while trying to decipher some of the diacritics.

LinguList commented 6 years ago

Yes, but since this is more for illustrative purposes, i.e., showing that the big transcription datasets only become usable if we invest some effort in normalization (or ignore the problem), it should be sufficient if we reach some 70-80 percent of all symbols used. And this will be rather straightforward when just mapping the haček symbols and the dotted ones to retroflex characters. Maybe it's best to include this in the scraper script: download the data and check at the same time for the BIPA equivalent, while correcting against a list of direct transformations (like š > ʃ, etc.).
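
A sketch of that combined check, assuming pyclts's CLTS/bipa interface and a local clone of the clts data; the transformation table is just a seed to be extended:

```python
from pyclts import CLTS  # pip install pyclts

# Seed list of direct transformations to apply before lookup;
# haček symbols and dotted retroflexes, to be extended as we go.
TRANSFORMS = {"š": "ʃ", "ž": "ʒ", "č": "tʃ", "ǯ": "dʒ", "ṭ": "ʈ", "ḍ": "ɖ"}

# Assumes a local clone of the clts data repository.
bipa = CLTS("clts-data").bipa

def check_segment(segment):
    """Apply direct transformations and test whether BIPA knows the result."""
    mapped = TRANSFORMS.get(segment, segment)
    return mapped, bipa[mapped].type != "unknownsound"

print(check_segment("š"))  # expected: ('ʃ', True)
```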