PhyloStar / CogDetect

A lightweight library for cognate clustering, converting IPA sequences to sound classes, and computing distances between languages

Please file issues for insufficiencies of lingpy #16

Open LinguList opened 7 years ago

LinguList commented 7 years ago

While reading the documentation of an extension of ipa2tokens in this repo, I noticed that you assume lingpy's ipa2tokens returns segments that, once joined, are not identical to the input string with whitespace removed. If this is really happening, it should be handled from within lingpy, and I would need some examples that trigger it to confirm. Note that you should make sure to normalize to one Unicode normalization form, as we do in lingpy, and that skipping this step may itself cause differences (currently, you are not normalizing in the script!). I don't know of any other reasons, but it would be extremely valuable to be told about those differences, so we can address them.
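To make the check concrete, here is a minimal sketch of how one could collect such trigger cases. It uses only the standard library; the function name `tokens_match_input` and the choice of NFC as the default form are assumptions for illustration, not part of lingpy, and the token lists stand in for whatever the tokenizer returns.

```python
import unicodedata


def tokens_match_input(tokens, original, form="NFC"):
    """Return True if the joined tokens equal the input without whitespace.

    Both sides are brought to the same Unicode normalization form first,
    so that differences caused only by composed vs. decomposed characters
    are not reported as tokenizer problems. `form` is an assumption here;
    use whichever form the rest of the pipeline uses.
    """
    joined = unicodedata.normalize(form, "".join(tokens))
    expected = unicodedata.normalize(form, "".join(original.split()))
    return joined == expected


# Tokens with a decomposed "a + combining tilde" against a precomposed input.
decomposed = ["t", "a\u0303"]
print(tokens_match_input(decomposed, "t \u00e3"))

# Tokens from which a break marker was removed against the raw input.
print(tokens_match_input(["t", "a"], "t.a"))
```

Any input for which this returns False after normalization would be a useful trigger to report.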

Anaphory commented 7 years ago

True! The difference is that LingPy's ipa2tokens removes the - and . characters that mark the ends of segments (and it is usually reasonable that it does).

I'm fully aware that what I use here is quite a dirty and ad-hoc way to do what I wanted to do. It is supposed to be only an interim solution ("Nothing lasts longer than a short provisional solution", though). I'll hopefully think of something better and suggest it to you at some point. I assume it would be an optional argument to ipa2tokens which tells it not to remove those characters but to do something else with them (Be their own token? Merge with previous token? Merge with following token?)
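A rough sketch of what those three options could look like, written as a wrapper around an arbitrary tokenizer rather than as a change to ipa2tokens itself. Everything here is hypothetical: the function name `tokenize_keeping_breaks`, the `mode` argument and its values, and the default `tokenizer=list` (one token per character) are placeholders for illustration, not lingpy's API.

```python
import re


def tokenize_keeping_breaks(word, tokenizer=list, breaks="-.", mode="own"):
    """Tokenize `word` but keep break markers instead of discarding them.

    mode="own":       each break marker becomes its own token
    mode="previous":  the marker is appended to the preceding token
    mode="following": the marker is prepended to the following token
    mode="drop":      the marker is removed
    """
    if mode not in ("own", "previous", "following", "drop"):
        raise ValueError("unknown mode: " + mode)
    # The capturing group makes re.split return the markers as well.
    pattern = "([" + re.escape(breaks) + "])"
    tokens = []
    pending = ""  # markers waiting for a following token
    for chunk in re.split(pattern, word):
        if not chunk:
            continue
        if len(chunk) == 1 and chunk in breaks:
            if mode == "own" or (mode == "previous" and not tokens):
                tokens.append(chunk)
            elif mode == "previous":
                tokens[-1] += chunk
            elif mode == "following":
                pending += chunk
            continue
        segments = list(tokenizer(chunk))
        if pending and segments:
            segments[0] = pending + segments[0]
            pending = ""
        tokens.extend(segments)
    if pending:
        # A trailing marker has no following token; keep it on its own.
        tokens.append(pending)
    return tokens


for mode in ("own", "previous", "following", "drop"):
    print(mode, tokenize_keeping_breaks("ma.ma", mode=mode))
```

In practice the `tokenizer` argument would be the real segmentation function, applied to each stretch between markers.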

LinguList commented 7 years ago

Ah, I see. This is of course a feature rather than a bug, as dots serve as vowel break markers and I don't see why one would keep them, although one could modify the function to keep the dot. We even have new annotations that keep the original material while converting parts of it, using a "source/target" annotation. This would allow marking a laryngeal in IE, h₂, as h₂/ə, meaning: lingpy will read it as schwa, while the segment is still laryngeal 2. We now also use clear-cut orthography profiles to convert from orthography to IPA-like representations. As far as this repo is concerned, I think it would be useful to have a larger discussion on that, so you know where we are right now and can explain to us why you might want to diverge from that.
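To illustrate the idea behind the "source/target" annotation, here is a small standalone sketch. It is not lingpy's implementation; the helper names `split_annotation`, `original_tokens` and `readable_tokens` are made up for this example, and it assumes a token carries at most one slash.

```python
def split_annotation(token):
    """Split a "source/target" token into (source, target).

    Tokens without a usable annotation are returned as (token, token).
    """
    source, sep, target = token.partition("/")
    if sep and source and target:
        return source, target
    return token, token


def original_tokens(tokens):
    """The segments as written in the data (e.g. h₂ stays h₂)."""
    return [split_annotation(t)[0] for t in tokens]


def readable_tokens(tokens):
    """The segments as they should be interpreted (e.g. h₂ is read as ə)."""
    return [split_annotation(t)[1] for t in tokens]


word = ["h₂/ə", "t", "e"]
print(original_tokens(word))
print(readable_tokens(word))
```

The same token list thus yields both views: the original segments for display and the converted ones for anything that needs a sound the software can interpret.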

In terms of implementation, the dot may be hard-coded, but one would need to look this up in the original code. In fact, you can adapt ipa2tokens to many of your needs, and I think the online tutorial where the function is described may even give further instructions. If not, let me know, and I'll explain some more about the basic ideas behind it.

LinguList commented 7 years ago

BTW, on CLDF, I recommend this page, as it is where I will develop the major specifications/recommendations, which are usually in line with what lingpy/edictor handle.

Anaphory commented 7 years ago

Thanks! That's helpful.