cldf-clts / pyclts


Proposal for parsing multi-character diacritics (see issue #45) #46

Open XachaB opened 3 years ago

XachaB commented 3 years ago

This is a potential implementation for parsing multi-character diacritics, such as the ultra-long marker (see issue #45).

Currently, diacritics are parsed by iterating over each character left over after matching sounds. This does not allow multi-character diacritics to be recognized.

One alternative is to construct regexes for diacritics (for each type of sound, one regex for pre-diacritics and one for post-diacritics) and to use them to split the string of remaining diacritics. This is what I do here, with the exact same functionality otherwise.
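To make the idea concrete, here is a minimal sketch of the regex-splitting approach (this is not the actual code in this PR; the diacritic inventory, sound type, and helper names are illustrative):

```python
import re

# Illustrative inventory; in pyclts these would be derived from the
# transcription system's diacritic data, keyed by sound type.
POST_DIACRITICS = {"vowel": ["ːː", "ː", "ˑ", "̃"]}

def diacritic_regex(diacritics):
    # Sort alternatives longest-first so that a multi-character diacritic
    # such as the ultra-long marker "ːː" wins over its prefix "ː".
    alternatives = sorted(diacritics, key=len, reverse=True)
    return re.compile("|".join(re.escape(d) for d in alternatives))

def split_diacritics(leftover, sound_type):
    """Split the string left over after matching the base sound into diacritics."""
    pattern = diacritic_regex(POST_DIACRITICS[sound_type])
    pos, found = 0, []
    while pos < len(leftover):
        match = pattern.match(leftover, pos)
        if match is None:
            return None  # unparsable residue, i.e. an unknown sound
        found.append(match.group())
        pos = match.end()
    return found

# For "aːː", matching the base vowel "a" leaves "ːː",
# which is now recognized as a single diacritic:
print(split_diacritics("ːː", "vowel"))  # ['ːː']
```

A character-by-character loop would instead see two separate "ː" marks here; the longest-first alternation is what lets the multi-character diacritic win.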

LinguList commented 3 years ago

Sorry, @XachaB, I only now saw this PR. I have been very busy lately. I'll check this later next week and also answer on the issue.

codecov-commenter commented 3 years ago

Codecov Report

Merging #46 (07b85b7) into master (c7420d9) will increase coverage by 0.00%. The diff coverage is 97.05%.


@@           Coverage Diff           @@
##           master      #46   +/-   ##
=======================================
  Coverage   94.65%   94.66%           
=======================================
  Files          33       33           
  Lines        1760     1780   +20     
=======================================
+ Hits         1666     1685   +19     
- Misses         94       95    +1     
Impacted Files                      Coverage Δ
src/pyclts/transcriptionsystem.py   96.55% <97.05%> (-0.17%) ↓



xrotwang commented 3 years ago

I think I'd agree with @LinguList that explicitly listing ultra-long phonemes in the respective data files would be more transparent than wholesale acceptance of stacked diacritics. While CLTS BIPA isn't as strict as e.g. Concepticon in only aggregating phonemes that have been encountered "in the wild", I'd still say that stacked diacritics more often signal problems with the data than actual phonemes. In addition, the code in transcriptionsystems is already more complex than I'd like it to be, so any additions to it feel like a step in the wrong direction.

XachaB commented 3 years ago

Ok, that makes sense.

I fully understand the wish not to add complexity to the system, especially for a single two-character diacritic. In that case I won't spend time finding out why one test isn't passing -- glad I didn't do so earlier either. Can I then make a PR to CLTS (not pyclts) adding a set of ultra-long sounds to the respective bipa data files?

I do need some way for these ultra-long diacritics not to be ignored by pyclts.
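For reference, once such graphemes are listed in the bipa data, one could check that pyclts resolves them with something like the following (a minimal sketch; the data path and grapheme are illustrative, and ./clts is assumed to be a local clone of the cldf-clts/clts data):

```python
from pyclts import CLTS

bipa = CLTS("./clts").bipa  # assumes a local clone of the CLTS data repository

# A grapheme that is not covered comes back as an unknown sound.
sound = bipa["aːː"]  # ultra-long a
print(sound.type, sound.name)
```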

LinguList commented 3 years ago

Sure, @XachaB, thanks in advance for your help here! We look forward to the PR in CLTS!