XachaB opened this pull request 3 years ago
Sorry, @XachaB, I only now saw this PR. I have been very busy lately. I'll check this later next week and also answer on the issue.
Merging #46 (07b85b7) into master (c7420d9) will increase coverage by 0.00%. The diff coverage is 97.05%.
```diff
@@           Coverage Diff            @@
##           master      #46   +/-   ##
=======================================
  Coverage   94.65%   94.66%
=======================================
  Files          33       33
  Lines        1760     1780   +20
=======================================
+ Hits         1666     1685   +19
- Misses         94       95    +1
```
Impacted Files | Coverage Δ |
---|---|
src/pyclts/transcriptionsystem.py | 96.55% <97.05%> (-0.17%) :arrow_down: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c7420d9...07b85b7.
I think I'd agree with @LinguList that explicitly listing ultra-long phonemes in the respective data files would be more transparent than wholesale acceptance of stacked diacritics. While CLTS BIPA isn't as strict as e.g. Concepticon in only aggregating phonemes that have been encountered "in the wild", I'd still say that stacked diacritics may more often signal problems with the data than actual phonemes. In addition, I think the code in transcriptionsystems is already more complex than I'd like it to be, so any additions to it feel like a step in the wrong direction.
Ok, that makes sense.
I fully understand the wish not to complicate the system, especially for a single two-character diacritic. In that case I won't spend time trying to find out why one test isn't passing -- glad I didn't do so earlier, either. Can I then make a PR to CLTS (not pyclts) adding a set of extra-long sounds to the respective bipa data files?
I do need some way for these extra-long diacritics not to be ignored by pyclts.
Sure, @XachaB. Thanks in advance for your help here! We look forward to the PR in CLTS!
This is a potential implementation for parsing multi-character diacritics, such as ultra-long (see issue #45).
Currently, diacritics are parsed by iterating over each character left over after matching sounds. This does not allow for recognizing multi-character diacritics.
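For illustration, here is a minimal sketch of that limitation (the names and the toy inventory are made up for this example, not the actual pyclts code):

```python
# Hypothetical per-character diacritic parsing, for illustration only.
# Scanning one character at a time means a two-character diacritic such as
# the ultra-long mark "ːː" can only ever be seen as two separate "ː" marks.
KNOWN_DIACRITICS = {"ː": "long", "ʰ": "aspirated"}  # assumed toy inventory

def parse_char_by_char(leftover):
    """Return the features found by scanning one character at a time."""
    features = []
    for char in leftover:
        if char not in KNOWN_DIACRITICS:
            return None  # unknown diacritic -> parsing fails
        features.append(KNOWN_DIACRITICS[char])
    return features

print(parse_char_by_char("ːː"))  # ['long', 'long'], never a single "ultra-long"
```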
One alternative is to construct regexes for the diacritics (for each type of sound, one regex for pre-diacritics and one for post-diacritics) and use them to split the string of remaining diacritics. This is what I am doing here, with exactly the same functionality otherwise.
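A rough sketch of the idea, showing only post-diacritics (names and the toy inventory are made up for illustration, not the actual code in this PR):

```python
import re

# Hypothetical regex-based diacritic splitting, for illustration only.
# The alternation is built longest-first, so the two-character ultra-long
# mark "ːː" is matched before the single length mark "ː".
POST_DIACRITICS = {"ːː": "ultra-long", "ː": "long", "ʰ": "aspirated"}

def build_diacritic_regex(graphemes):
    alternatives = sorted(graphemes, key=len, reverse=True)
    return re.compile("|".join(re.escape(g) for g in alternatives))

POST_REGEX = build_diacritic_regex(POST_DIACRITICS)

def split_post_diacritics(leftover):
    """Split the leftover string into known diacritic graphemes, or None."""
    pieces, pos = [], 0
    while pos < len(leftover):
        match = POST_REGEX.match(leftover, pos)
        if match is None:
            return None  # unknown material among the diacritics
        pieces.append(match.group(0))
        pos = match.end()
    return pieces

print(split_post_diacritics("ːː"))  # ['ːː'] -> one ultra-long diacritic
print(split_post_diacritics("ʰː"))  # ['ʰ', 'ː']
```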