Closed lfashby closed 3 years ago
Merging #50 (a0fb83f) into master (6e5cd9a) will not change coverage. The diff coverage is 100.00%.
@@ Coverage Diff @@
## master #50 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 12 12
Lines 404 404
=========================================
Hits 404 404
Impacted Files | Coverage Δ | |
---|---|---|
src/segments/tokenizer.py | 100.00% <100.00%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6e5cd9a...a0fb83f.
Looks fine to me, although I'd suggest adding a test for this function to make sure the behavior is as expected (currently, the function is only tested indirectly, I think).
@lfashby thanks! Let me know if/when you need this functionality in a released version of the package.
Thank you! @xrotwang It'd be great if you could do a pypi release with this change at your earliest convenience.
Just released with segments 2.2.0.
The tokenizer currently drops a glottalization diacritic (`ˀ`) in instances like the following:
(`˦ˀ˥` is used to mark a contour tone in Vietnamese Wiktionary transcriptions.)
This PR changes this behavior such that the output of the `Tokenizer` call above is `ʔ ɓ aː n ˧˩ # ŋ aː ˦ˀ˥`. All the tests still pass, and as far as I can tell this change does not lead to any unwanted tokenization side effects, but if anyone has alternative solutions to this problem, feel free to let me know.
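To make the described behavior concrete, here is a minimal standard-library sketch of how a modifier letter sitting between two tone letters can get lost, and how carrying it into the merged contour keeps it. This is not the code in `src/segments/tokenizer.py`: the function names `group_segments` and `tokenize_sketch`, the `keep_pending` flag, and the input string (reconstructed from the output quoted above) are all assumptions made for illustration.

```python
import unicodedata


def group_segments(chars, keep_pending=True):
    """Group characters into segments, scanning right to left (simplified sketch).

    - Modifier letters (category Lm, e.g. ː or ˀ) are held back and attached
      to the next character on their left.
    - Adjacent tone letters (category Sk, e.g. ˦ ˥) are merged into one contour.

    keep_pending=False imitates the dropped-diacritic behavior: a modifier
    letter held back between two tone letters is discarded on merge.
    """
    result = []   # segments, collected in reverse order
    pending = ""  # modifier letters waiting for a character to attach to
    for ch in reversed(chars):
        cat = unicodedata.category(ch)
        if cat == "Lm":
            pending = ch + pending
            continue
        if cat == "Sk" and result and unicodedata.category(result[-1][0]) == "Sk":
            # Merge this tone letter into the contour to its right,
            # optionally keeping the modifier letter found in between.
            middle = pending if keep_pending else ""
            result[-1] = ch + middle + result[-1]
            pending = ""
            continue
        result.append(ch + pending)
        pending = ""
    if pending:
        # A word-initial modifier letter has nothing to attach to.
        result.append(pending)
    return result[::-1]


def tokenize_sketch(text, keep_pending=True):
    """Join segments with spaces and words with ' # ' (hypothetical helper)."""
    words = text.split()
    return " # ".join(" ".join(group_segments(w, keep_pending)) for w in words)


# Hypothetical input, inferred from the output quoted in the description.
print(tokenize_sketch("ʔɓaːn˧˩ ŋaː˦ˀ˥"))
print(tokenize_sketch("ʔɓaːn˧˩ ŋaː˦ˀ˥", keep_pending=False))
```

With `keep_pending=True` the final segment stays `˦ˀ˥`; with `keep_pending=False` the `ˀ` is lost and the contour collapses to `˦˥`, which is the kind of loss the description reports.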