cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation

Apache License 2.0


Vietnamese tone contour grouping #50

Closed lfashby closed 3 years ago

lfashby commented 3 years ago

The tokenizer currently drops a glottalization diacritic (ˀ) in instances like the following:

import segments

segments.Tokenizer()("ʔɓaːn˧˩ ŋaː˦ˀ˥", ipa=True)
# Returns: ʔ ɓ aː n ˧˩ # ŋ aː ˦˥

(˦ˀ˥ is used to mark a contour tone in Vietnamese Wiktionary transcriptions.)

This PR changes the behavior so that the Tokenizer call above returns ʔ ɓ aː n ˧˩ # ŋ aː ˦ˀ˥. All the tests still pass and, as far as I can tell, the change does not lead to any unwanted tokenization side effects. If anyone has an alternative solution to this problem, feel free to let me know.
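For readers who want to see the grouping idea in isolation, here is a minimal standalone sketch. It is not the code in src/segments/tokenizer.py. The function name `group_tone_contours` and its rules are assumptions made for illustration only. It relies on Python's `unicodedata` categories: the tone letters (˧ ˩ ˦ ˥) are category Sk, and the glottalization mark ˀ is category Lm. A modifier letter is kept inside a contour only when it sits directly between two tone letters.

```python
import unicodedata


def is_tone_letter(token):
    # Tone letters such as ˧ ˩ ˦ ˥ have Unicode category "Sk".
    return len(token) == 1 and unicodedata.category(token) == "Sk"


def is_modifier_letter(token):
    # Modifier letters such as ˀ (U+02C0) have Unicode category "Lm".
    return len(token) == 1 and unicodedata.category(token) == "Lm"


def group_tone_contours(tokens):
    """Hypothetical helper: merge adjacent tone letters into one contour.

    A single modifier letter is kept inside the contour only if it is
    directly followed by another tone letter (as in ˦ˀ˥).
    """
    result = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if not is_tone_letter(token):
            result.append(token)
            i += 1
            continue
        contour = token
        i += 1
        while i < len(tokens):
            if is_tone_letter(tokens[i]):
                contour += tokens[i]
                i += 1
            elif (
                is_modifier_letter(tokens[i])
                and i + 1 < len(tokens)
                and is_tone_letter(tokens[i + 1])
            ):
                # Keep the sandwiched modifier instead of dropping it.
                contour += tokens[i] + tokens[i + 1]
                i += 2
            else:
                break
        result.append(contour)
    return result


print(" ".join(group_tone_contours(["ŋ", "aː", "˦", "ˀ", "˥"])))
```

In this sketch a modifier letter that trails a contour without a following tone letter is left as its own token, so only the sandwiched case from the Vietnamese example is affected.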

codecov-io commented 3 years ago

Codecov Report

Merging #50 (a0fb83f) into master (6e5cd9a) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #50   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           12        12           
  Lines          404       404           
=========================================
  Hits           404       404

Impacted Files	Coverage Δ
src/segments/tokenizer.py	`100.00% <100.00%> (ø)`


LinguList commented 3 years ago

Looks fine to me, although I'd suggest adding a test for this function to make sure the behavior is as expected (currently, I think the function is only tested indirectly).

xrotwang commented 3 years ago

@lfashby thanks! Let me know if/when you need this functionality in a released version of the package.

lfashby commented 3 years ago

Thank you! @xrotwang It'd be great if you could do a PyPI release with this change at your earliest convenience.

xrotwang commented 3 years ago

Just released with segments 2.2.0.