cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0
31 stars 13 forks source link

Averts IndexError on strings that are only diacritics #49

Closed lfashby closed 4 years ago

lfashby commented 4 years ago

Similar to #46:

import segments
segments.Tokenizer()("ʲ", ipa=True)

Will throw an IndexError. This change will skip trying to combine "ʲ" with an empty result list by checking if graphemes is longer than one character.

codecov-io commented 4 years ago

Codecov Report

Merging #49 into master will not change coverage. The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff          @@
##           master    #49   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files          12     12           
  Lines         393    394    +1     
=====================================
+ Hits          393    394    +1
Impacted Files Coverage Δ
src/segments/tokenizer.py 100% <ø> (ø) :arrow_up:
tests/test_tokenizer.py 100% <100%> (ø) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update b76d562...52a4843. Read the comment docs.

xrotwang commented 4 years ago

I'm inclined to merge this, since this treatment would be consistent with what we did in #46. But after merging #46 I started thinking whether raising an error isn't a more appropriate default? What do you think @bambooforest ? Or should we add a strict flag to trigger the old behaviour?

kylebgorman commented 4 years ago

We (as downstream users) can deal with either. That said, I think we'd weakly prefer strings like this and the one in #46 to to get identity-transformed, since otherwise we have to catch the error and then force that interpretation.

bambooforest commented 4 years ago

Having had to deal with these errors a lot, I'm inclined to agree with @kylebgorman and identity-transform this type of input @xrotwang (also as you mention because it follows #46 ). But for the user, this might implicitly suggest that we do something sensible with ipa=True, as in check that it's valid IPA input (whatever that is ;) , so I agree a strict flag would be nice so that it fails when some set of characters are introduced that aren't IPA -- but which ones?

https://github.com/unicode-cookbook/cookbook/tree/master/book/tables

Strict? Valid? Either or? Then we come around to user-defined OPs.

xrotwang commented 4 years ago

Thinking a bit more, I'd say identity transform is the only reasonable thing to do. After all, we don't do any checking whether the string we add modifiers to is anything sensible - other than not empty ...

Steven Moran notifications@github.com schrieb am Di., 19. Nov. 2019, 18:08:

Having had to deal with these errors a lot, I'm inclined to agree with @kylebgorman https://github.com/kylebgorman and identity-transform this type of input @xrotwang https://github.com/xrotwang (also as you mention because it follows #46 https://github.com/cldf/segments/issues/46 ). But for the user, this might implicitly suggest that we do something sensible with ipa=True, as in check that it's valid IPA input (whatever that is ;) , so I agree a strict flag would be nice so that it fails when some set of characters are introduced that aren't IPA -- but which ones?

https://github.com/unicode-cookbook/cookbook/tree/master/book/tables

Strict? Valid? Either or? Then we come around to user-defined OPs.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cldf/segments/pull/49?email_source=notifications&email_token=AAGUOKCDVUMPVOFNJHR2YJ3QUQMPLA5CNFSM4JO47LRKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEO6SKQ#issuecomment-555608362, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGUOKAHTIGZ7O35MKGPWXDQUQMPLANCNFSM4JO47LRA .

xrotwang commented 4 years ago

@lfashby @kylebgorman just released v2.1.3, including this (and the previous) PR: https://pypi.org/project/segments/

kylebgorman commented 4 years ago

TY.

On Thu, Nov 21, 2019 at 9:31 AM Robert Forkel notifications@github.com wrote:

@lfashby https://github.com/lfashby @kylebgorman https://github.com/kylebgorman just released v2.1.3, including this (and the previous) PR: https://pypi.org/project/segments/

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cldf/segments/pull/49?email_source=notifications&email_token=AABG4OPD4NTQHZHITWEDETDQU2LVZA5CNFSM4JO47LRKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEE2NK4A#issuecomment-557110640, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABG4OLKBNYS63C64XNDDSDQU2LVZANCNFSM4JO47LRA .