cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0
31 stars 13 forks source link

Refactored error handling #8

Closed xrotwang closed 7 years ago

xrotwang commented 7 years ago

As outlined in #7, handling of tokenization errors is refactored to follow the way python handles encoding/decoding errors.

Tokenizer.__call__ gains a keyword argument errors which accepts a string, specifying the desired behaviour, defaulting to 'replace'. Following the python practice, U+FFFD REPLACEMENT CHARACTER is used as default to signal replacement.

closes #7

codecov-io commented 7 years ago

Current coverage is 84.45% (diff: 100%)

Merging #8 into master will increase coverage by 47.20%

@@             master         #8   diff @@
==========================================
  Files             9          8     -1   
  Lines           784        373   -411   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits            292        315    +23   
+ Misses          492         58   -434   
  Partials          0          0          

Powered by Codecov. Last update 8723c75...41dd081