cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Major Refactoring #35

Closed · xrotwang closed this 6 years ago

xrotwang commented 6 years ago

closes #27 closes #32

codecov-io commented 6 years ago

Codecov Report

Merging #35 into master will decrease coverage by 0.11%. The diff coverage is 96.92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #35      +/-   ##
==========================================
- Coverage   99.24%   99.12%   -0.12%     
==========================================
  Files           6        7       +1     
  Lines         264      229      -35     
  Branches       56        0      -56     
==========================================
- Hits          262      227      -35     
- Misses          0        2       +2     
+ Partials        2        0       -2
Impacted Files              Coverage Δ
src/segments/metadata.py      0% <0%> (ø)
src/segments/errors.py      100% <100%> (ø) ⬆
src/segments/tokenizer.py   100% <100%> (ø) ⬆
src/segments/__main__.py    100% <100%> (ø) ⬆
src/segments/tree.py        100% <100%> (+3.33%) ⬆
src/segments/util.py        100% <100%> (ø) ⬆

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 012c1ff...647626a.

tresoldi commented 6 years ago

Seems great. You forgot to mention that this also implements REPLACEMENT_MARKER, which, while perhaps trivial, is a very positive change in my view. +1 for merging

xrotwang commented 6 years ago

I like how this condenses the main functionality into

If we now add a bunch of tests with cases we encounter in lexibank, there's not too much risk of running into backwards-incompatibility problems, I guess.
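
For instance, such regression tests might take roughly this shape (a sketch assuming the Tokenizer API shown in the segments README; the cases below are placeholders, not actual lexibank data):

```python
import pytest
from segments import Tokenizer

# Placeholder cases; real lexibank (or LingPy tutorial) input/output pairs would go here.
CASES = [
    ('abcd', 'a b c d'),  # default behaviour: one Unicode grapheme cluster per segment
]

@pytest.mark.parametrize('text, expected', CASES)
def test_tokenize_without_profile(text, expected):
    tokenizer = Tokenizer()  # no orthography profile
    assert tokenizer(text) == expected
```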

xrotwang commented 6 years ago

As far as I'm concerned, we could even make the orthography profile and the input/output from the LingPy tutorial part of the segments test suite.

xrotwang commented 6 years ago

@bambooforest Ok with merging? From my point of view, this is backwards compatible, although I did remove some non-private methods on Tokenizer (which I think should have been private to begin with :)). If so, this would become 1.2; otherwise I'd also be happy to start the 2.x series with this.

LinguList commented 6 years ago

Short question: is the new behavior of allowing lists as input and output already implemented here, or do we carry on in the way we know it? If the former, could you provide a short example?

xrotwang commented 6 years ago

@LinguList the external interface has not changed: input and output are still strings; only internally is data handled as lists. I.e. .split() and ' '.join() are each called only once, in the outermost layer of the code, when receiving input and when creating output.
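
For illustration, a minimal sketch of that pattern (the class and method names here are hypothetical, not the actual Tokenizer internals): the string is split once on entry, lists are passed through the internal pipeline, and a string is joined once on exit.

```python
class BoundaryOnlyTokenizer:
    """Illustrative only: strings at the boundary, lists everywhere inside."""

    def __call__(self, string, word_separator=' # '):
        # Split exactly once, in the outermost layer, when receiving input.
        words = string.split()
        # Internal methods exchange lists of segments, so none of them
        # needs to know about word or segment boundary markers.
        tokenized = [self._segment_word(word) for word in words]
        # Join exactly once, when creating the output string.
        return word_separator.join(' '.join(segments) for segments in tokenized)

    def _segment_word(self, word):
        # Hypothetical stand-in for the real segmentation logic:
        # returns a list of segments for a single word.
        return list(word)
```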

xrotwang commented 6 years ago

@LinguList The main goal of this refactoring is transparency: passing data as strings between private methods meant splitting and joining multiple times. It also meant that the (configurable) word and segment boundary markers had to be known to (and used by) each private method, which seemed error-prone.
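
By way of contrast, a hypothetical sketch of the kind of coupling this removes (not the previous Tokenizer code): when each private step receives and returns strings, every step has to re-split, re-join, and agree on the separator.

```python
def _apply_rule(string, separator=' '):
    # Every helper must know the separator to re-split its input ...
    segments = string.split(separator)
    # ... operate on the resulting list ...
    segments = [s for s in segments if s]
    # ... and re-join before handing a string on to the next helper.
    return separator.join(segments)
```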

bambooforest commented 6 years ago

@xrotwang - I guess you have to act fast around here :-) Anyway, thanks for the update.

LinguList commented 6 years ago

@xrotwang, I see, and I agree that this is quite important. I remember having some issues with this in the past.

xrotwang commented 6 years ago

@bambooforest sorry for rushing this. I have a long list of stuff depending on this, which made me nervous. I truly think this is backwards compatible, though :)