Closed aquatiko closed 5 years ago
One error that I was facing is initialisation of MERGESYMBOL
with some unicode character (say \leftrightarrow
), for now I have initialised it with '~'
. But this might cause some unexpected result if the input string contains it. Still thinking how to fix it.
One error that I was facing is initialisation of
MERGESYMBOL
with some unicode character (say\leftrightarrow
), for now I have initialised it with'~'
. But this might cause some unexpected result if the input string contains it. Still thinking how to fix it.
You use a unicode character that can't appear in text. Like something from the private use area e.g. https://www.fileformat.info/info/unicode/char/e302/index.htm
You'll need to rebase this should be a simple rebase. Also how do you feel about restructuring this to use the fast new tokenbuffer that was just merged. https://github.com/JuliaText/WordTokenizers.jl/blob/master/src/words/fast.jl
I don't think this will gain anything from the more complex tokenisation infra, given how simple it is. But you definitely should avoid *
for string concatenation. Instead have a look at IOBuffer
.
@oxinabox I have pushed the asked changes :)
Cool. the IOBuffer
is both clearer and faster.
I will review more closely once tests and docs are in.
A note on format (which I just noticed now). Prefered spacing is:
foo(a, b)
, not foo(a,b)
, always space after a comma in a function call1 + 1
, not 1+1
, basically space around all operatorsfoo(a, b; bar=1)
not foo(a, b; bar = 1
)(at some point I will have a style guide linked in the repo.)
Can we have more tests?
In particular some with:
I suspect multibyte unicode characters may be a problem as not all positions in a string can be indexed to
Take a read of https://docs.julialang.org/en/v1/manual/strings/#Unicode-and-UTF-8-1 and https://docs.julialang.org/en/v1/base/strings/#Base.thisind
Merging #14 into master will increase coverage by
4.57%
. The diff coverage is100%
.
@@ Coverage Diff @@
## master #14 +/- ##
=========================================
+ Coverage 75.92% 80.5% +4.57%
=========================================
Files 6 7 +1
Lines 162 200 +38
=========================================
+ Hits 123 161 +38
Misses 39 39
Impacted Files | Coverage Δ | |
---|---|---|
src/reversible_tokenize.jl | 100% <100%> (ø) |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 41d4f1b...62e902d. Read the comment docs.
@oxinabox Incorporated asked additions :)
Looking pretty good,
I think it just needs these few minor things and adding to the readme.
LGTM, thanks!
Shouldn’t the function name be ...tokenize
like the others, not ...tokenizer
?
Bother, missed that. Yes
A simple method for a reversible tokenizer.
Closes: https://github.com/JuliaText/WordTokenizers.jl/issues/12 Closes: https://github.com/JuliaText/TextAnalysis.jl/issues/107