JuliaText / WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks
Other
96 stars 25 forks source link

add reverse tokenizer #14

Closed aquatiko closed 5 years ago

aquatiko commented 5 years ago

A simple method for a reversible tokenizer.

Closes: https://github.com/JuliaText/WordTokenizers.jl/issues/12 Closes: https://github.com/JuliaText/TextAnalysis.jl/issues/107

aquatiko commented 5 years ago

One error that I was facing is initialisation of MERGESYMBOL with some unicode character (say \leftrightarrow), for now I have initialised it with '~'. But this might cause some unexpected result if the input string contains it. Still thinking how to fix it.

oxinabox commented 5 years ago

One error that I was facing is initialisation of MERGESYMBOL with some unicode character (say \leftrightarrow), for now I have initialised it with '~'. But this might cause some unexpected result if the input string contains it. Still thinking how to fix it.

You use a unicode character that can't appear in text. Like something from the private use area e.g. https://www.fileformat.info/info/unicode/char/e302/index.htm

oxinabox commented 5 years ago

You'll need to rebase this should be a simple rebase. Also how do you feel about restructuring this to use the fast new tokenbuffer that was just merged. https://github.com/JuliaText/WordTokenizers.jl/blob/master/src/words/fast.jl

MikeInnes commented 5 years ago

I don't think this will gain anything from the more complex tokenisation infra, given how simple it is. But you definitely should avoid * for string concatenation. Instead have a look at IOBuffer.

aquatiko commented 5 years ago

@oxinabox I have pushed the asked changes :)

oxinabox commented 5 years ago

Cool. the IOBuffer is both clearer and faster. I will review more closely once tests and docs are in.

A note on format (which I just noticed now). Prefered spacing is:

(at some point I will have a style guide linked in the repo.)

oxinabox commented 5 years ago

Can we have more tests?

In particular some with:

I suspect multibyte unicode characters may be a problem as not all positions in a string can be indexed to

Take a read of https://docs.julialang.org/en/v1/manual/strings/#Unicode-and-UTF-8-1 and https://docs.julialang.org/en/v1/base/strings/#Base.thisind

codecov-io commented 5 years ago

Codecov Report

Merging #14 into master will increase coverage by 4.57%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #14      +/-   ##
=========================================
+ Coverage   75.92%   80.5%   +4.57%     
=========================================
  Files           6       7       +1     
  Lines         162     200      +38     
=========================================
+ Hits          123     161      +38     
  Misses         39      39
Impacted Files Coverage Δ
src/reversible_tokenize.jl 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 41d4f1b...62e902d. Read the comment docs.

aquatiko commented 5 years ago

@oxinabox Incorporated asked additions :)

oxinabox commented 5 years ago

Looking pretty good,

I think it just needs these few minor things and adding to the readme.

oxinabox commented 5 years ago

LGTM, thanks!

jekbradbury commented 5 years ago

Shouldn’t the function name be ...tokenize like the others, not ...tokenizer?

oxinabox commented 5 years ago

Bother, missed that. Yes