diasks2 / pragmatic_tokenizer

A multilingual tokenizer to split a string into tokens
MIT License

refactored PostProcessor #29

Closed maia closed 8 years ago

maia commented 8 years ago

This is a major refactoring of PostProcessor. Some method names still need to be replaced because I'm unsure what those methods currently do, and others could probably be cleaned up (method_name18 and method_name19). I've also replaced regex checks with direct calls to .split (except for method_name13, which I couldn't get to work). Finally, instead of chaining flat_map calls, it now reduces over procs. This looks scary at first, but one gets used to it easily.
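For illustration, here is a minimal sketch of the reduce-over-procs pattern with placeholder procs; the PR's actual procs and names differ:

```ruby
# Placeholder procs standing in for the real post-processing steps.
downcase_tokens   = ->(tokens) { tokens.map(&:downcase) }
split_on_commas   = ->(tokens) { tokens.flat_map { |t| t.split(',') } }
drop_empty_tokens = ->(tokens) { tokens.reject(&:empty?) }

# Each proc takes the token array and returns the next one;
# reduce threads the array through the whole pipeline.
pipeline = [downcase_tokens, split_on_commas, drop_empty_tokens]
pipeline.reduce(['Hello,World', '']) { |tokens, step| step.call(tokens) }
# => ["hello", "world"]
```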

Please check whether the two reduce-over-procs calls can be merged - the logic of post_process, method_name3, full_stop_separated_tokens and method_name16 confuses me, and I'm not sure whether there's a required order or whether everything could be run at once.

All tests pass.

diasks2 commented 8 years ago

Thanks! I really appreciate it. My only concern is that this seems to slow things down. This spec goes from 25s to 28.5s on my computer: https://github.com/diasks2/pragmatic_tokenizer/blob/master/spec/performance_spec.rb#L35-L47

Readability is important, but in my opinion performance is possibly even more important for this gem, as I think a lot of use cases will involve processing large amounts of text. So I have some concern about changes that don't alter the underlying functionality but slow things down.

What are your thoughts?

maia commented 8 years ago

I'm unsure what the reason is - either it's the procs (possible, but I doubt it), or it's the changes to various regexes. For example, I went from:

flat_map { |t| (t[0] == '‚' || t[0] == ',') && t.length > 1 ? t.split(/(,|‚)/).flatten : t }

to:

flat_map { |t| t.split(/^(,|‚)+/) }

While I believe the latter is the better solution, it increases the need for optimized regexes. For example, quite a few methods would not even need to run if the token length is 1, and checking the string length might be much faster than running a regex.
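As a hedged sketch of that length guard (the regex and the helper name are illustrative, not the gem's actual code):

```ruby
LEADING_COMMAS = /^(,|‚)+/

def split_leading_commas(tokens)
  tokens.flat_map do |t|
    next t if t.length == 1   # cheap length check: skip the regex entirely
    t.split(LEADING_COMMAS)
  end
end

split_leading_commas([',', ',foo', 'bar'])
```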

What I've not tried yet is using Regexp.union to merge multiple regexes into one larger regex and using that for token.split. I suspect it could help a lot, so I'll give it a shot.
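A small sketch of that idea, with made-up patterns rather than the gem's constants:

```ruby
# Regexp.union combines several patterns into one alternation,
# so a token is scanned once instead of once per pattern.
COMMA     = /[,‚]/
SEMICOLON = /;/
COMBINED  = Regexp.union(COMMA, SEMICOLON)

'foo,bar;baz'.split(COMBINED)
# => ["foo", "bar", "baz"]
```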

But besides the speed issue (which we can tackle in the upcoming days and weeks), I strongly think that such a library should be maintainable. The original code was so complicated that even you might not be able to easily change, fix, or add logic a few months from now, and certainly no one else will. I'd therefore suggest keeping the code as understandable as possible, as that will allow other contributors to help.

diasks2 commented 8 years ago

Thanks, that makes sense. Once we get the refactoring done, let's focus on improving the speed. Thanks for all your help!

diasks2 commented 8 years ago

I tried to clean up the naming some: https://github.com/diasks2/pragmatic_tokenizer/commit/06bb017dd231365f97df77622f67dfdc8721f605

maia commented 8 years ago

Great, that makes it a bit more understandable. I'm still unsure how post_process, method_name3, unified1 and full_stop_separated_tokens work together; can these be cleaned up or rewritten? Due to the calls to external classes, one has to jump around quite a bit between these methods.

Oh, and could you manage to remove the if/else ternary from split_emoji (and move it into the regex constants)? I failed when I tried. And if you could clean up unknown_period1? and unknown_period2?, I'd be grateful. Can some of their regexes be extracted or merged? Does it make sense to extract domain?, email_address? and ip_range? and use those instead? If, for example, the domain check is used multiple times (it looks as if it is), one could save the returned boolean in an instance variable so each token is only checked once. That should increase speed.
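Something along these lines - a hypothetical sketch, since DOMAIN_REGEX and the cache are assumptions rather than the gem's code:

```ruby
# Hypothetical per-token memoization: the regex runs at most once per token,
# no matter how often domain? is called.
DOMAIN_REGEX = /\A\S+\.(?:com|net|org)\z/i

def domain?(token)
  @domain_cache ||= {}
  return @domain_cache[token] if @domain_cache.key?(token)
  @domain_cache[token] = !!(token =~ DOMAIN_REGEX)
end
```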

PS: if ABBREVIATIONS and STOP_WORDS were sets instead of arrays, checking against them should be much faster.
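For illustration (placeholder contents, not the gem's actual lists): Set membership is a hash lookup, while Array#include? scans the whole list.

```ruby
require 'set'

ABBREVIATIONS = Set.new(%w[dr mr mrs prof])   # O(1) average lookup
STOP_WORDS    = Set.new(%w[a an the of])

ABBREVIATIONS.include?('dr')   # => true
STOP_WORDS.include?('rocket')  # => false
```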