maia closed this issue 8 years ago
Thanks! I really appreciate it. My only concern is that this seems to slow things down. This spec goes from 25s to 28.5s on my computer: https://github.com/diasks2/pragmatic_tokenizer/blob/master/spec/performance_spec.rb#L35-L47
Readability is important, but in my opinion performance is possibly even more important for this gem, as I think a lot of use cases will involve processing large amounts of text. So I have some concern about changes that don't alter the underlying functionality but slow things down.
What are your thoughts?
I'm unsure what the reason for this is: either it's the procs (possible, but I doubt it), or the change of various regexes. E.g. I went from:

```ruby
flat_map { |t| (t[0] == '‚' || t[0] == ',') && t.length > 1 ? t.split(/(,|‚)/).flatten : t }
```

to:

```ruby
flat_map { |t| t.split(/^(,|‚)+/) }
```
While I believe the latter is the better solution, it increases the need for optimized regexes. E.g. quite a few methods would not even need to run if the token length is 1, and checking the string length might be much faster than running a regex.
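Such a length guard could be sketched roughly like this (the method name and regex constant are illustrative, not the gem's actual code):

```ruby
# Illustrative sketch: bail out early on single-character tokens so the
# (more expensive) regex split never runs for them.
LEADING_COMMA = /^(,|‚)+/

def split_leading_commas(tokens)
  tokens.flat_map do |t|
    next t if t.length == 1   # String#length is far cheaper than a regex match
    t.split(LEADING_COMMA)
  end
end
```

Single-character tokens pass through untouched; everything else still goes through the regex.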
What I've not tried yet is using `Regexp#union` to merge multiple regexes into one larger regex and using that for `token.split`. I guess it could help a lot; I'll give it a shot.
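The idea would look roughly like this (a sketch with made-up patterns and names, just to show the shape):

```ruby
# Sketch: merge several anchored patterns into one regex via Regexp.union,
# then split each token once instead of once per pattern.
LEADING_PUNCT = Regexp.union(/\A,+/, /\A‚+/, /\A\.+/)

def strip_leading_punct(tokens)
  tokens.flat_map { |t| t.split(LEADING_PUNCT).reject(&:empty?) }
end

strip_leading_punct([",foo", "...bar", "baz"])  # => ["foo", "bar", "baz"]
```

Whether this actually beats several small splits would need to be confirmed with the performance spec.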
But besides the speed issue (which we will tackle in the upcoming days and weeks), I strongly think that such a library should be maintainable. The original code was so complicated that even you might not be able to easily change/fix/add logic a few months from now, and certainly no one else will. I'd therefore suggest keeping the code as understandable as possible, as that will allow other contributors to help.
Thanks, makes sense. Once we get the refactoring done let's focus on improving the speed. Thanks for all your help!
I tried to clean up the naming some: https://github.com/diasks2/pragmatic_tokenizer/commit/06bb017dd231365f97df77622f67dfdc8721f605
Great, that makes it a bit more understandable. I'm still unsure how `post_process`, `method_name3`, `unified1` and `full_stop_separated_tokens` work together; can these be cleaned up or rewritten? Due to the calls to external classes, one has to jump around quite a bit between these methods.
Oh, and did you manage to remove the if/else ternary from `split_emoji` (and move it to the regex constants)? I failed when trying. And if you could clean up `unknown_period1?` and `unknown_period2?`, I'd be grateful. Can some of their regexes be extracted or merged? Does it make sense to extract `domain?`, `email_address?` and `ip_range?` and use these instead? If a check (e.g. for a domain) is used multiple times (it looks as if it is), one could save the returned boolean in an instance variable so the token is only checked once. That should increase speed.
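Memoizing such a per-token check could look something like this (class, method and regex are placeholders, not the gem's actual code):

```ruby
# Sketch: cache a boolean check in an instance variable so the regex
# runs at most once per token, even if the predicate is called repeatedly.
class TokenChecks
  DOMAIN_REGEX = /\A[a-z0-9-]+(\.[a-z0-9-]+)+\z/i  # simplified placeholder

  def initialize(token)
    @token = token
  end

  def domain?
    return @domain unless @domain.nil?    # memoize, including a false result
    @domain = !!(@token =~ DOMAIN_REGEX)
  end
end

check = TokenChecks.new("example.com")
check.domain?  # first call runs the regex
check.domain?  # later calls return the cached boolean
```

Note the `@domain.nil?` guard rather than `@domain ||=`, so a cached `false` is not recomputed.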
PS: if `ABBREVIATIONS` and `STOP_WORDS` were sets instead of arrays, it should be much faster to check against them.
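For membership tests, `Set#include?` is a hash lookup (average O(1)) while `Array#include?` scans linearly. A minimal sketch with sample words (not the gem's actual lists):

```ruby
require 'set'

# Sketch: a Set answers include? via hash lookup instead of a linear scan.
ABBREVIATIONS = Set.new(%w[dr mr mrs prof etc])

ABBREVIATIONS.include?('dr')   # => true
ABBREVIATIONS.include?('cat')  # => false
```

The speedup grows with the size of the word list, so the large abbreviation and stop-word lists should benefit most.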
This is a major refactoring of PostProcessor. Some method names need to be replaced (I'm unsure what they currently do), and others could probably be cleaned up (`method_name18` and `method_name19`). I've also replaced regex checks with direct calls to `.split` (other than `method_name13`, which I couldn't get to work). Finally, instead of chaining `flat_map` it now reduces over procs. This looks scary initially, but one can get used to it easily. Please check whether the two calls of reducing over procs can be merged: the logic of `post_process`, `method_name3`, `full_stop_separated_tokens` and `method_name16` confuses me, and I'm not sure if there's a specific order or if everything could be run at once. All tests pass.
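For anyone unfamiliar with it, the reduce-over-procs pattern described above can be sketched like this (the proc contents are illustrative, not the gem's actual steps):

```ruby
# Sketch: each proc maps a token array to a new token array; reducing
# applies them in order, replacing a long chain of flat_map calls.
DOWNCASE     = ->(tokens) { tokens.map(&:downcase) }
SPLIT_COMMAS = ->(tokens) { tokens.flat_map { |t| t.split(',').reject(&:empty?) } }
PIPELINE     = [DOWNCASE, SPLIT_COMMAS].freeze

def post_process(tokens)
  PIPELINE.reduce(tokens) { |toks, step| step.call(toks) }
end

post_process(["Hello,World", "Foo"])  # => ["hello", "world", "foo"]
```

Adding or reordering a processing step then only means editing the `PIPELINE` array, which is part of what makes the style maintainable once one is used to it.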