Insanely slow performance on text with no whitespace

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Modify the following test case to match the paths on your system. (StringUtils 
comes from Commons Lang, in case you're not using it; I used it to keep the 
test short.)

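    // Imports assumed for this snippet (StringUtils from Commons Lang 2.x;
    // use org.apache.commons.lang3 instead for Lang 3).
    import java.io.File;
    import org.apache.commons.lang.StringUtils;
    import org.junit.Test;
    import com.cybozu.labs.langdetect.Detector;
    import com.cybozu.labs.langdetect.DetectorFactory;

    @Test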
    public void test() throws Exception {
        // Generate two strings which are exactly the same length (thus we should expect similar performance.)
        int testSize = 8*1024;
        String repetitiveEnglish = StringUtils.repeat("I see what you did there, dude. ", testSize/32);
        String noSpaces = StringUtils.repeat("abcdefgh", testSize/8);

        DetectorFactory.loadProfile(new File("dependencies/langdetect/profiles"));

        detect(repetitiveEnglish, 10);
        detect(noSpaces, 10);
        detect(repetitiveEnglish, 10);
        detect(noSpaces, 10);
    }

    private void detect(String text, int runs) throws Exception
    {
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < runs; i++)
        {
            Detector detector = DetectorFactory.create();
            detector.append(text);
            String result = detector.detect();
        }
        long t1 = System.currentTimeMillis();
        System.out.println(String.format("%s ms", t1-t0));
    }

2. Run and watch the output.

I get the following:

219 ms
9066 ms
36 ms
8999 ms

So whereas normal English-like text gets down to about 3.6 ms per detection, 
detecting the language of text with no whitespace costs nearly a second per call. 
When you're trying to identify millions of documents, that adds up pretty fast.

Original issue reported on code.google.com by dan...@nuix.com on 17 Oct 2011 at 5:55

GoogleCodeExporter commented 9 years ago

I suspect the cause is not the lack of spaces but that the text is in an 
'unknown language'. langdetect tends to stop the detection process early once 
its language probabilities converge, so easy text is detected sooner than hard text.
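
To make the early-stop idea concrete, here is a toy sketch. It is not langdetect's actual code: the class name, the threshold, the iteration limit and the per-round likelihoods are all made up for illustration. It only shows why input that clearly favours one language finishes in a few rounds while ambiguous input runs until the limit.

```java
class EarlyStopSketch {
    // Hypothetical values, chosen only for this illustration.
    static final double CONVERGENCE_THRESHOLD = 0.99999;
    static final int ITERATION_LIMIT = 1000;

    /**
     * Repeatedly multiplies the current probabilities by the same per-round
     * likelihoods, renormalizes, and stops as soon as one candidate exceeds
     * the threshold. Returns the number of rounds that were needed.
     */
    static int roundsUntilConverged(double[] prior, double[] likelihoodPerRound) {
        double[] prob = prior.clone();
        for (int round = 1; round <= ITERATION_LIMIT; round++) {
            double sum = 0.0;
            for (int k = 0; k < prob.length; k++) {
                prob[k] *= likelihoodPerRound[k];
                sum += prob[k];
            }
            double max = 0.0;
            for (int k = 0; k < prob.length; k++) {
                prob[k] /= sum; // renormalize so the values sum to 1
                max = Math.max(max, prob[k]);
            }
            if (max > CONVERGENCE_THRESHOLD) {
                return round; // converged: stop early
            }
        }
        return ITERATION_LIMIT; // never converged: paid for every round
    }

    public static void main(String[] args) {
        double[] prior = {0.5, 0.5};
        // Evidence strongly favours the first candidate.
        System.out.println("clear text rounds:     "
                + roundsUntilConverged(prior, new double[] {0.9, 0.1}));
        // Evidence favours neither candidate.
        System.out.println("ambiguous text rounds: "
                + roundsUntilConverged(prior, new double[] {0.5, 0.5}));
    }
}
```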

Original comment by nakatani.shuyo on 17 Oct 2011 at 8:28

GoogleCodeExporter commented 9 years ago

I changed the mail regex a bit and it improved numbers a bit:

    //private static final Pattern MAIL_REGEX = Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");
    private static final Pattern MAIL_REGEX = Pattern.compile("[-_.0-9A-Za-z]{1,64}@([-_0-9A-Za-z]){1,63}(.([-_.0-9A-Za-z]{1,63}))");

New timings:

36 ms
154 ms
19 ms
140 ms
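
For anyone who wants to see the effect in isolation, here is a small standalone sketch that runs both patterns quoted above over the same no-whitespace input. The class name and the `strip`, `repeat` and `timeMillis` helpers are hypothetical, and replacing matches with a space is only my assumption about how such a pattern might be applied. My understanding is that, in a backtracking engine like `java.util.regex`, the unbounded `+` consumes the whole run of matching characters from every start position, fails to find an `@`, and then backs off one character at a time, which is quadratic in the length of the run. The `{1,64}` bound caps that work at 64 characters per start position.

```java
import java.util.regex.Pattern;

class MailRegexBacktrackDemo {
    // Both patterns are copied verbatim from the comment above.
    // Note: the '.' before the last group in BOUNDED is unescaped,
    // so it matches any character, not only a literal dot.
    static final Pattern UNBOUNDED = Pattern.compile(
            "[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");
    static final Pattern BOUNDED = Pattern.compile(
            "[-_.0-9A-Za-z]{1,64}@([-_0-9A-Za-z]){1,63}(.([-_.0-9A-Za-z]{1,63}))");

    // Hypothetical helper: blank out every match of the pattern.
    static String strip(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll(" ");
    }

    // Plain replacement for StringUtils.repeat, to avoid the dependency.
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder(unit.length() * times);
        for (int i = 0; i < times; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    // Wall-clock time of one strip() call, in milliseconds.
    static long timeMillis(Pattern pattern, String text) {
        long t0 = System.nanoTime();
        strip(pattern, text);
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        // Same shape as the test case: 8 KiB with no whitespace and no '@'.
        String noSpaces = repeat("abcdefgh", 1024);
        System.out.println("unbounded: " + timeMillis(UNBOUNDED, noSpaces) + " ms");
        System.out.println("bounded:   " + timeMillis(BOUNDED, noSpaces) + " ms");
    }
}
```

Timings will vary by machine and JVM warm-up, so treat the printed numbers as a rough comparison only.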

Original comment by dan...@nuix.com on 17 Oct 2011 at 9:41

GoogleCodeExporter commented 9 years ago

Wow, I verified that your code improves things a lot! (Honestly, I couldn't 
believe it...)
I'll adopt your proposed change.
Thanks very much!

Original comment by nakatani.shuyo on 18 Oct 2011 at 3:57

GoogleCodeExporter commented 9 years ago

    //private static final Pattern URL_REGEX = Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]{1,2076}");
    private static final Pattern URL_REGEX = Pattern.compile("https?://[-_.,?&~;+=/#0-9A-Za-z]{1,2076}");

The URL regex should also allow a comma.
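
A quick way to see the difference between the two patterns above, using a made-up URL that contains a comma in its query string. The class name, the `firstMatch` helper and the sample text are hypothetical; only the two regexes come from this comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlCommaDemo {
    // Both patterns are copied from the comment above.
    static final Pattern WITHOUT_COMMA = Pattern.compile(
            "https?://[-_.?&~;+=/#0-9A-Za-z]{1,2076}");
    static final Pattern WITH_COMMA = Pattern.compile(
            "https?://[-_.,?&~;+=/#0-9A-Za-z]{1,2076}");

    // Hypothetical helper: returns the first match, or null if there is none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "see http://example.com/map?ll=1,2 for details";
        // Stops at the comma: http://example.com/map?ll=1
        System.out.println(firstMatch(WITHOUT_COMMA, text));
        // Covers the whole URL: http://example.com/map?ll=1,2
        System.out.println(firstMatch(WITH_COMMA, text));
    }
}
```

Without the comma in the character class, the match ends early and the tail of the URL (`,2` here) is left behind in the text.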

Original comment by markowsk...@gmail.com on 18 Oct 2011 at 2:00

GoogleCodeExporter commented 9 years ago

Yeah, I didn't want to turn this into a debate about which characters are valid 
in email addresses, because actually there are *quite a few* more than what is 
mentioned here.

Original comment by dan...@nuix.com on 19 Oct 2011 at 1:28

GoogleCodeExporter commented 9 years ago

You started the discussion with comment 2 ;) 
All valid URI characters are listed here: http://www.ietf.org/rfc/rfc3986.txt

Original comment by markowsk...@gmail.com on 19 Oct 2011 at 6:29

GoogleCodeExporter commented 9 years ago

I created Issue 27 to track the comment about URL matching, since this ticket 
is about the performance issue, not correctness.

Original comment by trejkaz on 19 Oct 2011 at 9:50

GoogleCodeExporter commented 9 years ago

Can we get a new release out?  This is a very important fix for us, and I 
think there have been many other important fixes since the last release.  
Thanks!

Original comment by david.si...@gmail.com on 20 Oct 2011 at 3:24

GoogleCodeExporter commented 9 years ago

I haven't made a release yet, but I have committed the modified source and the jar file.

http://code.google.com/p/language-detection/source/browse/
http://code.google.com/p/language-detection/source/browse/#svn%2Ftrunk%2Flib

Original comment by nakatani.shuyo on 20 Oct 2011 at 5:40

RangerWolf / language-detection

Insanely slow performance on text with no whitespace #26