DaveAKing / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0

CharMatcher.ASCII.matchesAllOf sometimes is wrong. #1720

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
After a lot of tries, I've realized that if I check the same string in different 
runs, sometimes the method

CharMatcher.ASCII.matchesAllOf(string);

will return the wrong answer. About 97% of the time it returns false, but every 
once in a while it returns true.
I do not know how to reproduce it; it seems to happen at random times.

Original issue reported on code.google.com by eric.itz...@gmail.com on 10 Apr 2014 at 7:34

GoogleCodeExporter commented 9 years ago
Without a test to confirm your claim, this seems to be CANNOT-REPRODUCE. The 
code below passes 100% of the time. Do you have at least a sometimes reproducible 
test? Did you catch the data when the CharMatcher failed and did you confirm 
that the data are indeed all ASCII? How does CharMatcher behave when run on the 
same data again and again?

import static org.junit.Assert.assertTrue;

import org.apache.commons.lang3.RandomStringUtils;
import org.junit.Test;

import com.google.common.base.CharMatcher;

public class CharMatcherFlakyTest {
    private static final CharMatcher ASCII_MATCHER = CharMatcher.ASCII;
    private static final int MAX_ITERATIONS = 1000000;   // a million

    @Test
    public void asciiMatcherProbabilisticTest() {
        for (int i = 0; i < MAX_ITERATIONS; i++) {
            String randomAsciiString = RandomStringUtils.randomAscii(100);
            assertTrue("An ASCII CharMatcher didn't match a random ASCII String \"" + randomAsciiString + "\"",
                    ASCII_MATCHER.matchesAllOf(randomAsciiString));
        }
    }
}

Original comment by JanecekP...@seznam.cz on 10 Apr 2014 at 10:51

GoogleCodeExporter commented 9 years ago
My test ran on the same data each time, and the string wasn't ASCII; it's a 
string in Hebrew. The string comes from multiple servers and several databases, 
so each can mess up the string, but it's all the same process so I don't know 
how likely it is...

Original comment by eric.itz...@gmail.com on 10 Apr 2014 at 11:17

GoogleCodeExporter commented 9 years ago
Marking this as invalid. If you can come up with a reproducible test case, 
please reopen.

Original comment by kak@google.com on 10 Apr 2014 at 1:54

GoogleCodeExporter commented 9 years ago
Interesting. I'll try to scrape some Hebrew pages and try it out to find 
something reproducible.

Original comment by JanecekP...@seznam.cz on 10 Apr 2014 at 2:44

GoogleCodeExporter commented 9 years ago
It's pretty easy to validate that CharMatcher.ASCII matches only ASCII 
characters, since there are only 128 of them. The logic is incredibly simple too. 
It's basically:

'\0' <= c && c <= '\u007F'

So there's really no way for that to sporadically claim that something that 
isn't ASCII is. Seems pretty clear that you're occasionally getting data that 
is in fact all ASCII. You might want to take a look at what the data is when 
that happens. It could even be all ASCII because of something like bytes being 
decoded in the wrong charset, leading to lots of invalid characters being 
replaced by '?'.
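
To make the '?' scenario concrete, here is a minimal sketch in plain Java. It is an illustration under assumptions, not Guava's source: isAllAscii is a hand-rolled stand-in for the range check above, and lossyRoundTrip is a hypothetical helper that imitates one lossy hop (for example, a database column that cannot store Hebrew letters) by encoding to ISO-8859-1 and decoding again.

```java
import java.nio.charset.StandardCharsets;

public class AsciiRangeSketch {
    // Hand-rolled stand-in for the range check quoted above; not Guava's actual source.
    public static boolean isAllAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!('\0' <= c && c <= '\u007F')) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical lossy hop: encode to a charset without Hebrew letters, then decode.
    // String.getBytes(Charset) substitutes '?' for every character it cannot map.
    public static String lossyRoundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String hebrew = "\u05E9\u05DC\u05D5\u05DD"; // four Hebrew letters ("shalom")
        String mangled = lossyRoundTrip(hebrew);

        System.out.println(isAllAscii(hebrew));  // false
        System.out.println(mangled);             // ????
        System.out.println(isAllAscii(mangled)); // true
    }
}
```

If one of the servers or databases on the path does something like this only some of the time, the same logical string would pass the check only on those occasions, which would look random from the caller's side.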

Original comment by cgdecker@google.com on 10 Apr 2014 at 3:44

GoogleCodeExporter commented 9 years ago
Or an occasional empty string in the input...
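
A small sketch of why an empty string would count as a match, in plain Java rather than Guava. Both isAllAscii and describe are hypothetical helpers written for this illustration: an "all characters match" check over zero characters is vacuously true, and logging the length and code points whenever the check passes would show whether the input was empty, all '?', or something else.

```java
public class EmptyInputSketch {
    // Stand-in for an "all characters are ASCII" check; not Guava's implementation.
    public static boolean isAllAscii(String s) {
        // allMatch on an empty stream returns true, so "" passes.
        return s.chars().allMatch(c -> c <= 0x7F);
    }

    // Hypothetical diagnostic: renders the length and each UTF-16 code unit as U+XXXX.
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder("length=" + s.length() + " [");
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        String[] samples = { "", "?", "\u05E9" };
        for (String s : samples) {
            System.out.println(isAllAscii(s) + " " + describe(s));
        }
    }
}
```

Capturing the output of something like describe at the moment the check returns true is one way to get the reproducible data asked for earlier in the thread.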

Original comment by kevinb@google.com on 10 Apr 2014 at 3:55

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:09

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:07