JulesKouatchou / basic_language_comparison

Other
86 stars 22 forks source link

Java-Implementation for `testCountUniqueWords` is highly inefficient #15

Open oli-h opened 3 years ago

oli-h commented 3 years ago

You do

File f = new File(filename);
ArrayList arr = new ArrayList();
HashMap<String, Integer> listOfWords = new HashMap<String, Integer>(); 
Scanner in = new Scanner(f);
int i = 0;
while(in.hasNext()) {
  String s = in.next();
  arr.add(s.replaceAll("[^a-zA-Z]", "").toLowerCase());
}
Iterator itr=arr.iterator();
while(itr.hasNext()) {
  i++;
  listOfWords.put((String) itr.next(), i);
}
Set<Object> uniqueValues = new HashSet<Object>(listOfWords.values()); 

This is inefficient in multiple aspects:

  1. The String.replaceAll("[^a-zA-Z]", ...) parses and compiles the RegEx-Patter [^a-zA-Z] over and over again. You should use once a Pattern.compile("[^a-zA-Z]") before the loop and the reuse it within the loop
  2. You first apply the RegEx (to remove non-alpha-chars) and then apply the toLowerCase. It's better to first apply the lowercase and then have a simpler RegEx, namely [^a-z] as you will not find UPPERCASE-Letters any more.
  3. You first collect all words in an ArrayList, then loop over this ArrayList to fill a HashMap (where the unique words are the key) and finally you create a third collection (the HashSet). This all makes not much sense - you can directy fill a one-and-only HashSet in the first loop.

So make this (only the core part here):

HashSet<String> uniqueValues = new HashSet<>();
Pattern pattern = Pattern.compile("[^a-zA-Z]");
while (in.hasNext()) {
  String s = in.next();
  s = pattern.matcher(s.toLowerCase()).replaceAll("");
  uniqueValues.add(s);
}

This is ~3 times faster (after JVM-warm-up) than your implementation - with much less and simpler code

Note that also have such an optimal implementation for Python: https://github.com/JulesKouatchou/basic_language_comparison/blob/master/Python/test_count_unique_words.py#L29-L31

JulesKouatchou commented 2 years ago

Thank you for your suggestion. Java is not the primary language. I tend to translate the code from another language into Java. I will include your changes in the future release.