File f = new File(filename);
ArrayList arr = new ArrayList();
HashMap<String, Integer> listOfWords = new HashMap<String, Integer>();
Scanner in = new Scanner(f);
int i = 0;
while(in.hasNext()) {
String s = in.next();
arr.add(s.replaceAll("[^a-zA-Z]", "").toLowerCase());
}
Iterator itr=arr.iterator();
while(itr.hasNext()) {
i++;
listOfWords.put((String) itr.next(), i);
}
Set<Object> uniqueValues = new HashSet<Object>(listOfWords.values());
This is inefficient in multiple aspects:
The String.replaceAll("[^a-zA-Z]", ...) parses and compiles the RegEx-Patter [^a-zA-Z] over and over again. You should use once a Pattern.compile("[^a-zA-Z]") before the loop and the reuse it within the loop
You first apply the RegEx (to remove non-alpha-chars) and then apply the toLowerCase. It's better to first apply the lowercase and then have a simpler RegEx, namely [^a-z] as you will not find UPPERCASE-Letters any more.
You first collect all words in an ArrayList, then loop over this ArrayList to fill a HashMap (where the unique words are the key) and finally you create a third collection (the HashSet). This all makes not much sense - you can directy fill a one-and-only HashSet in the first loop.
So make this (only the core part here):
HashSet<String> uniqueValues = new HashSet<>();
Pattern pattern = Pattern.compile("[^a-zA-Z]");
while (in.hasNext()) {
String s = in.next();
s = pattern.matcher(s.toLowerCase()).replaceAll("");
uniqueValues.add(s);
}
This is ~3 times faster (after JVM-warm-up) than your implementation - with much less and simpler code
Thank you for your suggestion. Java is not the primary language. I tend to translate the code from another language into Java. I will include your changes in the future release.
You do
This is inefficient in multiple aspects:
String.replaceAll("[^a-zA-Z]", ...)
parses and compiles the RegEx-Patter[^a-zA-Z]
over and over again. You should use once aPattern.compile("[^a-zA-Z]")
before the loop and the reuse it within the loop[^a-z]
as you will not find UPPERCASE-Letters any more.So make this (only the core part here):
This is ~3 times faster (after JVM-warm-up) than your implementation - with much less and simpler code
Note that also have such an optimal implementation for Python: https://github.com/JulesKouatchou/basic_language_comparison/blob/master/Python/test_count_unique_words.py#L29-L31