Chalarangelo / 30-seconds-of-python

Short Python code snippets for all your development needs
https://www.30secondsofcode.org/python/p/1
Creative Commons Attribution 4.0 International
8.83k stars 1.26k forks source link

[BUG] is_anagram does not ignore punctuation or special characters and is inefficient #190

Closed mx-psi closed 4 years ago

mx-psi commented 4 years ago

The is_anagram snippet does not ignore punctuation or special characters. Furthermore, the running time could be improved using a collections.Counter.

Expected Snippet Behavior

is_anagram("#anagram", "Nag a ram!") should be True and run in linear time

Current Snippet Behavior

is_anagram("#anagram", "Nag a ram!") is False.

In cases where it does not fail it runs in linearithmic time (because of sorting), instead of linear time.

Possible Solution

One may improve the code as follows:

from collections import Counter

def is_anagram(s1, s2):
  return Counter(
    c.lower() for c in s1 if c.isalnum()
  ) == Counter(
    c.lower() for c in s2 if c.isalnum()
  )

The code counts the alphanumeric characters on each string in their lowercase version and checks if the counters match. There is no need for sorting and the intermediate strings need not be created.

I intend to open a PR for fixing this but I am unsure as to what counts as a special character. In the suggested fix, str.isalnum considers characters like U+2460 '①' to be alphanumeric and thus not a special character. Is that the correct interpretation of "special character"?

Chalarangelo commented 4 years ago

I guess ignoring punctuation and spaces counts. Anything weird such as '①' should not be checked against a simple anagram checker to begin with, so I guess it's ok to fail for cases like these. I'd love to see a PR for this sometime soon. :wink: