Closed ArneBachmann closed 6 years ago
When looking just at positive samples, I see a speedup of 120-200% (up to factor 2). But when looking at the (many more) negative samples from the test suite (which are really micro-benchmarks on mini datasets), I see slowdowns of similar magnitude, but also a wider range of performance losses (factor 1.5x-10x).
This has nothing to say about real-life speeds though, which are already quite OK, plus --only
and --except
are probably not the most-often used options (?).
In a real larger repository I could see only performance losses, probably due to the creation of more complex data structures. Here's the profiling code:
import random, subprocess

# Micro-benchmark driver: runs `sos diff` 100 times with randomly generated
# --only/--except glob patterns and collects the relative speed figures that
# the instrumented code prints ("======= <factor>" for a win of the new
# strategy, "------ <factor>" for a loss).
poss = []  # speedup factors (new strategy faster)
negs = []  # slowdown factors (new strategy slower)
for i in range(100):
    folder = ""
    folder2 = ""
    # Build random folder paths of 1-3 components, each component being
    # 2-4 random uppercase letters. Note: "%s" % x * n formats first, then
    # repeats the formatted string (left-to-right for same-precedence % and *).
    for p in range(random.randint(3, 9) // 3):
        folder += ("/" if p > 0 else "") + "%s" % chr(random.randint(65, 85)) * random.randint(1, 2) + chr(random.randint(65, 85)) * random.randint(1, 2)
    for p in range(random.randint(1, 3)):
        folder2 += ("/" if p > 0 else "") + "%s" % chr(random.randint(65, 85)) * random.randint(1, 2) + chr(random.randint(65, 85)) * random.randint(1, 2)
    # Random glob file patterns like "?AB*".
    file = "?%s*" % chr(random.randint(65, 85)) * random.randint(1, 2) + chr(random.randint(65, 85)) * random.randint(1, 2)
    file2 = "?%s*" % chr(random.randint(65, 85)) * random.randint(1, 2) + chr(random.randint(65, 85)) * random.randint(1, 2)
    # print('sos diff --only "%s/%s" --except "%s/%s"' % (folder, file, folder2, file2))
    # shell=True is needed here because the command line is assembled as a
    # single quoted string; patterns contain only [A-U?*/] so no injection risk.
    out = subprocess.Popen('sos diff %s%s' % (('--only "%s/%s" ' % (folder, file)) if random.randint(1, 2) == 1 else "", ('--except "%s/%s"' % (folder2, file2)) if random.randint(1, 2) == 1 else ""), bufsize = 1, shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT).communicate()[0].decode("utf-8")
    if "=======" in out:
        poss += [float(o.split(" ")[1]) for o in out.replace("\r", "\n").split("\n") if o.startswith("======")]
        print("+%f" % poss[-1])
    elif "-----" in out:
        negs += [float(o.split(" ")[1]) for o in out.replace("\r", "\n").split("\n") if o.startswith("------")]
        print("-%f" % negs[-1])
if poss:
    print("Pos num/med/mean")
    print(len(poss))
    print(sorted(poss)[len(poss) // 2])  # median (upper median for even counts)
    print(sum(poss) / len(poss))
if negs:
    print("Neg num/med/mean")
    print(len(negs))
    print(sorted(negs)[len(negs) // 2])  # BUGFIX: was len(poss) // 2 — indexed negs with the size of poss
    print(sum(negs) / len(negs))
and here is the supposedly faster fnmatch.filter
code:
# Group the known file paths and the --only/--except patterns by their parent
# folder, then apply fnmatch.filter once per folder instead of fnmatch.fnmatch
# once per path/pattern pair (the intended speedup).
byFolder:Dict[str,List[str]] = collections.defaultdict(list)      # folder -> file names present
onlyByFolder:Dict[str,List[str]] = collections.defaultdict(list)  # folder -> --only file patterns
dontByFolder:Dict[str,List[str]] = collections.defaultdict(list)  # folder -> --except file patterns
for path, pinfo in ((_path, _pinfo) for _path, _pinfo in _.paths.items() if _pinfo is not None):  # skip deleted entries
    byFolder[path[:path.rindex(SLASH)]].append(path[path.rindex(SLASH) + 1:])
for pattern in considerOnly ?? []: onlyByFolder[pattern[:pattern.rindex(SLASH)]].append(pattern[pattern.rindex(SLASH) + 1:])
for pattern in dontConsider ?? []: dontByFolder[pattern[:pattern.rindex(SLASH)]].append(pattern[pattern.rindex(SLASH) + 1:])
for folder, paths in byFolder.items():
    # BUGFIX: seed the union with the empty set — seeding with set(paths) made
    # every path match regardless of the patterns. BUGFIX: test the user inputs
    # (considerOnly/dontConsider) for None — the defaultdicts above are never None.
    pos:Set[str] = set.union(set(), *[fnmatch.filter(paths, pattern) for pattern in onlyByFolder.get(folder, [])]) if considerOnly is not None else set(paths)  # no --only: keep everything (as a set, matching the annotation)
    neg:Set[str] = set.union(set(), *[fnmatch.filter(paths, pattern) for pattern in dontByFolder.get(folder, [])]) if dontConsider is not None else set()  # no --except: exclude nothing
    knownPathz[folder].extend(list(pos - neg))
Some optimizations:
Now we consistently see speedups in real life and only a minority of slowdowns in the test suite.
Full profiling code:
# Benchmark both strategies back to back on the same data, print the relative
# speed factor, and assert that they produce identical results.
_old:float = time.time()  # start timing the OLD strategy (per-path fnmatch.fnmatch)
for path, pinfo in _.paths.items():  # TODO this approach sometimes leaves keys with empty lists?
    # Keep path only if it has contents, matches some --only pattern (if any),
    # and matches no --except pattern; folder part must match exactly, file part by glob.
    if pinfo.size is not None\
        and (considerOnly is None or any(path[:path.rindex(SLASH)] == pattern[:pattern.rindex(SLASH)] and fnmatch.fnmatch(path[path.rindex(SLASH) + 1:], pattern[pattern.rindex(SLASH) + 1:]) for pattern in considerOnly))\
        and (dontConsider is None or not any(path[:path.rindex(SLASH)] == pattern[:pattern.rindex(SLASH)] and fnmatch.fnmatch(path[path.rindex(SLASH) + 1:], pattern[pattern.rindex(SLASH) + 1:]) for pattern in dontConsider)):
        knownPaths[os.path.dirname(path)].append(os.path.basename(path))  # TODO #249 reimplement using fnmatch.filter and set operations for all files per path for speed
_new:float = time.time()
_old = _new - _old  # compute old strategy
# NEW strategy: group paths and patterns by folder, then fnmatch.filter per folder.
byFolder:Dict[str,List[str]] = collections.defaultdict(list)      # folder -> file names present
onlyByFolder:Dict[str,List[str]] = collections.defaultdict(list)  # folder -> --only file patterns
dontByFolder:Dict[str,List[str]] = collections.defaultdict(list)  # folder -> --except file patterns
for path, pinfo in _.paths.items():
    # NOTE(review): old loop filters on pinfo.size is not None, this one on
    # pinfo is not None — confirm these agree, otherwise the assert below fails.
    if pinfo is None: continue  # quicker than generator expression above
    byFolder[path[:path.rindex(SLASH)]].append(path[path.rindex(SLASH) + 1:])
for pattern in considerOnly ?? []: onlyByFolder[pattern[:pattern.rindex(SLASH)]].append(pattern[pattern.rindex(SLASH) + 1:])
for pattern in dontConsider ?? []: dontByFolder[pattern[:pattern.rindex(SLASH)]].append(pattern[pattern.rindex(SLASH) + 1:])
for folder, paths in byFolder.items():
    # Seed unions with set() so that with no matching pattern nothing matches;
    # with no --only at all, everything matches; with no --except, nothing is excluded.
    pos:Set[str] = set.union(set(), *[fnmatch.filter(paths, pattern) for pattern in onlyByFolder.get(folder, [])]) if considerOnly is not None else set(paths)
    neg:Set[str] = set.union(set(), *[fnmatch.filter(paths, pattern) for pattern in dontByFolder.get(folder, [])]) if dontConsider is not None else set()
    knownPathz[folder] = list(pos - neg)
_new = time.time() - _new  # elapsed time of the new strategy
if _new < _old: print("%s %.2f" % ("#" * 80, _old * 100. / _new))  # was faster!
else: print("%s %.2f" % ("-" * 80, _new * 100. / _old))
# Both strategies must yield the same folder -> files mapping (order-insensitive).
assert {k: sorted(v) for k, v in knownPaths.items()} == {k: sorted(v) for k, v in knownPathz.items()}
Now usually 2x faster than before.
Went from 5 to 13 LOC and has more complex data structures now... If we don't see a big performance gain, then we should keep it as is. Will post the code and test results here.