Closed maxgrenderjones closed 9 years ago
Cool. How long did it take on your computer?
I've got quite a few PRs to go through so I'm all about lightening the load right now :sweat_smile:.
Turns out the biggest issue was the regex - I was trying to do more than simply match a static string. Modified to the below, it takes 13s on a MacBook Pro.
```python
from multiprocessing import Pool, cpu_count
from collections import Counter
import streamutils as su
import re

KNICKS = re.compile('knicks')

def process(f):
    return su.read(fname=f) | su.split(sep='\t') | su.sfilter(lambda x: KNICKS.match(x[3])) | su.smap(lambda x: x[1]) | su.bag()

if __name__ == '__main__':
    bag = Pool(cpu_count()).map(process, su.find('tmp/tweets/tweets_*')) | su.sreduce(lambda x, y: x + y, Counter())
    bag.most_common() | su.smap(lambda x: '%s\t%s\n' % (x[0], x[1])) | su.write('tmp/python_parallelstreamoutput')
```
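A small sketch of the regex point above: the original slow pattern isn't shown in the thread, so the "complex" pattern here is hypothetical, just to illustrate a pattern that does more work than matching a static prefix:

```python
import re

# Hypothetical "doing too much" pattern: scans the whole string for a word.
COMPLEX = re.compile(r'.*\b(knicks)\b.*')
# The fix described above: a pre-compiled static string, anchored at the
# start of the field by re.match.
KNICKS = re.compile('knicks')

field = 'knicks game tonight'
assert COMPLEX.match(field) is not None
assert KNICKS.match(field) is not None
# For a static prefix, a plain string method also works and skips the
# regex machinery entirely:
assert field.startswith('knicks')
```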
For fun I implemented the task using streamutils, a python library I've been working on that makes text processing very quick to write (though it's single-threaded and uses generators, so not particularly fast to run). It makes for pretty short code - read it like it were some `bash` commands chained together. The snippet above is an alternative version, parallelized using multiprocessing.
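The chained-generators idea can be sketched in plain Python without the library. The helper names below are illustrative only, not the streamutils API; each stage consumes the previous one lazily, like processes in a bash pipeline:

```python
from collections import Counter

# Illustrative generator stages (not the streamutils API): each returns a
# lazy generator over the previous stage's output.
def split(lines, sep='\t'):
    return (line.rstrip('\n').split(sep) for line in lines)

def sfilter(pred, rows):
    return (row for row in rows if pred(row))

def smap(fn, rows):
    return (fn(row) for row in rows)

# Hypothetical tab-separated sample rows standing in for the tweet files.
lines = [
    'a\tuser1\tx\tknicks game tonight',
    'b\tuser2\tx\tlakers',
    'c\tuser1\tx\tknicks win',
]
rows = split(lines)
rows = sfilter(lambda x: x[3].startswith('knicks'), rows)
names = smap(lambda x: x[1], rows)
print(Counter(names).most_common())  # → [('user1', 2)]
```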
Not sure if it's worthy of a pull request though - more just to show that python can be terse too :)