funginstitute / patentprocessor

BSD 2-Clause "Simplified" License

parse changes, moving files around for starcluster bulk #55

Closed laironald closed 11 years ago

laironald commented 11 years ago

parse is the script Gabe provided. parse_2 is the modified script (the plan is to integrate it as parse).

def parse_files(filelist):
    if not filelist:
        return []
    parsed = itertools.imap(extract_xml_strings, filelist)
    return itertools.chain.from_iterable(parsed)

def parse_patents(xmltuples):
    if not xmltuples:
        return []
    return map(parse_patent, xmltuples)

def database_commit(patobjects, commit_frequency=commit_frequency):
    num_objs = len(patobjects)  # compute this once
    for i, patobj in enumerate(patobjects):
        alchemy.add(patobj)
        if commit_frequency and ((i+1) % commit_frequency == 0 or i == num_objs):
            alchemy.commit()
            print " *", (i+1), datetime.datetime.now()
    if not commit_frequency:
        alchemy.commit()

The three functions above are consolidated into a single function in parse_2, shown below. The reason is that the generator-based version runs into memory issues (see the Killed entries when executing against the full dataset).
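A rough way to see the kind of difference involved, assuming the memory growth comes from holding every parsed record at once (in Python 2, the `map` call in `parse_patents` returns a full list, and `database_commit` calls `len()` on it). This is only a sketch: `fake_records` is a hypothetical stand-in for `extract_xml_strings`, and the record size and count are made up.

```python
import tracemalloc


def fake_records(n):
    """Hypothetical stand-in for extract_xml_strings: lazily yields n ~1 KB strings."""
    for i in range(n):
        yield "x" * 1000 + str(i)


def materialize(records):
    # Mirrors Python 2's map(): the whole list exists before anything is used.
    items = list(records)
    return len(items)


def stream(records):
    # Mirrors parse_2: each record is handled and dropped before the next one.
    count = 0
    for _ in records:
        count += 1
    return count


def peak_bytes(consume, n=5000):
    """Peak traced allocation while `consume` works through n fake records."""
    tracemalloc.start()
    consume(fake_records(n))
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


print("materialized peak:", peak_bytes(materialize))
print("streamed peak:    ", peak_bytes(stream))
```

The streamed version only ever holds a record or two, so its peak stays flat as the input grows, while the materialized version grows with the number of records.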

parse_2

def parse_files(filelist):
    if not filelist:
        return []
    for filename in filelist:
        print filename
        for i, xmltuple in enumerate(extract_xml_strings(filename)):
            patobj = parse_patent(xmltuple)
            alchemy.add(patobj)
            if commit_frequency and ((i+1) % commit_frequency == 0):
                alchemy.commit()
                print " *", (i+1), datetime.datetime.now()
        alchemy.commit()

BENCHMARKING

I am processing two datasets: 2012/01/03 and 2013/04/16

parse -- MySQL -- Full dataset
    Mon Aug 19 22:19:50 CDT 2013
    Killed
    Mon Aug 19 22:26:37 CDT 2013

The job runs out of memory and is killed before it finishes. It takes approximately 7 minutes to reach that point.


parse_2 -- MySQL -- Full dataset
    Mon Aug 19 22:46:15 CDT 2013
    ...
    ...
    Mon Aug 19 23:33:19 CDT 2013

The job takes about 47 minutes for the two files, roughly 10,000 records.


parse -- sqlite -- Full dataset
    Mon Aug 19 23:54:03 CDT 2013
    Killed
    Tue Aug 20 00:00:22 CDT 2013

Similar to MySQL, the job does not complete. It is killed a bit sooner, after about 6 minutes.


parse_2 -- sqlite -- Full dataset
    Tue Aug 20 00:03:51 CDT 2013
    ...
    ...
    Tue Aug 20 00:42:18 CDT 2013

The job takes a bit less than 40 minutes for the two files, roughly 10,000 records. This is a slight improvement over MySQL.


parse -- MySQL -- Small dataset
    Tue Aug 20 00:24:59 CDT 2013
    Tue Aug 20 00:25:16 CDT 2013

Takes 17 seconds, but this run has commit issues (the data is not properly committed).
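One possible reading of the commit issue, offered as a guess rather than a confirmed diagnosis: in `database_commit`, `enumerate` yields indices 0 through `num_objs - 1`, so the `i == num_objs` test is never true, and when `commit_frequency` is set the final partial batch is never committed. A minimal sketch of a batch commit that always flushes the remainder is below. `FakeSession` is a hypothetical stand-in for the project's `alchemy` module, and the extra `session` parameter and the default of 1000 are assumptions made for illustration.

```python
class FakeSession(object):
    """Hypothetical stand-in for `alchemy`: only add() and commit() are modeled."""

    def __init__(self):
        self.pending = []
        self.committed = []
        self.commit_calls = 0

    def add(self, obj):
        self.pending.append(obj)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending = []
        self.commit_calls += 1


def database_commit(session, patobjects, commit_frequency=1000):
    """Add objects, committing after every `commit_frequency` adds.

    Accepts any iterable (no len() needed). A final commit picks up whatever
    is left after the last full batch, or everything if commit_frequency is 0.
    """
    pending = 0
    for patobj in patobjects:
        session.add(patobj)
        pending += 1
        if commit_frequency and pending == commit_frequency:
            session.commit()
            pending = 0
    if pending:
        session.commit()


session = FakeSession()
database_commit(session, (n for n in range(25)), commit_frequency=10)
print(session.commit_calls, len(session.committed))
```

With 25 objects and a frequency of 10, this commits after the 10th and 20th objects and once more for the remaining 5.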


parse_2 -- MySQL -- Small dataset
    Tue Aug 20 00:26:25 CDT 2013
    XML/out/ipg120103.xml
    XML/out/ipg130416.xml
    Tue Aug 20 00:26:39 CDT 2013

Takes 14 seconds, with proper committing.


parse -- sqlite -- Small dataset
    Tue Aug 20 00:23:07 CDT 2013
    Tue Aug 20 00:23:17 CDT 2013

The smaller dataset takes 10 seconds. However, it appears the data is not properly committed.


parse_2 -- sqlite -- Small dataset
    Tue Aug 20 00:21:39 CDT 2013
    XML/out/ipg120103.xml
    XML/out/ipg130416.xml
    Tue Aug 20 00:21:51 CDT 2013

The smaller dataset takes 12 seconds.
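For reference, the elapsed times can be recomputed from the start and end stamps in the logs above. This is a small sketch, not part of the PR: the helper name `elapsed_seconds` and the run labels are made up here, and for the two killed runs the figure is the time until the process was killed, not a completion time.

```python
from datetime import datetime


def elapsed_seconds(start, end):
    """Seconds between two stamps like 'Mon Aug 19 22:19:50 CDT 2013'.

    The zone field is dropped before parsing because %Z handling of names
    such as CDT is platform dependent; both stamps are assumed to share a zone.
    """
    def parse(stamp):
        parts = stamp.split()
        del parts[4]  # drop the zone abbreviation, e.g. 'CDT'
        return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")

    return int((parse(end) - parse(start)).total_seconds())


runs = [
    ("parse / MySQL / full", "Mon Aug 19 22:19:50 CDT 2013", "Mon Aug 19 22:26:37 CDT 2013"),
    ("parse_2 / MySQL / full", "Mon Aug 19 22:46:15 CDT 2013", "Mon Aug 19 23:33:19 CDT 2013"),
    ("parse / sqlite / full", "Mon Aug 19 23:54:03 CDT 2013", "Tue Aug 20 00:00:22 CDT 2013"),
    ("parse_2 / sqlite / full", "Tue Aug 20 00:03:51 CDT 2013", "Tue Aug 20 00:42:18 CDT 2013"),
    ("parse / MySQL / small", "Tue Aug 20 00:24:59 CDT 2013", "Tue Aug 20 00:25:16 CDT 2013"),
    ("parse_2 / MySQL / small", "Tue Aug 20 00:26:25 CDT 2013", "Tue Aug 20 00:26:39 CDT 2013"),
    ("parse / sqlite / small", "Tue Aug 20 00:23:07 CDT 2013", "Tue Aug 20 00:23:17 CDT 2013"),
    ("parse_2 / sqlite / small", "Tue Aug 20 00:21:39 CDT 2013", "Tue Aug 20 00:21:51 CDT 2013"),
]

for label, start, end in runs:
    secs = elapsed_seconds(start, end)
    print("%-26s %2dm %02ds" % (label, secs // 60, secs % 60))
```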

laironald commented 11 years ago

I ran git fetch upstream to make sure I captured your claims bit.

gtfierro commented 11 years ago

I really like the optimizations! The code might require a little fiddling to get it to work with my local IPcluster stuff, but if it's doing better memory management, then we should see a speedup in that too.

Also, it looks like the starcluster README isn't finished. Finishing it would help if I or someone else wants to run this stuff on EC2.

laironald commented 11 years ago

Whoops, sorry about the disorganization re: StarCluster. When I have some breathing room (parsing data is unnecessarily intense, haha), I'm going to start organizing that folder and its files. I expect that will be in a few days.

gtfierro commented 11 years ago

Sounds great -- thanks!