Yelp / mrjob

Run MapReduce jobs on Hadoop or Amazon Web Services
http://packages.python.org/mrjob/

Warn when "run()" doesn't appear in mr script #592

Open kevinburke opened 11 years ago

kevinburke commented 11 years ago

Hey, just wanted to share an experience I had trying to run an MRJob from another script.

The main.py script looked something like this:

import bagcheck

def main():
    # I thought I could specify S3 urls in the args list, turns out you can't
    mr_job = bagcheck.Bagcheck(args=['-r', 'emr'])  
    with mr_job.make_runner() as runner:
        runner.run()

if __name__ == "__main__":
    main()

Then the actual job looks something like this:

import os

from mrjob.job import MRJob

class Bagcheck(MRJob):

    def emr_job_runner_kwargs(self):
        return {
            'hadoop_version': '0.20',
            'mr_job_script': 'bagcheck.py',
            'cmdenv': {
                'REALM': os.getenv('REALM')
            },
        }

    def mapper(self, _, line):
        # do some stuff
        pass

    def reducer(self, key, lines):
        # do some stuff
        pass

At first the script just waited forever for input, until (I think) I remembered to echo an S3 URL and pipe it to python.

Then I kept getting a "step description is empty!" message. I tried redefining steps() in the Bagcheck class, but that didn't do anything. Eventually I realized I was missing the

if __name__ == "__main__":
    Bagcheck.run()

lines at the bottom of bagcheck.py.

What's the lesson or the improvement to be made? I'm not sure. I wanted to run the mrjob from another Python script to avoid piping over stdout to a separate script, but it appears MRJob is set up much better for the 'streaming-over-stdout' use case.
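For what it's worth, the pattern the mrjob docs describe for running a job from another script looks roughly like this (just a sketch, assuming mrjob 0.3.x/0.4.x behavior; the S3 path is a placeholder):

import bagcheck

def main():
    # positional args after the switches are treated as input paths,
    # so S3 URLs can go straight into the args list
    mr_job = bagcheck.Bagcheck(args=['-r', 'emr', 's3://my-bucket/input/'])
    results = []
    with mr_job.make_runner() as runner:
        runner.run()
        # read the output back through the runner instead of piping stdout
        for line in runner.stream_output():
            key, value = mr_job.parse_output_line(line)
            results.append((key, value))
    return results

if __name__ == "__main__":
    main()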

It also appears that running an MRJob from a separate script swallows the usual stderr output from mrjob, which is why calling main.py without S3 URLs just waited forever without doing or echoing anything. I'm trying to figure out how to add a verbose flag to the separate script runner now.
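One way to get that output back (a sketch only; '-v' is mrjob's standard verbose switch, and plain Python logging should pick up the runner's messages, though I haven't checked this against every version):

import logging

import bagcheck

def main():
    # hook up Python logging so mrjob's log messages reach stderr
    logging.basicConfig(level=logging.INFO)

    # '-v' asks mrjob itself for verbose output; the S3 path is a placeholder
    mr_job = bagcheck.Bagcheck(args=['-r', 'emr', '-v', 's3://my-bucket/input/'])
    with mr_job.make_runner() as runner:
        runner.run()

if __name__ == "__main__":
    main()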

sirpengi commented 11 years ago

What do you mean by "I thought I could specify S3 urls in the args list, turns out you can't"? I'm working with mrjob 0.3.5 and I've got some extensive tooling written that handles everything within Python code, and my jobs pass upwards of 20 S3 URLs as the input for any one job. Assuming you're not using 0.4dev and it's not a regression there, can you retry and report back with whatever issue you've encountered?

coyotemarin commented 11 years ago

Changing this ticket to checking the mr_*.py script for the string run() and issuing a warning if it's missing, which would have caught your mistake.
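A rough sketch of what that check could look like (a hypothetical helper, not existing mrjob code; the function name and warning text are made up):

import logging

log = logging.getLogger(__name__)

def warn_if_run_missing(script_path):
    # Hypothetical check the runner could perform on the mr_*.py script
    # before submitting the job: warn if the script never calls run().
    try:
        with open(script_path) as f:
            source = f.read()
    except IOError:
        return  # can't read the script, so there's nothing to check

    if 'run()' not in source:
        log.warning(
            '%s does not appear to call run(); did you forget to add\n'
            '    if __name__ == "__main__":\n'
            '        YourJob.run()\n'
            'at the bottom of the script?', script_path)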