Yelp / mrjob

Run MapReduce jobs on Hadoop or Amazon Web Services
http://packages.python.org/mrjob/

ignore unrecognized arguments #2210

Closed dhuy237 closed 3 years ago

dhuy237 commented 3 years ago

Normally, to define a command-line option for mrjob, I have to do this:

from mrjob.job import MRJob

class Calculate(MRJob):
    def configure_args(self):
        super(Calculate, self).configure_args()
        # Register --time so mrjob accepts it on the command line
        self.add_passthru_arg("-t", "--time", help="output folder for time")

To use the argument, I just call self.options.time, but that only works inside the class.

I want to time the mrjob run and write the elapsed time to a JSON file, like this (cal.py):

from datetime import datetime
import json
import argparse

from mrjob.job import MRJob

parser = argparse.ArgumentParser()
parser.add_argument("-t", "--time", help="Output file")
args = parser.parse_args()

class Calculate(MRJob):
    ...

start_time = datetime.now()
Calculate.run()
execute_time = (datetime.now() - start_time).total_seconds()

data = {}
data["step1"] = execute_time
with open(args.time+'/time.json', 'w') as outfile:
    json.dump(data, outfile)

When I run with this command:

python cal.py data/input/input.txt --output data/output --time data/output

I got this error:

usage: cal.py [-h] [-t TIME]
cal.py: error: unrecognized arguments: data/input/input.txt --output data/output

Then I found an answer suggesting parse_known_args(), so I tried it:

args, unknown = parser.parse_known_args()
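
For reference, parse_known_args() returns a pair: a namespace with the options the parser knows, and a list of everything it did not recognize. A minimal sketch, with the arguments from the command above passed as an explicit list (the argv variable is only for illustration):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-t", "--time", help="Output file")

# Same arguments as the command above, passed explicitly instead of via sys.argv
argv = ["data/input/input.txt", "--output", "data/output", "--time", "data/output"]
args, unknown = parser.parse_known_args(argv)

print(args.time)  # the value given to --time
print(unknown)    # everything argparse did not recognize, in original order
```

Note that parse_known_args() does not modify sys.argv, so any other code that reads the command line afterwards still sees --time.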

Now I get a different error. I believe it comes from mrjob, because when I remove the argparse code and run the same command, I get exactly the same message:

usage: cal.py [options] [input files]
cal.py: error: unrecognized arguments: --time data/output

How can I define an argument without affecting the mrjob class?

dhuy237 commented 3 years ago

I found a workaround, but I hope there is a better way of doing this.

I have to define the argument again inside the mrjob class so that mrjob recognizes it:

from datetime import datetime
import json
import argparse

from mrjob.job import MRJob

parser = argparse.ArgumentParser()
parser.add_argument("-t", "--time", help="Output file")
# Use parse_known_args() so argparse ignores the arguments meant for mrjob
args, unknown = parser.parse_known_args()

class Calculate(MRJob):
    def configure_args(self):
        super(Calculate, self).configure_args()
        # Define the argument again so mrjob accepts --time instead of rejecting it
        self.add_passthru_arg("-t", "--time", help="output folder for time")

start_time = datetime.now()
Calculate.run()
execute_time = (datetime.now() - start_time).total_seconds()

data = {}
data["step1"] = execute_time
with open(args.time+'/time.json', 'w') as outfile:
    json.dump(data, outfile)

Then I run it with this command:

python cal.py data/input/input.txt --output data/output --time data/output
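
One possible alternative to declaring the option twice is to remove --time from sys.argv before the job parses the command line. This is only a sketch under assumptions: run_with_private_args and run_job are hypothetical names, and run_job stands in for a callable such as Calculate.run. I have not checked how this behaves with mrjob runners that launch the script again for each task, where the passthru argument above is what forwards the option.

```python
import argparse
import sys


def run_with_private_args(argv, run_job):
    """Parse --time ourselves, then call run_job() with it hidden from sys.argv.

    run_job is a hypothetical stand-in for a callable such as Calculate.run.
    """
    # add_help=False so that -h/--help is left for the job's own parser
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-t", "--time")
    args, remaining = parser.parse_known_args(argv[1:])

    saved_argv = sys.argv
    sys.argv = [argv[0]] + remaining  # what the job's own parser will see
    try:
        run_job()
    finally:
        sys.argv = saved_argv  # restore the original command line
    return args


# Stand-in job that only records the command line it was given
seen = []
args = run_with_private_args(
    ["cal.py", "data/input/input.txt", "--output", "data/output",
     "--time", "data/output"],
    lambda: seen.append(list(sys.argv)),
)
```

In cal.py the call would presumably look like run_with_private_args(sys.argv, Calculate.run), with args.time then used for the JSON path as before.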

My question on Stack Overflow