Yelp / mrjob

Run MapReduce jobs on Hadoop or Amazon Web Services
http://packages.python.org/mrjob/

code breaks locally but runs fine remotely on hadoop cluster #2211

Closed my-umd closed 2 years ago

my-umd commented 3 years ago

(I couldn't find anything related after an intensive web search.) I am facing a very confusing issue. I believe it started with mrjob v0.6.8 and persists in the latest version. Here is the sample code:

```python
from mrjob.job import MRJob
from mrjob.step import MRStep
import sys


class MRSp(MRJob):

    def init_mr(self):
        sys.stderr.write('Processing file.\n')

    def mapper(self, _, line):
        yield 1, 1

    def print_header(self):
        header = 'test'
        print(header)

    def steps(self):
        return [
            MRStep(mapper_init=self.init_mr,
                   mapper=self.mapper,
                   reducer_init=self.print_header,
                   ),
        ]


if __name__ == '__main__':
    MRSp.run()
```

To run the code, create a local directory (e.g., test_input) and put a random small text file in it. When run from a (CentOS) Linux shell in a (Python 3.8.9) virtualenv that has mrjob 0.6.7 installed, it runs fine. However, the code crashes with the following exception when run in a virtualenv that has mrjob 0.6.8 or later installed (command line: `python test.py test_input`):

File "test.bytes.py", line 8, in init_mergedrs
sys.stderr.write('Processing file.\n')
TypeError: a bytes-like object is required, not 'str'

If I comment out the `sys.stderr.write` call (replacing it with a `pass` statement), the code still crashes locally, but the offending line is now the `print(header)` statement (same exception).

The code runs fine remotely on a Hadoop cluster, though (with mrjob 0.6.7 as well as 0.6.8 and later). The v0.6.8 changelog doesn't reveal anything that gives a hint. Can anybody help? Thanks. (The issue also happens with Python 3.7 and 3.9.)

robinsonkwame commented 2 years ago

@my-umd did you resolve this? This may be helpful

my-umd commented 2 years ago

Thanks @robinsonkwame. It turned out that I can't use print anymore. I have to use self.stdout.write.
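For anyone landing here later, here is a minimal sketch of what the `TypeError` above suggests is going on. It assumes the stream being written to is a binary one, and it uses `io.BytesIO` as a stand-in so it runs without mrjob. The name `binary_stream` is purely illustrative.

```python
import io

# Stand-in for a binary output stream (an assumption, not mrjob's actual object).
binary_stream = io.BytesIO()

# Writing a str to a binary stream fails with the error from the traceback.
try:
    binary_stream.write('Processing file.\n')
except TypeError as exc:
    error_message = str(exc)

# Encoding the text to bytes first makes the write succeed.
binary_stream.write('Processing file.\n'.encode('utf-8'))
written = binary_stream.getvalue()
```

If `self.stdout` in the job is a byte stream like this, the equivalent inside `print_header` would presumably be `self.stdout.write(b'test\n')`. I have only checked the stand-in above, not the job itself.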