GoogleCloudPlatform / appengine-mapreduce

A library for running MapReduce jobs on App Engine
https://github.com/GoogleCloudPlatform/appengine-mapreduce/wiki/1-MapReduce
Apache License 2.0

RequestTooLargeError caused mapreduce to abort #85

Open waleedka opened 8 years ago

waleedka commented 8 years ago

We started a mapreduce job to group and copy some values from one table to another. The map and reduce phases are very simple, but the source table has 500 million rows. The pipeline is a simple map/reduce:

    from mapreduce import mapreduce_pipeline
    from pipeline import pipeline

    # Wrapper pipeline (class name illustrative; the original report showed
    # only the run() method).
    class PopulateShortLinkBlocks(pipeline.Pipeline):

      def run(self):
        yield mapreduce_pipeline.MapreducePipeline(
            "populateshortlinkblocks",                       # job name
            "tasks.main.map_populate_short_link_blocks",     # mapper spec
            "tasks.main.reduce_populate_short_link_blocks",  # reducer spec
            "mapreduce.input_readers.DatastoreInputReader",  # input reader spec
            mapper_params={
                "entity_kind": "api.post.Post",
            },
            shards=256)
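
For context, a job wrapped this way is kicked off from application code roughly as follows (a minimal sketch; the wrapper class name above and the queue name here are assumptions, not from the original report):

    # Start the job and hand it to the task queue.
    job = PopulateShortLinkBlocks()
    job.start(queue_name="default")
    # Progress is then visible in the pipeline UI, which in this app is
    # mounted under /mapreduce/pipeline/ per the request log below.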

The map, shuffle, and shuffle-sort phases finished (costing over $500). Then the pipeline aborted for no obvious reason. We're hoping to find a way to resume from where it stopped, because we don't want to start over and incur the same cost again. Below is the error that broke the pipeline. It appears to be a bug in starting the merge phase: the failing datastore_v3.Put() is the one that saves the merge stage's state entity, whose parameters embed the full list of shuffle-sort output file paths (about 2.6 MB per the log), well past the datastore's roughly 1 MB request limit.

E 2016-01-26 12:33:11.656  200      84 B 49.05 s D 12:33:17.613 E 12:33:39.697 W 12:34:00.009 /mapreduce/pipeline/run
  0.1.0.2 - - [26/Jan/2016:12:33:11 -0800] "POST /mapreduce/pipeline/run HTTP/1.1" 200 84 http://live.networkedblogshr.appspot.com/mapreduce/pipeline/run "AppEngine-Google; (+http://code.google.com/appengine)" "live.networkedblogshr.appspot.com" ms=49052 cpu_ms=7914 cpm_usd=9.387e-06 instance=00c61b117ccb1bc29dba9a1b1318d55b1028576e app_engine_release=1.9.31 trace_id=-
    D 12:33:17.613 Running mapreduce.mapper_pipeline.MapperPipeline(*(u'populateshortlinkblocks-shuffle-merge', u'mapreduce.shuffler._merge_map', u'mapreduce.shuffler._MergingReader'), **{'output_writer_spec': u'mapreduce.output_writers._GoogleCloudStorageRecordOutputWriter', 'params': {u'files': [[u'/networkedblogshr.appspot.com/populateshortlinkblocks-shuffle-sort-0/157260387977788B... (2665890 bytes))#582f986f0a1240328bb363a4cec1b3eb
    E 12:33:39.697 Generator mapreduce.mapper_pipeline.MapperPipeline(*(u'populateshortlinkblocks-shuffle-merge', u'mapreduce.shuffler._merge_map', u'mapreduce.shuffler._MergingReader'), **{'output_writer_spec': u'mapreduce.output_writers._GoogleCloudStorageRecordOutputWriter', 'params': {u'files': [[u'/networkedblogshr.appspot.com/populateshortlinkblocks-shuffle-sort-0/157260387977788B... (2665890 bytes))#582f986f0a1240328bb363a4cec1b3eb raised exception. RequestTooLargeError: The request to API call datastore_v3.Put() was too large.
      Traceback (most recent call last):
        File "/base/data/home/apps/s~networkedblogshr/live.390252708758560814/pipeline/pipeline.py", line 2144, in evaluate
          self, pipeline_key, root_pipeline_key, caller_output)
        File "/base/data/home/apps/s~networkedblogshr/live.390252708758560814/pipeline/pipeline.py", line 1110, in _run_internal
          return self.run(*self.args, **self.kwargs)
        File "/base/data/home/apps/s~networkedblogshr/live.390252708758560814/mapreduce/mapper_pipeline.py", line 98, in run
          queue_name=self.queue_name,
        File "/base/data/home/apps/s~networkedblogshr/live.390252708758560814/mapreduce/control.py", line 125, in start_map
          in_xg_transaction=in_xg_transaction)
        File "/base/data/home/apps/s~networkedblogshr/live.390252708758560814/mapreduce/handlers.py", line 1761, in _start_map
          _txn()
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/datastore.py", line 2732, in inner_wrapper
          return RunInTransactionOptions(options, func, *args, **kwds)
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/datastore.py", line 2630, in RunInTransactionOptions
          ok, result = _DoOneTry(function, args, kwargs)
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/datastore.py", line 2650, in _DoOneTry
          result = function(*args, **kwargs)
        File "/base/data/home/apps/s~networkedblogshr/live.390252708758560814/mapreduce/handlers.py", line 1758, in _txn
          cls._create_and_save_state(mapreduce_spec, _app)
        File "/base/data/home/apps/s~networkedblogshr/live.390252708758560814/mapreduce/handlers.py", line 1785, in _create_and_save_state
          state.put(config=config)
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 1077, in put
          return datastore.Put(self._entity, **kwargs)
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/datastore.py", line 605, in Put
          return PutAsync(entities, **kwargs).get_result()
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 613, in get_result
          return self.__get_result_hook(self)
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1881, in __put_hook
          self.check_rpc_success(rpc)
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1371, in check_rpc_success
          rpc.check_success()
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 579, in check_success
          self.__rpc.CheckSuccess()
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 134, in CheckSuccess
          raise self.exception
      RequestTooLargeError: The request to API call datastore_v3.Put() was too large.
    W 12:34:00.009 Giving up on pipeline ID "582f986f0a1240328bb363a4cec1b3eb" after 3 attempt(s); causing abort all the way to the root pipeline ID "334b2db5ec964b8c98b33bd29a210660"
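
For reference, the call the merge stage was attempting (reconstructed from the log lines above; the actual file list is elided) is roughly equivalent to the sketch below. It is the params["files"] list, naming every shuffle-sort output file and serialized into the job-state entity by _create_and_save_state(), that pushes the datastore_v3.Put() over the size limit:

    from mapreduce import mapper_pipeline

    # Stand-in for the full list of shuffle-sort output paths; in the failing
    # run this list serialized to ~2.6 MB inside the state entity.
    sort_output_files = []

    # Mirrors the MapperPipeline call visible in the log above.
    merge_stage = mapper_pipeline.MapperPipeline(
        "populateshortlinkblocks-shuffle-merge",
        "mapreduce.shuffler._merge_map",      # handler spec
        "mapreduce.shuffler._MergingReader",  # input reader spec
        output_writer_spec=(
            "mapreduce.output_writers._GoogleCloudStorageRecordOutputWriter"),
        params={"files": sort_output_files})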

Any suggestions on how to hack it to resume from where it stopped? By the way, the /abort handler also failed with another error, but I'm guessing that's a side effect of this one.
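
One untested direction for resuming (an assumption-laden sketch, not something we've run): if the shuffle-sort output files are still in GCS, their paths could be listed with the cloudstorage library and fed to a manually constructed merge stage like the one reconstructed above, with the library patched locally so the oversized file list is not stored in the datastore entity:

    import cloudstorage as gcs

    # Bucket and prefix taken from the log above; sort shards other than -0
    # would need to be enumerated as well.
    prefix = "/networkedblogshr.appspot.com/populateshortlinkblocks-shuffle-sort-0/"
    sort_output_files = [stat.filename for stat in gcs.listbucket(prefix)]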