basho / riak-python-client

The Riak client for Python.
Apache License 2.0
320 stars 183 forks source link

Inconsistencies of MapReduce when using http or pbc [JIRA: CLIENTS-131] #394

Open gglanzani opened 9 years ago

gglanzani commented 9 years ago

I have the following Erlang code (very simple, just for illustration)

-module(grid_mr).

-export([yearfun/3]).

yearfun(O, _KeyData, _Arg) ->
  {struct, Map} = mochijson2:decode(riak_object:get_value(O)),
  Year = proplists:get_value(<<"year">>, Map, -1.0),
  Grid = proplists:get_value(<<"grid">>, Map, -1.0),
  case Year > 2006 of
     true -> [{Grid, Year}];
     false -> []
end.

When I run a MR code via CURL, I get

$ curl -XPOST http://172.17.12.21:8089/mapred \
   -H 'Content-Type: application/json'   \
   -d '{"inputs":[["STATS", grid_stats_C7FF2796D6BD1153E847E84277F6B4A1022E29DACA35989004C10FFA92E2A5F0-2008"],
["STATS",  grid_stats_C7FF2796D6BD1153E847E84277F6B4A1022E29DACA35989004C10FFA92E2A5F0-2010"]],"query":[{"map":{"language":"erlang","module":"grid_mr","function":"yearfun"}}]}'

{"C7FF2796D6BD1153E847E84277F6B4A1022E29DACA35989004C10FFA92E2A5F0":2008,"C7FF2796D6BD1153E847E84277F6B4A1022E29DACA35989004C10FFA92E2A5F0":2010}

However when submitting with Python using http

from riak import RiakClient
from riak import RiakMapReduce
riak = RiakClient(protocol='http', host='172.17.12.22', http_port=8089)
bucket = riak.bucket("STATS")
mr = RiakMapReduce(riak)
keys = ["grid_stats_C7FF2796D6BD1153E847E84277F6B4A1022E29DACA35989004C10FFA92E2A5F0-2008", 
        "grid_stats_C7FF2796D6BD1153E847E84277F6B4A1022E29DACA35989004C10FFA92E2A5F0-2010"]
mr.add("STATS", keys)
#mr.search(index="grid_stats", query="industry_id:22 AND customer_segmentation_id:6")
mr.map(['grid_mr', 'yearfun'])
for result in mr.run():
    print "%s" % result 

I get

C7FF2796D6BD1153E847E84277F6B4A1022E29DACA35989004C10FFA92E2A5F0

If, instead, I use pbc

from riak import RiakClient
from riak import RiakMapReduce
riak = RiakClient(protocol='pbc', host='172.17.12.22', http_port=8087)
bucket = riak.bucket("STATS")
mr = RiakMapReduce(riak)
keys = ["grid_stats_C7FF2796D6BD1153E847E84277F6B4A1022E29DACA35989004C10FFA92E2A5F0-2008", 
        "grid_stats_C7FF2796D6BD1153E847E84277F6B4A1022E29DACA35989004C10FFA92E2A5F0-2010"]
mr.add("STATS", keys)
#mr.search(index="grid_stats", query="industry_id:22 AND customer_segmentation_id:6")
mr.map(['grid_mr', 'yearfun'])
for result in mr.run():
    print "%s" % result

the client raises an exception

[...]
.virtualenvs/numpy/lib/python2.7/site-packages/riak/transports/pbc/transport.pyc in mapred(self, inputs, query, timeout)
    410         for phase, content in self.stream_mapred(inputs, query, timeout):
    411             if phase in result:
--> 412                 result[phase] += content
    413             else:
    414                 result[phase] = content
TypeError: unsupported operand type(s) for +=: 'dict' and 'dict'

Am I doing something wrong, or…? Right now changing the Erlang code is the only fix:

-module(binary_grid).

-export([yearfun/3]).

yearfun(O, _KeyData, _Arg) ->
  {struct, Map} = mochijson2:decode(riak_object:get_value(O)),
  Year = proplists:get_value(<<"year">>, Map, -1.0),
  Grid = proplists:get_value(<<"grid">>, Map, -1.0),
  case Year > 2006 of
     true -> [list_to_binary(mochijson2:encode([{Grid, Year}]))];
     false -> []
end.
hazen commented 9 years ago

@gglanzani We'll take a look at this. I assume it's not related to the mochijson2 issue you reported in https://github.com/basho/riak-python-client/issues/395?

gglanzani commented 9 years ago

No, that issue seems to be related with an older version of mochijson2, while this one present itself only when I'm NOT using mochi

hazen commented 9 years ago

@gglanzani Sorry for the delay. Trying to understand your situation. So you are running your own, custom Erlang yearfun on your nodes. You are getting the desired results in curl, but are only getting a partial result with the Python HTTP interface, right? And PBC seems to raise an exception in this case, right?

I'm looking at our current test cases and all of them seem to return a list values, not more complicated types like proplists. I could see that curl is simply blurping back whatever Erlang returns, but the Python client has to serialize and deserialize. As you can see from the PBC results, it's trying to convert it into a Python dict(), which is the source of the exception. I'm interested that the HTTP sort of works. I'm guessing it is unpacking part of the JSON we get back from Riak and throwing the rest on the floor.

So my guess at the moment is the Python client is only supporting returning a list of values. We could look at lists of dicts() as an enhancement.

gglanzani commented 9 years ago

Yes,you gave an accurate description of what (I think) is going on.

I think lists of dicts were fairly common but I see that some extra code is needed to make it work. Btw, curl converts [{grid1, year1}, {grid2 , year2}] to {grid1: year1, grid2: year2}, so some conversion is happening.

Is this difficult to fix?

hazen commented 9 years ago

@gglanzani I'll take a good to see how difficult it would be to add. Probably not too hard. Just have to prioritize it.

gglanzani commented 9 years ago

@javajolt At least it's in jira :stuck_out_tongue_winking_eye: