Yelp / pyleus

Pyleus is a Python framework for developing and launching Storm topologies.
Apache License 2.0

Why do all tasks use the same object to execute process_tuple? #160

Closed meowoodie closed 7 years ago

meowoodie commented 9 years ago

Hi! I have run several experiments with fields grouping and shuffle grouping in the word count example, and I found that no matter which grouping strategy I used, the results were always the same. Then I printed self every time in "process_tuple", and there was only one object throughout the whole run. Why is that? Java behaves quite differently in this kind of situation.

poros commented 9 years ago

It's peculiar that you obtain the same result with both fields grouping and shuffle grouping; to be fair, it's the first time I've heard someone report something like that... The grouping part is handled in Java, using the same APIs as any other regular Storm topology, so it surprises me a lot that you found any difference there... How did you check that the results are the same?

Regarding the object.self part, I'm not sure I understood what you are printing... Could you be more precise, please?

meowoodie commented 9 years ago

In my opinion, a fields grouping should have several tasks executing the same piece of code, with each task processing different keys. So I thought every task processing a different key should be an independent object in Python. Specifically, I write MyBolt's self to a log file inside "process_tuple", and I found that every invocation of "process_tuple" writes down the same object id. Does that make sense? Any guidance would be appreciated.
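To make that setup concrete, here is a minimal sketch of that kind of logging in a pyleus bolt, assuming the SimpleBolt API shown in the pyleus README; the class name and log path are illustrative:

```python
import logging

from pyleus.storm import SimpleBolt

# Log to a file: pyleus components talk to the JVM over stdin/stdout,
# so printing to stdout would corrupt the multilang protocol.
logging.basicConfig(filename='/tmp/my_bolt.log', level=logging.DEBUG)
log = logging.getLogger(__name__)


class MyBolt(SimpleBolt):

    def process_tuple(self, tup):
        # id(self) identifies this instance only within its own process;
        # each Storm task runs in a separate Python process, so equal ids
        # in different logs do not imply a single shared object.
        log.debug("object id: %s, values: %s", id(self), tup.values)


if __name__ == '__main__':
    MyBolt().run()
```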

poros commented 9 years ago

I'm sorry, I'm still a bit confused. Perhaps some code would be helpful?

However, if you are logging self inside process_tuple, you should find in the log a number of different object ids equal to the number of tasks in your topology. Storm won't spawn a new task for each key. In a fields grouping, each key is always routed to the same task, and a single task may serve several keys.
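As a toy illustration of that key-to-task assignment (not Storm or pyleus code; the real hashing happens on the Java side), a fields grouping behaves roughly like hashing the grouping fields modulo the number of tasks:

```python
def pick_task(key, num_tasks=3):
    """Deterministically map a key to a task index."""
    return hash(key) % num_tasks

for word in ["storm", "pyleus", "yelp", "storm", "yelp"]:
    print(word, "-> task", pick_task(word))
# "storm" and "yelp" each land on the same task every time they recur,
# yet a single task can still receive several distinct words.
```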

mzbyszynski commented 9 years ago

@meowoodie could you show us the topology yaml you are using?
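For reference, a pyleus topology for the word count example is typically declared in YAML along these lines (a sketch modeled on the pyleus README; component and module names are placeholders). Note that if parallelism_hint is left at its default of 1, the bolt has a single task, so seeing one object id in the logs would be expected:

```yaml
# Illustrative topology YAML; names are placeholders.
name: word_count

topology:

    - spout:
        name: line-spout
        module: word_count.line_spout

    - bolt:
        name: count-bolt
        module: word_count.count_bolt
        # More than one task is needed to observe different
        # object ids and task ids in the logs.
        parallelism_hint: 3
        groupings:
            # A shuffle grouping would distribute tuples randomly:
            # - shuffle_grouping: line-spout
            # A fields grouping routes each word to a fixed task:
            - fields_grouping:
                component: line-spout
                fields:
                    - word
```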

meowoodie commented 9 years ago

I'm sorry for replying so late. I have found that the problem was probably my own mistake: I just didn't understand how Pyleus distributes the computation. I thought multiple tasks would work on the same job in different places, so each task would live at a different memory address, which means they would be different objects even though they share exactly the same code. In my previous demo, I printed the object address (print self) on every call to process_tuple in my bolts, and the addresses were always the same. Can you explain that? How can multiple tasks run in different places (maybe on several machines) but still hold the same object? Or did I do something wrong?

mzbyszynski commented 9 years ago

@meowoodie you could try logging the task ids: self.context.get('taskid').

Beyond that, we'd have to see your topology.yaml.
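A minimal sketch of that diagnostic, combining the task id from self.context (the suggestion above) with the process id and object id (again assuming the SimpleBolt API; the log path is illustrative):

```python
import logging
import os

from pyleus.storm import SimpleBolt

logging.basicConfig(filename='/tmp/my_bolt.log', level=logging.DEBUG)
log = logging.getLogger(__name__)


class MyBolt(SimpleBolt):

    def process_tuple(self, tup):
        # The task id is unique across the whole topology, while
        # os.getpid() and id(self) are only meaningful per machine:
        # identical addresses in separate processes are a coincidence,
        # not evidence of a single shared object.
        log.debug(
            "taskid: %s, pid: %s, object id: %s",
            self.context.get('taskid'), os.getpid(), id(self),
        )


if __name__ == '__main__':
    MyBolt().run()
```

With parallelism_hint above 1, this log should show several distinct task ids even if some object ids happen to repeat across processes.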