NICTA / scoobi

A Scala productivity framework for Hadoop.
http://nicta.github.com/scoobi/

Report histogram of key distributions #297

Closed xelax closed 10 years ago

xelax commented 11 years ago

A common issue with Hadoop jobs is key skew: the key used in the mapper is skewed, leading to a very uneven distribution of records across the reducers. It would be very helpful if scoobi reported how many records were sent to each reducer, as a histogram, at the end of each map/reduce task. This would make it easier to debug job failures and to optimize scoobi programs.
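
To make the skew concrete, here is a small self-contained sketch (not scoobi API; all names are illustrative) that routes keys to reducers using the same formula as Hadoop's default `HashPartitioner` and tallies a per-reducer record histogram:

```scala
// Illustrative only: simulate how a skewed key distribution lands on reducers.
object KeySkewDemo {
  // Same formula as Hadoop's default HashPartitioner
  def partition(key: String, numReducers: Int): Int =
    (key.hashCode & Integer.MAX_VALUE) % numReducers

  def main(args: Array[String]): Unit = {
    val numReducers = 4
    // A skewed dataset: one "hot" key dominates
    val records = Seq.fill(90)("hot-key") ++ (1 to 10).map(i => s"key-$i")

    // Histogram: reducer index -> number of records routed to it
    val histogram: Map[Int, Int] =
      records.groupBy(partition(_, numReducers)).map { case (r, ks) => (r, ks.size) }

    (0 until numReducers).foreach { r =>
      println(s"reducer $r: ${histogram.getOrElse(r, 0)} records")
    }
  }
}
```

Because all 90 records for the hot key hash to a single partition, one reducer processes at least 90% of the data while the others sit nearly idle; that is exactly the pattern the requested histogram would expose.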

A possible extension would be to store the key histogram persistently along with the output of the map/reduce job, for example by creating a .histogram directory containing a separate file for each mapper task, each file holding a set of (reducer index, record count) pairs.
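
A minimal sketch of that persistence format, assuming a tab-separated text file per mapper task (the directory layout and file names are invented here; scoobi does not implement this):

```scala
// Illustrative only: persist a per-mapper histogram of (reducer index, record count)
// pairs into a ".histogram" directory next to the job output.
import java.io.{File, PrintWriter}
import scala.io.Source

object HistogramStore {
  def write(outputDir: File, mapperId: Int, histogram: Map[Int, Long]): File = {
    val dir = new File(outputDir, ".histogram")
    dir.mkdirs()
    val file = new File(dir, f"mapper-$mapperId%05d")
    val out = new PrintWriter(file)
    // One "reducerIndex<TAB>recordCount" line per reducer, sorted by index
    try histogram.toSeq.sortBy(_._1).foreach { case (r, c) => out.println(s"$r\t$c") }
    finally out.close()
    file
  }

  def read(file: File): Map[Int, Long] = {
    val src = Source.fromFile(file)
    try src.getLines().map { line =>
      val Array(r, c) = line.split('\t')
      r.toInt -> c.toLong
    }.toMap
    finally src.close()
  }
}
```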

Such a persistent record could eventually be used to re-optimize a scoobi program along the lines proposed in this paper: http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p707-bruno.pdf

etorreborre commented 10 years ago

Hi Alex,

I'm having a look at this and I'm planning to use counters to report the values. For now I can easily count the number of records sent to each reducer, as in this small test case.
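
A rough sketch of how such counters could be turned into a readable histogram once the job finishes. The counter-name convention ("scoobi.reducer.records.&lt;index&gt;") is invented here for illustration and is not the convention scoobi actually uses:

```scala
// Illustrative only: render per-reducer record counters as a text histogram.
object CounterHistogram {
  def render(counters: Map[String, Long],
             prefix: String = "scoobi.reducer.records."): String = {
    // Keep only the per-reducer counters and recover the reducer index
    val byReducer = counters.collect {
      case (name, count) if name.startsWith(prefix) =>
        name.stripPrefix(prefix).toInt -> count
    }.toSeq.sortBy(_._1)
    // Scale bars to 40 characters; guard against an empty counter set
    val max = byReducer.map(_._2).foldLeft(1L)(_ max _)
    byReducer.map { case (r, c) =>
      val bar = "#" * ((c * 40) / max).toInt
      f"reducer $r%3d | $c%8d $bar"
    }.mkString("\n")
  }
}
```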

When you write "key histogram", what are you after? Do you want the number of values per key?

xelax commented 10 years ago

Essentially I am looking for how many records are sent to each mapper/reducer so that I can diagnose key skew, so your counter-based approach is exactly what is needed as a diagnostic tool. Thank you. For optimization purposes, what is needed is a persistent record of the key distribution associated with a persisted file, but that is for the future.

etorreborre commented 10 years ago

Please try this. As far as persisting the values goes, I think it's better to leave that to you on the client side.