eecs485staff / madoop

A lightweight MapReduce framework for education
MIT License

Proposal: Add support for custom partitioner #71

Closed noah-weingarden closed 3 months ago

noah-weingarden commented 7 months ago

This PR proposes support for a custom partitioner that emulates Hadoop's Partitioner class. It would allow the Project 5 inverted index to be segmented into exactly num_reducers partitions.

Example usage:

$ madoop \
  -input example/input \
  -output example/output \
  -mapper example/map.py \
  -reducer example/reduce.py  \
  -partitioner example/partition.py \
  -numReduceTasks 2

where example/partition.py contains:

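#!/usr/bin/env -S python3 -u
"""Word count partitioner."""
import sys

# The number of reducers arrives as the first command-line argument.
num_reducers = int(sys.argv[1])

for line in sys.stdin:
    # Split on the first tab only, so values may themselves contain tabs.
    key, _, _ = line.partition("\t")
    # Print one partition number per input line. Compare case-insensitively
    # so that lowercase words are also ordered alphabetically.
    if key[:1].upper() <= "G":
        print(0 % num_reducers)
    else:
        print(1 % num_reducers)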

All lines with a word whose first letter is alphabetically at or before "G" will end up in part-00000, while all lines with a word whose first letter is alphabetically after "G" will end up in part-00001.
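
To make the routing concrete, here is a minimal sketch of how a framework could turn partition numbers into output groups. It is not Madoop's actual implementation: the names `first_letter_partition` and `route_lines` are hypothetical, and the sketch applies the first-letter rule in-process instead of launching the partitioner executable.

```python
"""Sketch: group map output lines by partition number (illustrative only)."""


def first_letter_partition(line, num_reducers):
    """Apply the first-letter rule from the example to a single line."""
    key, _, _ = line.partition("\t")
    return (0 if key[:1].upper() <= "G" else 1) % num_reducers


def route_lines(lines, num_reducers, partition_fn=first_letter_partition):
    """Return num_reducers lists; list i holds the lines for part-0000i."""
    partitions = [[] for _ in range(num_reducers)]
    for line in lines:
        index = partition_fn(line, num_reducers)
        # Reject partition numbers that no reducer could consume.
        if not 0 <= index < num_reducers:
            raise ValueError(f"partition {index} is out of range")
        partitions[index].append(line)
    return partitions


if __name__ == "__main__":
    map_output = ["Bye\t1", "Goodbye\t1", "Hadoop\t1", "Hello\t1", "World\t1"]
    for number, part in enumerate(route_lines(map_output, 2)):
        print(f"part-{number:05d}: {part}")
```

The `% num_reducers` in the rule means that a job run with a single reducer still produces a valid partition number, since every line then maps to partition 0.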

codecov[bot] commented 7 months ago

Codecov Report

Attention: Patch coverage is 97.67442%, with 1 line in your changes missing coverage. Please review.

Project coverage is 96.34%. Comparing base (44326de) to head (9fdb01f).

| Files | Patch % | Lines |
|---|---|---|
| madoop/mapreduce.py | 97.61% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##           develop      #71      +/-   ##
===========================================
+ Coverage    96.31%   96.34%   +0.02%
===========================================
  Files            4        4
  Lines          217      246      +29
===========================================
+ Hits           209      237      +28
- Misses           8        9       +1
```


noah-weingarden commented 7 months ago

I'll add documentation to the README and the streaming tutorial if this gets approval to move forward.

awdeorio commented 4 months ago

Yes, I think this is the way to go to support EECS 485 Project 5.

noah-weingarden commented 3 months ago

> WDYT about including an additional provided partitioner that ships with Madoop and is compatible with Hadoop, like -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner? Would that help with our P5 problem?

We could do that, although it looks like https://github.com/eecs485staff/p5-search-engine/pull/710 already uses this to solve the P5 problem as-is. Let me know if you still want this feature anyway.
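
For comparison, a hash-based partitioner in the spirit of KeyFieldBasedPartitioner could look roughly like the sketch below. This is not Hadoop's implementation and will not reproduce Hadoop's partition assignments: the helper name `key_field_partition`, the `num_fields` parameter, and the use of CRC32 as the hash are all assumptions made for illustration.

```python
"""Sketch of a key-field hash partitioner (not Hadoop's implementation)."""
import zlib


def key_field_partition(line, num_reducers, num_fields=1, sep="\t"):
    """Hash the first num_fields fields of line into a partition number."""
    fields = line.rstrip("\n").split(sep)
    key = sep.join(fields[:num_fields])
    # CRC32 is used only because it is stable across runs. Python's
    # built-in hash() of a str is randomized per process by default.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    for sample in ["hello\tdoc1\t3", "hello\tdoc2\t1", "world\tdoc1\t2"]:
        print(key_field_partition(sample, 3), sample, sep="\t")
```

To use this with the interface proposed in this PR, a script would read the number of reducers from sys.argv[1] and print `key_field_partition(line, num_reducers)` for each line of sys.stdin. Lines that share the selected key fields always land in the same partition, regardless of the remaining fields.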

awdeorio commented 3 months ago

> WDYT about including an additional provided partitioner that ships with Madoop and is compatible with Hadoop, like -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner? Would that help with our P5 problem?
>
> We could do that, although it looks like https://github.com/eecs485staff/p5-search-engine/pull/710 already uses this to solve the P5 problem as-is. Let me know if you still want this feature anyway.

Moving this discussion to https://github.com/eecs485staff/p5-search-engine/issues/714

awdeorio commented 3 months ago

I'm merging to get this going for W24 P5