dimroc / etl-language-comparison

Count the number of times certain words were said in a particular neighborhood. Performed as a basic MapReduce job against 25M tweets. Implemented with different programming languages as a educational exercise.
http://blog.dimroc.com/2015/11/14/etl-language-showdown-pt3/
187 stars 33 forks source link

Idiomatic Go code #1

Closed egonelbre closed 9 years ago

egonelbre commented 9 years ago

Here's how you would implement this properly in Go.

egonelbre commented 9 years ago

PS. I didn't test the performance, because I was too lazy to download the tweet dataset.

dimroc commented 9 years ago

Thanks for the suggestions! Code looks great but it seems to be skipping over a step. It foregoes writing the mapped results to a file and does it in memory. This makes the solution inconsistent with the others ones. Is this right?

egonelbre commented 9 years ago

Yes, it skips that step intentionally... because otherwise it wouldn't be an idiomatic solution; it doesn't make sense to write those files in any language. It just doesn't make sense to impose random constraints on a solution when it solves the problem. If for some peculiar reason there is need for such files... it would do the counting and writing the files simultaneously (i.e. read input, count and write files).

dimroc commented 9 years ago

You're making the false assumption that the reducers will be running on the same box as the mappers, thus eliminating the need for mappings to be written to a file. The common way to handle large scale map reduce across all systems and languages is to write to disk, such as hdfs, and bring the code to the data rather than vice versa. If we have hundreds of thousands of input files, we'll break down.

My little test does run on just one computer, but the underlying constraints mimic ones that happen in the real world when spanning dynos on heroku, instances on EC2, or instances on mesosphere. It has nothing to do with what's idiomatic for a particular language.

I definitely appreciate the cleaner Go code and will learn from it, but there's definitely a space for mapping to a file and then reducing from a collection of files in any language. I'll have to revert the commit unless we map to a file :-1: . Even if you disagree with the above points, the benchmark is invalidated by skipping the IO which all the other implementations are doing.

egonelbre commented 9 years ago

The common way to handle large scale map reduce across...

I guess, my problem was that I didn't see this as a large problem. Also I wouldn't write "0", "1" entries, I would do such reductions immediately and write that to the disk.

I already deleted my repo, but the code would look something like:

// queues
filenames := make(chan string, *procs)
filtered := make(chan string, *procs)
results := make(chan map[string]int, *procs)

// find files
go func() {
    filelist, err := ioutil.ReadDir(*input)
    check(err)
    for _, file := range filelist {
        filenames <- filepath.Join(*input, file.Name())
    }
    close(filenames)
}()

// parse and filter
Spawn(*procs, func() {
    for filename := range filenames {
        file, err := os.Open(filename)
        outfilename := filename + ".out"
        out, err := os.Create(outfilename)
        check(err)

        scanner := bufio.NewScanner(file)
        for scanner.Scan() {
            record := strings.Split(scanner.Text(), "\t")
            // id, hood, borough, message
            hood, message := record[1], record[3]
            if reQuery.MatchString(message) {
                out.WriteString(hood + "\t1\n")
            } else {
                out.WriteString(hood + "\t0\n")
            }
        }
        file.Close()
        out.Close()
        filtered <- outfilename
    }
}, func() { close(filtered) })

// reduce
Spawn(*procs, func() {
    count := make(map[string]int)
    for filename := range filtered {
        file, err := os.Open(filename)
        check(err)

        rd := bufio.NewReader(file)
        for {
            line, _, err := rd.ReadLine()
            if err == io.EOF {
                break
            }
            check(err)
            record := strings.Split(string(line), "\t")
            // hood, matches
            hood, matches := record[0], record[1]
            if matches == "1" {
                count[hood]++
            }
        }

        file.Close()
    }
    results <- count
}, func() { close(results) })

Probably in the first span there could be additional "bufio.Writer", I can't remember whether it already did that. Also, I'm not sure whether bufio.Reader or bufio.Scanner was faster.

dimroc commented 9 years ago

Just got the chance to take a look at all this and the code in your comment mostly worked. Thanks again for the implementation. I found the Spawn helper method very interesting and will write up on it.

dimroc commented 8 years ago

Hi @egonelbre, Would you be up to write a quick README.md in the golang/ folder on the implementation details? A lot of it is covered in this discussion.

egonelbre commented 8 years ago

I'm not sure what parts of it? It's pretty straight-forward, except the Spawn func.

dimroc commented 8 years ago

Talking about what makes it idiomatic and the Spawn func would be great.

On Mon, Nov 16, 2015 at 11:24 AM, Egon Elbre notifications@github.com wrote:

I'm not sure what parts of it? It's pretty straight-forward, except the Spawn func.

— Reply to this email directly or view it on GitHub https://github.com/dimroc/etl-language-comparison/pull/1#issuecomment-157087254 .

egonelbre commented 8 years ago

Idiomatic usually means, similar to stdlib. The best way I can describe it is, avoid abstractions, unless proven to be necessary, and avoid abstractions that hide performance, until proven to be necessary. tl;dr; idiomatic == very concrete code.

The Spawn func creates N goroutines and uses atomic-s to also run finishing funcs after all goroutines have terminated. This requires some knowledge how atomics work.

There really isn't much more to explain than already exists in the code.

dimroc commented 8 years ago

There is also the dramatic performance difference between regex and substring check. This discussion alone is enough to drop in the readme, but if you don't feel it's worth the effort you can leave it as is.

On Mon, Nov 16, 2015 at 11:33 AM, Egon Elbre notifications@github.com wrote:

Idiomatic usually means, similar to stdlib. The best way I can describe it is, avoid abstractions, unless proven to be necessary, and avoid abstractions that hide performance, until proven to be necessary. tl;dr; idiomatic == very concrete code.

The Spawn func creates N goroutines and uses atomic-s to also run finishing funcs after all goroutines have terminated. This requires some knowledge how atomics work.

There really isn't much more to explain than already exists in the code.

— Reply to this email directly or view it on GitHub https://github.com/dimroc/etl-language-comparison/pull/1#issuecomment-157089909 .

egonelbre commented 8 years ago

I'm not sure what's going on there... I haven't monitored the perf. I can take a quick look tomorrow. Although, the code hasn't been written as high-performance. It's written to be nice and readable.

As why the regexp isn't as fast, I can only guess ATM:

  1. it's hitting some corner case, due to the (?i)
  2. It's due to golang regexp-s linear-time guarantee https://swtch.com/~rsc/regexp/regexp1.html, https://golang.org/pkg/regexp/. It makes some trade-offs with regards to worst-case scenario.
  3. (?i) is not doing the same as strings.ToLower check, e.g. it might have more extensive unicode handling. (although very unlikely)
egonelbre commented 8 years ago

I created https://github.com/golang/go/issues/13288 for this. It is doing some interesting case-folding, and I'm not sure why.

dimroc commented 8 years ago

This is fantastic. It would be great to get a PR into golang as a result of your analysis. I'll keep tabs on it.

Still don't think we have material for a README? :smiley: Down the road, people won't know to read this thread.

On Tue, Nov 17, 2015 at 4:54 AM, Egon Elbre notifications@github.com wrote:

I created golang/go#13288 https://github.com/golang/go/issues/13288 for this. It is doing some interesting case-folding, and I'm not sure why.

— Reply to this email directly or view it on GitHub https://github.com/dimroc/etl-language-comparison/pull/1#issuecomment-157323722 .

egonelbre commented 8 years ago

For the README, probably not, for a blog-post, maybe. For me it's just another "technical details of optimizing". However, the issue might be worth mentioning in the benchmark, i.e. "regexp is slow in Go 1.5 with case insensitive matches, it's being addressed in issue https://golang.org/issue/13288".