dimroc / etl-language-comparison

Count the number of times certain words were said in a particular neighborhood. Performed as a basic MapReduce job against 25M tweets. Implemented in different programming languages as an educational exercise.
http://blog.dimroc.com/2015/11/14/etl-language-showdown-pt3/

PHP single-thread implementation #12

Closed pqr closed 9 years ago

pqr commented 9 years ago

Requires PHP 5.4+.

My machine (ssd, 16Gb, i7 haswell) results compared to golang:

golang 1.4 with -strategy=substring: 43,02s user 1,09s system 608% cpu 7,254 total
golang 1.4 with -strategy=regex: 168,96s user 0,75s system 736% cpu 23,039 total
php 5.6: 10,85s user 0,14s system 99% cpu 11,003 total

A single PHP thread performs as well as Go with 6 goroutines!

FractalizeR commented 9 years ago

That's really great ;) What about PHP7? It should be more or less stable for testing such things, I think.

bobrik commented 9 years ago

Just to be clear: stripos is only case-insensitive for ASCII, while the Go version works on Unicode. ToLower is indeed slow in Go.

The Go version can be optimized by using []byte instead of string to make the GC happier, plus:

// toLower lowercases ASCII letters in b in place; all other bytes are left unchanged.
func toLower(b []byte) {
    for i, c := range b {
        if c >= 'A' && c <= 'Z' {
            b[i] += 'a' - 'A'
        }
    }
}
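As a rough sketch of how such a helper might be combined with a byte-level substring search, consider the following. The names `toLowerASCII` and `countMatches` and the sample lines are assumptions for illustration, not the code from the repository; `bytes.Contains` is the standard library function.

```go
package main

import (
	"bytes"
	"fmt"
)

// toLowerASCII lowercases ASCII letters in place; bytes >= 0x80 are untouched,
// so UTF-8 encoded text stays valid.
func toLowerASCII(b []byte) {
	for i, c := range b {
		if c >= 'A' && c <= 'Z' {
			b[i] = c + ('a' - 'A')
		}
	}
}

// countMatches is a hypothetical helper: it counts the lines that contain
// query, ignoring ASCII case. It lowercases each line in place to avoid
// allocating a new string per line.
func countMatches(lines [][]byte, query []byte) int {
	q := append([]byte(nil), query...) // copy so the caller's query is not modified
	toLowerASCII(q)
	n := 0
	for _, line := range lines {
		toLowerASCII(line) // note: mutates the caller's buffer
		if bytes.Contains(line, q) {
			n++
		}
	}
	return n
}

func main() {
	lines := [][]byte{
		[]byte("Go KNICKS!"),
		[]byte("knicks lost again"),
		[]byte("nothing to see here"),
	}
	fmt.Println(countMatches(lines, []byte("Knicks"))) // 2
}
```

Lowercasing in place trades safety for speed: it is only appropriate when the original casing of each line is no longer needed after the search.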

Results (PHP, Go with 1 thread, Go with 2 threads):

root@8e7d09e496e7:/home/bobrik/etl-language-comparison# ./run_php && ./run_php && ./run_php
13.301
12.558
12.506
root@8e7d09e496e7:/home/bobrik/etl-language-comparison# ./run_go && ./run_go && ./run_go
11.265
10.970
10.977
root@8e7d09e496e7:/home/bobrik/etl-language-comparison# ./run_go && ./run_go && ./run_go
5.691
5.645
5.676

It won't be fair, you know:

root@8e7d09e496e7:/home/bobrik/etl-language-comparison# time fgrep -ri knicks tmp/tweets > /dev/null
1.828

dimroc commented 8 years ago

Hi @pqr et al., it would be great to capture details of the implementation in a README.md in the php/ folder. @pqr, @bobrik, are you up to the task? It shouldn't take long.