ben-strasser / fast-cpp-csv-parser

fast-cpp-csv-parser
BSD 3-Clause "New" or "Revised" License

Speed up LineReader::next_line #89

Open chris0x44 opened 4 years ago

chris0x44 commented 4 years ago

Measuring with a large, numbers-only CSV, this change showed a ~25% improvement in reading speed.

ben-strasser commented 4 years ago

I generated a test file as follows:

{
        std::ofstream out("testfile");
        for(int i=0; i<100000000; ++i){
                out << i;
                for(int j=0; j<10; ++j)
                        out << ',' << j;
                out << '\n';
        }
}

I use

#include "csv.h"
#include <sys/time.h>
long long get_micro_time(){
    timeval t;
    gettimeofday(&t, 0);
    return t.tv_sec*1000000ll+t.tv_usec;
}

#include <iostream>
#include <fstream>

int main(){
        for(int i=0; i<16; ++i)
        {
                long long timer = -get_micro_time();
                int n=0;
                io::LineReader reader("testfile");
                while(char*c = reader.next_line()){
                        ++n;
                }
                timer += get_micro_time();
                std::cout <<  timer << std::endl;
        }
}

to benchmark. I compile with g++ -O3 main.cpp -std=c++0x -o prog -pthread.
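(As an aside, gettimeofday and sys/time.h are POSIX-only, so the harness above won't build as-is with Visual Studio. A portable drop-in replacement using std::chrono might look like this; a sketch, not part of the PR:)

```cpp
#include <chrono>

// Portable replacement for the gettimeofday()-based helper above;
// std::chrono::steady_clock is monotonic and available on GCC, Clang and MSVC.
long long get_micro_time(){
    using namespace std::chrono;
    return duration_cast<microseconds>(
        steady_clock::now().time_since_epoch()).count();
}
```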

The current version has the following running times:

2227115 2241192 2241685 2271418 2254201 2264049 2268043 2288949 2292267 2324893 2344262 2231287 2182160 2156060 2153017 2145480

The proposed version has the following running times:

2036938 2020005 2028032 2025394 2032634 2123398 2062249 2066723 2037287 2048502 2050186 2131472 2047372 2023591 2014507 2021027

The proposed version seems to be about 10% faster. I'll have a look at the code tomorrow for subtle bugs. If I find none, I'll integrate the PR.

(Some changes are necessary to make it work with GCC 4.8.4. I will therefore not directly merge it as is.)

chris0x44 commented 4 years ago

Interesting result. I was using similar code (with the line count reduced to 50000000) with Visual Studio 2019.

Old implementation

32 bit    64 bit
1403602   1399353
1396644   1385363
1412428   1385887
1385986   1428209
1302536   1476942
1324161   1387329
1330136   1375160
1304215   1376447
1286355   1403022
1278173   1408659

New implementation

32 bit    64 bit
1066176   1009143
1068074   1019310
1036172   1015235
1055584   1008792
1058937   1015928
1053741   1070592
1063439   1145061
1052232   1004978
1055186   1003172
1058393   1001748

Averaged out, that is a ~21% (32 bit) and ~26% (64 bit) improvement. It seems VS was generating less efficient code than GCC.

chris0x44 commented 4 years ago

Just realized that you were compiling with -std=c++0x. Would you rather not rely on full C++11 support, to maintain backwards compatibility? Just asking with future changes in mind.

ben-strasser commented 4 years ago

My policy with respect to old compilers is that if newer compilers provide a significant advantage, then bumping up the required compiler version can be considered. However, so far neither C++14 nor C++17 has provided anything where I'd say that it is worthwhile.

I did some more benchmarking today. First off, all the running times in the low-second range are with respect to a preheated file system cache; the data is not served from disk. When the data actually comes from the hard disk, we are talking double-digit seconds, and all the optimizations done here are negligible.
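(For anyone who wants to measure the cold-cache case reproducibly on Linux, one option is to evict the file's pages with posix_fadvise before each run. A sketch with minimal error handling; not part of the PR:)

```cpp
#include <fcntl.h>
#include <unistd.h>

// Sketch: evict a file's pages from the Linux page cache so the next read
// exercises the disk path instead of the preheated cache.
bool drop_file_cache(const char* path){
    int fd = open(path, O_RDONLY);
    if(fd < 0) return false;
    fsync(fd);  // make sure no dirty pages are left behind
    bool ok = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) == 0;
    close(fd);
    return ok;
}
```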

That being said, optimizing the case where data comes from the file system cache has its uses.

The results of my micro benchmarks are really strange. I'm using the same test setup as before. I get the following running times for the current version:

2118994 2122616 2125639 2123606 2118886 2144089 2126694 2127466 2133072 2135363 2138806 2139697 2133833 2134297 2138764 2161570

This is consistent with yesterday. Now all I do is replace

                        int line_end = data_begin;
                        while(buffer[line_end] != '\n' && line_end != data_end){
                                ++line_end;
                        }

with

                        int line_end = data_begin;
                        while(line_end != data_end && buffer[line_end] != '\n'){
                                ++line_end;
                        }

The only difference is the order of the operands of the &&.

The times that I see are:

1910613 1932081 1925265 1924471 1939898 1938497 1932861 1925658 1927700 1928857 1924995 1933352 1937317 1926166 1933898 1933849

The order of those two operands seems to have a significant impact.

When I rerun the version from this PR, I get:

1921082 1918192 1922095 1918119 1917171 1957642 2027880 2028473 2017751 2017715 2021179 2021563 2016718 2021120 2018545 2022833

What I gather from these numbers is that the best option is to swap the order of the && operands and leave the advanced bit-tricks out.
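(Worth noting while hunting for subtle bugs: the two variants are not quite interchangeable. The '\n'-first form also reads buffer[data_end] when the range contains no newline, so it relies on one extra readable byte past the end. A small sketch making both behaviors explicit; the function names are mine, not the library's:)

```cpp
// Mirrors the original loop: it tests the byte before the bounds check,
// so buffer[end] must be readable (it may be examined).
int scan_lf_first(const char* buffer, int begin, int end){
    int i = begin;
    while(buffer[i] != '\n' && i != end) ++i;
    return i;
}

// Mirrors the swapped loop: the bounds check short-circuits the load,
// so no byte past end is ever touched.
int scan_end_first(const char* buffer, int begin, int end){
    int i = begin;
    while(i != end && buffer[i] != '\n') ++i;
    return i;
}
```

Both return the same index whenever buffer[end] is at least readable, which the LineReader buffer guarantees here, so the swap is purely a codegen change.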

ben-strasser commented 4 years ago

The following version is also bad:

                        int line_end = data_begin;
                        const char*buffer_ptr = buffer.get();
                        while(line_end != data_end && buffer_ptr[line_end] != '\n'){
                                ++line_end;
                        }

The running times that I observe are:

2287366 2286987 2273850 2280842 2277892 2300054 2274304 2276366 2273849 2274207 2273673 2274042 2280661 2275466 2284489 2310198

This does not make sense to me.

ben-strasser commented 4 years ago

Clang 9.0.0 produces running times of about 2.1 seconds regardless of the tested version. I think that the current version on master slightly wins out. However, the difference is so small that it might be a random fluctuation.

chris0x44 commented 4 years ago

Nice catch. I think I came across this, but thought it was a fluke (didn't look further into it). Now I compared the two versions (check end first vs. check LF first) and the result is very interesting:

size_t get_line_lf_end(char* buffer, int data_begin, int data_end) {
    int line_end = data_begin;
    while(buffer[line_end] != '\n' && line_end != data_end){
            ++line_end;
    }
    return line_end;
}

Compiles to this:

get_line_lf_end(char*, int, int):
        movsx   r8, esi
        cmp     BYTE PTR [rdi+r8], 10
        je      .L1
        cmp     esi, edx
        je      .L1
        add     esi, 1
        movsx   rax, esi
.L3:
        cmp     BYTE PTR [rdi+rax], 10
        mov     r8, rax
        sete    sil
        cmp     edx, eax
        sete    cl
        add     rax, 1
        or      sil, cl
        je      .L3
.L1:
        mov     rax, r8
        ret

While this:

size_t get_line_end_lf(char* buffer, int data_begin, int data_end) {
    int line_end = data_begin;
    while(line_end != data_end && buffer[line_end] != '\n'){
            ++line_end;
    }
    return line_end;
}

Compiles to this:

get_line_end_lf(char*, int, int):
        movsx   rax, esi
        cmp     esi, edx
        jne     .L10
        jmp     .L7
.L13:
        lea     ecx, [rax+1]
        add     rax, 1
        cmp     edx, eax
        je      .L12
.L10:
        cmp     BYTE PTR [rdi+rax], 10
        jne     .L13
.L7:
        ret
.L12:
        movsx   rax, ecx
        ret

You can see that the second implementation generates a much tighter loop, where je .L12 is practically never taken. This is the assembly generated by GCC 9.2 with -O3. Here's the link to the Compiler Explorer sample: https://godbolt.org/z/KBANiA

I also did a test run with the switched comparison on Windows, and it showed a ~10-15% improvement over the original version, which is quite good.

chris0x44 commented 4 years ago

One more thing regarding the bit-tricks commit. Can you do a test run with the following test file:

   {
      std::ofstream out("testfile");
      for (int i = 0; i < 10000000; ++i) {
         out << i;
         for (int j = 0; j < 10; ++j)
            out << ',' << (1000000+j);
         out << '\n';
      }
   }

It's a scenario where each field contains a 6-digit number instead of a single digit. The reason I ask is that I'm seeing the following results on my machine:

Baseline

949544
889969
888445
856085
944465
827497
876116
794432
731962
802601

Bit-trickery

443171
437787
447546
457375
434767
451046
432036
432197
431616
426336

That amounts to almost twice the parsing speed on Windows, and I'm very curious what GCC/Clang make of this.
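(For readers without access to the PR diff: the "bit-trickery" discussed above is in the SWAR family. The exact code isn't quoted in this thread, but a typical 8-bytes-at-a-time newline scan looks roughly like the sketch below; this is my illustration using the classic zero-byte test from Bit Twiddling Hacks, not the PR's actual code.)

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// SWAR sketch: test eight bytes per iteration for '\n', then locate the
// exact position with a scalar loop inside the matching word.
std::size_t find_newline(const char* buffer, std::size_t begin, std::size_t end){
    const std::uint64_t lf = 0x0A0A0A0A0A0A0A0Aull;  // '\n' in every byte lane
    std::size_t i = begin;
    while(i + 8 <= end){
        std::uint64_t w;
        std::memcpy(&w, buffer + i, 8);  // memcpy avoids alignment/aliasing UB
        std::uint64_t x = w ^ lf;        // a lane is zero iff it held '\n'
        if((x - 0x0101010101010101ull) & ~x & 0x8080808080808080ull)
            break;                       // this word contains a '\n'
        i += 8;
    }
    while(i != end && buffer[i] != '\n') // finish with the end-first scalar loop
        ++i;
    return i;
}
```

With 6-digit fields each line is long enough that most iterations take the 8-byte fast path, which would explain why the speedup is much larger here than with single-digit fields.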