ddiakopoulos / tinyply

:earth_africa: C++11 ply 3d mesh format importer & exporter

Writing large files is slow #4

Closed: serialhex closed this issue 6 years ago

serialhex commented 7 years ago

I'm not sure what kinds of transformations you're doing on the data as it's being written (it doesn't look like much), but for whatever reason it significantly slows down the writing of files. I've got files that are ~0.5-1 GB that I'm exporting using your library (which, other than this hiccup, works beautifully btw) and it is suuuuuuuuuuuuper slow. Here is a super-accurate benchmark for you:

```
Writing to disk... Buffer size is: 1048576 bytes
Closing file... Took 142169 ms
fstream mode
170 ms
```

That's like 830x slower!!! I understand like 2x... or 4x, but 800!?!?!? The relevant code snippet is as follows:

```cpp
// snippet assumes <chrono>, <cstdlib>, <cstring>, <fstream>, <iostream>, <memory>;
// unlink() is POSIX (<unistd.h>)
const size_t bufsize = 1024 * 1024;
  { // scoped so the stream can't leak the file handle
    std::fstream outputStream(filename, std::fstream::out | std::fstream::binary | std::fstream::trunc);
    // set up buffering -- pubsetbuf is only guaranteed to take effect before the
    // first I/O on the stream; calling it on an already-open fstream is
    // implementation-defined (it happens to work here)
    char* buf = (char*)malloc(bufsize);
    outputStream.rdbuf()->pubsetbuf(buf, bufsize);
    std::cout << "Writing to disk... Buffer size is: " << bufsize << " bytes\n";
    auto start = std::chrono::high_resolution_clock::now();
    ply.write(outputStream, true);
    auto duration = (std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::high_resolution_clock::now() - start).count());
    std::cout << "Closing file... Took " << duration << " ms\n";
    outputStream.close(); // close (and flush) before freeing the buffer the stream still uses
    free(buf);
  }

  // test for speedz
  {
    const size_t sz = 512 * 1024 * 1024;
    const int numiter = 1;
    std::unique_ptr<char[]> data(new char[sz]);
    std::unique_ptr<char[]> buf(new char[bufsize]);
    for (size_t p = 0; p < sz; p += 16) {
      memcpy(&data[p], "BINARY.DATA.....", 16);
    }
    unlink("file.binary");
    int64_t total = 0;
    std::cout << "fstream mode\n";
    std::ofstream myfile("file.binary", std::ios::out | std::ios::binary);
    if (!myfile) {
      std::cerr << "open failed\n"; return 1;
    }
    myfile.rdbuf()->pubsetbuf(buf.get(), bufsize); // IMPORTANT: before the first write
    for (int i = 0; i < numiter; ++i) {
      auto tm1 = std::chrono::high_resolution_clock::now();
      myfile.write(data.get(), sz);
      if (!myfile)
        std::cerr << "write failed\n";
      auto tm = (std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::high_resolution_clock::now() - tm1).count());
      std::cout << tm << " ms\n";
      total += tm;
    }
    myfile.close();
  }
```

and the file I'm writing using TinyPLY ends up being ~312 MB, while the "test for speedz" section outputs a 512 MB file. How can we speed this up? Let me know if there is anything in particular you want me to try.

Thanks in advance!

ddiakopoulos commented 7 years ago

Good catch. Can you do me a big favor and hunt down the block of code that's causing the internal slowdown? By virtue of its tininess, you should only need to look at 3-4 areas with scoped timers to find the offending function if you don't have any profiling tools handy.

```cpp
class scoped_timer
{
    std::string message;
    std::chrono::high_resolution_clock::time_point t0;
public:
    scoped_timer(std::string message) : message{ std::move(message) }, t0{ std::chrono::high_resolution_clock::now() } {}
    ~scoped_timer()
    {
        std::cout << message << " completed in " << std::to_string((std::chrono::duration<float>(std::chrono::high_resolution_clock::now() - t0).count() * 1000)) << " ms" << std::endl;
    }
};
// two-level concat so __LINE__ expands before token pasting
#define PLY_TIMER_CONCAT_IMPL(a, b) a ## b
#define PLY_TIMER_CONCAT(a, b) PLY_TIMER_CONCAT_IMPL(a, b)
#define PLY_SCOPED_TIMER(MESSAGE) scoped_timer PLY_TIMER_CONCAT(scoped_timer_, __LINE__)(MESSAGE)
```
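
Usage is just a scope plus the macro; a toy, self-contained example (the sleep stands in for whichever block you're timing):

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// scoped_timer class and PLY_SCOPED_TIMER macro as defined above

int main()
{
    PLY_SCOPED_TIMER("whole function");
    {
        PLY_SCOPED_TIMER("inner block"); // stands in for a suspect tinyply block
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
}
```
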
serialhex commented 7 years ago

Hey, nice little class & macro! I'm taking that & adding it to my toolkit! :-D

Anyway, I've taken the first 200 lines of output (the file is ~4 GB and I didn't let it finish) along with a diff of the changes I made, so you can see where I'm calling the scoped timer from. I made a gist (https://gist.github.com/serialhex/7f729c641f47f58c3e174b2c67e8d851) so I don't clutter the messages here. Let me know if there are any other details you would like.

(Yes, ~4GB. We're making models with ~14 million points, so they are a little big)

ddiakopoulos commented 7 years ago

Interesting, thanks! I'll take a deeper look over the weekend and let you know what I find.

ddiakopoulos commented 7 years ago

@serialhex I'm seeing around a 10x slowdown between plyfile.write(...) (1 GB takes around 50 seconds) and a standard ostream binary write (5 seconds). Also keep in mind disk bandwidth: I'm writing to a Samsung Evo Pro SSD. One other difference in my ostream binary test is that I'm allocating a 1 GB vector and using a single .write(...) call, instead of the many small ones tinyply issues internally; this will undoubtedly have an effect.
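
To make the difference between one bulk write and many small ones concrete, here's a standalone toy benchmark (nothing tinyply-specific; sizes are arbitrary):

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    const size_t count = 16 * 1024 * 1024; // 16M floats, ~64 MB
    std::vector<float> data(count, 1.0f);
    using clock = std::chrono::high_resolution_clock;

    // One bulk write: a single pass through the stream buffer.
    {
        std::ofstream out("bulk.bin", std::ios::binary);
        auto t0 = clock::now();
        out.write(reinterpret_cast<const char*>(data.data()), count * sizeof(float));
        std::cout << "bulk write:  " << std::chrono::duration_cast<std::chrono::milliseconds>(clock::now() - t0).count() << " ms\n";
    }

    // Same bytes, but one ostream::write call per float.
    {
        std::ofstream out("element.bin", std::ios::binary);
        auto t0 = clock::now();
        for (size_t i = 0; i < count; ++i) out.write(reinterpret_cast<const char*>(&data[i]), sizeof(float));
        std::cout << "per-element: " << std::chrono::duration_cast<std::chrono::milliseconds>(clock::now() - t0).count() << " ms\n";
    }
}
```
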

One obvious thing: can you make sure you are compiling in Release (-O3) mode? It's tremendously slow in Debug (easily 20x on my machine).

I did identify one particularly slow call: https://github.com/ddiakopoulos/tinyply/blob/master/source/tinyply.cpp#L223. When I optimize the way this works, the plyfile.write(...) 1 GB test completes in ~20 seconds, around a 2.5x improvement. It's still around 4x slower than the vanilla ostream, but this seems much more reasonable.
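
The general shape of that kind of fix (a sketch of the idea only, not the actual patch; the names below are hypothetical, not tinyply internals) is to gather the bytes for a whole element list into one staging buffer and issue a single stream write, instead of one ostream::write per property value:

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <vector>

// Sketch only: gather interleaved element data into one buffer, write once.
void write_element_binary(std::ostream & os,
                          const std::vector<const uint8_t*> & properties, // one source buffer per property
                          const std::vector<size_t> & strides,            // bytes per value in each source
                          size_t element_count)
{
    size_t row = 0;
    for (size_t s : strides) row += s;

    std::vector<uint8_t> staging(row * element_count);
    uint8_t * dst = staging.data();
    for (size_t e = 0; e < element_count; ++e)
    {
        for (size_t p = 0; p < properties.size(); ++p)
        {
            std::memcpy(dst, properties[p] + e * strides[p], strides[p]);
            dst += strides[p];
        }
    }
    // One big write instead of element_count * property_count tiny ones.
    os.write(reinterpret_cast<const char*>(staging.data()), staging.size());
}

int main()
{
    std::vector<float> xyz(3 * 1000, 0.5f);  // hypothetical vertex positions
    std::vector<uint8_t> rgb(3 * 1000, 255); // hypothetical vertex colors
    std::ofstream out("vertices.bin", std::ios::binary);
    write_element_binary(out,
        { reinterpret_cast<const uint8_t*>(xyz.data()), rgb.data() },
        { 3 * sizeof(float), 3 },
        1000);
}
```
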

ataber commented 7 years ago

I too am noticing a slowdown for large files. Will you be committing the optimization of the slow call you mentioned above? Thanks for the tool, super useful :)

ddiakopoulos commented 7 years ago

Hi @ataber - sure, I just pushed it on a branch: https://github.com/ddiakopoulos/tinyply/tree/writefix

I didn't test it on any real data though (which is one reason I haven't merged it yet!)

ataber commented 7 years ago

Just watched this great talk from CppCon, where the speaker mentions that std::endl flushes the stream... after searching through my code, I found that this library uses std::endl a lot (in fact after every line of the PLY body - see here). I'm testing a change on my machine that replaces std::endl with "\n" - do you have any data that you typically benchmark on?
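
The flush cost is easy to reproduce in isolation; a toy comparison (nothing tinyply-specific):

```cpp
#include <chrono>
#include <fstream>
#include <iostream>

int main()
{
    const int lines = 200000;
    using clock = std::chrono::high_resolution_clock;

    {
        std::ofstream out("endl.txt");
        auto t0 = clock::now();
        for (int i = 0; i < lines; ++i) out << i << ' ' << i << ' ' << i << std::endl; // flushes every line
        std::cout << "std::endl: " << std::chrono::duration_cast<std::chrono::milliseconds>(clock::now() - t0).count() << " ms\n";
    }
    {
        std::ofstream out("newline.txt");
        auto t0 = clock::now();
        for (int i = 0; i < lines; ++i) out << i << ' ' << i << ' ' << i << '\n'; // buffered, flushed on close
        std::cout << "newline:   " << std::chrono::duration_cast<std::chrono::milliseconds>(clock::now() - t0).count() << " ms\n";
    }
}
```
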

maruncz commented 7 years ago

Hi, you can test with these files: https://marun.uk.to/owncloud/s/s0vLdYWaE3ADbRa

bchretien commented 7 years ago

👍 for avoiding std::endl until the very end.

ddiakopoulos commented 7 years ago

+1 - @ataber, can you follow up with some benchmarks? I'm aware of the flushing penalty, and that the replacement with "\n" is trivial, but I'd be curious :)

ataber commented 7 years ago

Thanks for the patience - you know how it is with open source :)

Here's some benchmark data for writing an ASCII file with just vertex and triangular facet data.

Commit f6a8c5d069e56118eac40d268db93fdb5eb58035 (with "\n" replacing std::endl):

Number of Vertices | Number of Facets | CPU time (s) | Wall time (s)
---- | ---- | ---- | ----
196610 | 393216 | 1.8054 | 2.2114
1066690 | 2133376 | 10.1649 | 12.441
191156 | 382472 | 1.49407 | 1.86911
286914 | 573824 | 2.59925 | 3.17874
331628 | 663252 | 3.15947 | 3.8565
145108 | 290946 | 1.24691 | 1.51113
634220 | 1376528 | 7.15963 | 8.5375
127234 | 254464 | 1.11448 | 1.36364
2308760 | 4617580 | 23.0025 | 28.2147
314607 | 630334 | 3.00653 | 3.63565
4410859 | 8821758 | 37.4868 | 47.9383
169448 | 334592 | 1.22074 | 1.47941

Commit be7f9dea48f066949c2b734b18f302c4b11214cd (with std::endl):

Number of Vertices | Number of Facets | CPU time (s) | Wall time (s)
---- | ---- | ---- | ----
196610 | 393216 | 12.4065 | 112.37
1066690 | 2133376 | 71.3297 | 602.803*
3074 | 6144 | 0.228811 | 2.0576
21532 | 43060 | 2.2884 | 13.3025
102259 | 204678 | 6.14775 | 60.0891
2282 | 4560 | 0.262724 | 1.59634
7962 | 15920 | 0.655999 | 6.29141
10073 | 20858 | 1.03503 | 8.72427
12208 | 26028 | 1.00501 | 9.09585
1327 | 2650 | 0.104316 | 0.893529

\* Had to switch to a lower-resolution dataset after this datapoint as it was taking too long.

Thanks for the great library!