Loading the frontal face shape predictor calls malloc 135053 times.

mcourteaux commented 6 months ago

Main idea

I am not familiar with the dlib codebase, but it seems there is some mem_manager stuff happening in quite a few places. As the whole dlib::deserialize<> traversal is doing a bunch of small news, this is ideal for a bump allocator (a.k.a. "memory arena").

I get that it's not trivial to integrate that into the STL containers being used. The standard library offers "polymorphic memory resources" in the std::pmr:: namespace (C++17), which include a bump allocator (std::pmr::monotonic_buffer_resource).
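
As a rough illustration of the std::pmr idea (this is not dlib code, and `counting_resource` and `upstream_calls_for` are names made up for this sketch): a `std::pmr::monotonic_buffer_resource` hands out memory by bumping a pointer inside large chunks, so many small container allocations should turn into very few requests to the upstream resource. Assumes C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <vector>

// Hypothetical upstream resource that only counts how often it is asked for memory.
class counting_resource : public std::pmr::memory_resource {
public:
    std::size_t calls = 0;

private:
    void* do_allocate(std::size_t bytes, std::size_t align) override {
        ++calls;
        return std::pmr::new_delete_resource()->allocate(bytes, align);
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
        std::pmr::new_delete_resource()->deallocate(p, bytes, align);
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};

// Builds `count` small vectors inside a bump arena and returns how many
// times the arena had to go to the upstream resource.
std::size_t upstream_calls_for(std::size_t count) {
    counting_resource upstream;
    // 1 MiB initial chunk; an arbitrary size picked for this sketch.
    std::pmr::monotonic_buffer_resource arena(std::size_t{1} << 20, &upstream);
    {
        std::pmr::vector<std::pmr::vector<float>> rows(&arena);
        rows.reserve(count);
        for (std::size_t i = 0; i < count; ++i)
            rows.emplace_back(std::size_t{16}, 1.0f);  // inner vector also uses the arena
        assert(rows.size() == count);
    }
    return upstream.calls;
}
```

With 1000 small vectors, the upstream resource should see far fewer than 1000 requests, since everything is carved out of the arena's chunks. Getting dlib::matrix to allocate this way would be a separate problem, because it does not take a std::pmr allocator.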

However, most allocations happen inside dlib::matrix (I estimate 70% of them).

So, I instrumented operator new() and operator delete() to keep track of these things. The result is that during loading of the frontal face shape predictor here is what happens:

135053 allocations
4 frees
total allocation 68MB.
average allocation size: 515 bytes per malloc.
70% of those allocations all happen inside dlib::matrix.
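
Counts like these can be collected with something along the following lines. This is a guess at the general approach, not the instrumentation actually used for the numbers above: it replaces the global `operator new` and `operator delete` and bumps a few atomic counters. The names `g_allocs`, `g_frees`, `g_bytes` and `allocations_during` are invented for this sketch, and over-aligned allocations go through separate overloads that are not counted here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical global counters, updated by the replaced operators below.
std::atomic<std::size_t> g_allocs{0};
std::atomic<std::size_t> g_frees{0};
std::atomic<std::size_t> g_bytes{0};

// Replacement for the global allocation function: count, then defer to malloc.
void* operator new(std::size_t size) {
    g_allocs.fetch_add(1, std::memory_order_relaxed);
    g_bytes.fetch_add(size, std::memory_order_relaxed);
    if (void* p = std::malloc(size ? size : 1)) return p;
    throw std::bad_alloc{};
}

// Replacement for the global deallocation function: count non-null frees.
void operator delete(void* p) noexcept {
    if (p) g_frees.fetch_add(1, std::memory_order_relaxed);
    std::free(p);
}

// Sized variant, forwarded to the unsized one.
void operator delete(void* p, std::size_t) noexcept { operator delete(p); }

// Returns how many allocations happened while `work` ran.
std::size_t allocations_during(void (*work)()) {
    assert(work != nullptr);
    const std::size_t before = g_allocs.load();
    work();
    return g_allocs.load() - before;
}
```

The default `operator new[]` and the nothrow variants forward to the replaced functions, so array allocations are counted as well. Dividing `g_bytes` by `g_allocs` gives the average allocation size.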

Overall, I'd argue that this is bad for performance.

I actually tested it: replacing the default operator new behavior with a bump allocator (memory arena) brought the load time down from 1.75s to 1.18s, i.e. about 1.48x faster (a 33% shorter load time).
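
The code for that experiment is not shown in this thread, so the following is only a minimal sketch of what "replacing operator new with a bump allocator" could look like. The arena size, the names (`kArenaBytes`, `g_arena`, `g_offset`, `in_arena`) and the malloc fallback are all assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

namespace {
// Assumed upper bound on what gets allocated during loading (128 MiB).
constexpr std::size_t kArenaBytes = std::size_t{128} << 20;
alignas(std::max_align_t) unsigned char g_arena[kArenaBytes];
std::atomic<std::size_t> g_offset{0};  // next free byte in the arena

bool in_arena(const void* p) {
    const auto a = reinterpret_cast<std::uintptr_t>(p);
    const auto base = reinterpret_cast<std::uintptr_t>(g_arena);
    return a >= base && a < base + kArenaBytes;
}
}  // namespace

// Bump allocation: round the size up, then advance the offset atomically.
void* operator new(std::size_t size) {
    constexpr std::size_t align = alignof(std::max_align_t);
    if (size == 0) size = 1;
    if (size <= kArenaBytes) {
        const std::size_t rounded = (size + align - 1) / align * align;
        std::size_t old = g_offset.load(std::memory_order_relaxed);
        while (rounded <= kArenaBytes - old) {
            assert(old <= kArenaBytes);  // the offset never passes the end
            if (g_offset.compare_exchange_weak(old, old + rounded,
                                               std::memory_order_relaxed))
                return g_arena + old;
        }
    }
    // Arena exhausted (or request too large): fall back to malloc.
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc{};
}

// Arena memory is never reclaimed; only fallback blocks are freed.
void operator delete(void* p) noexcept {
    if (p && !in_arena(p)) std::free(p);
}
void operator delete(void* p, std::size_t) noexcept { operator delete(p); }
```

Because nothing is ever returned to the arena, this only makes sense for an experiment or for a load phase that frees almost nothing (4 frees against 135053 allocations in the numbers above). It is not a general-purpose allocator.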

davisking commented 6 months ago

You sure you are building with optimizations turned on? For instance, I get a load time of 480ms for loading the 68 point model.

mcourteaux commented 6 months ago

Sorry, my reported times were actually from both the "frontal_face_detector" AND the "shape_predictor_68_face_landmarks" together. Let me break down more clearly what's happening:

First, for this answer, I bumped the compile flag from -O2 to -O3, as per your suggestion.

frontal face detector:
- normal new: 13395 mallocs + 12299 frees, taking 266ms.
- instrumented normal new: 13395 mallocs + 12299 frees, taking 340ms.
- memory arena new: 0 mallocs + 0 frees, taking 288ms.
- instrumented memory arena new: 0 mallocs + 0 frees, taking 281ms.

68 face landmarks shape predictor:
- normal new: 135053 mallocs + 4 frees, taking 876ms.
- instrumented normal new: 135053 mallocs + 4 frees, taking 1.34s.
- memory arena new: 0 mallocs + 0 frees, taking 783ms.
- instrumented memory arena new: 0 mallocs + 0 frees, taking 836ms.

So my original timings were skewed by the overhead of recording every allocation and free in too much detail. Looking at the non-instrumented timings, skipping all the malloc work speeds up loading the 68 face landmarks shape predictor by about 11% (876ms to 783ms). Doing the same for the frontal face detector slows it down by about 8% (266ms to 288ms), perhaps due to cache misses, as it allocates around 2MB and frees about 1MB again during loading.
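
For reference, the percentages are plain relative changes of the non-instrumented load times. A tiny helper (`percent_change` is a name made up here) makes the arithmetic explicit:

```cpp
#include <cassert>

// Relative change in load time, in percent; negative means faster.
double percent_change(double before_ms, double after_ms) {
    assert(before_ms > 0.0);
    return (after_ms - before_ms) / before_ms * 100.0;
}
```

`percent_change(876, 783)` comes out at roughly -10.6 (shape predictor, faster) and `percent_change(266, 288)` at roughly +8.3 (frontal face detector, slower).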

I don't know how you manage to load the 68 point model so quickly. I'm using this snippet:

std::string path = "shape_predictor_68_face_landmarks.dat";
try {
  dlib::deserialize(path) >> m_internals->sp_face_landmarks;
  return true;
} catch (const dlib::serialization_error &e) {
  spdlog::error("Could not load {}: {}", path, e.what());
  return false;
}

davisking commented 6 months ago

I was just doing deserialize(argv[1]) >> sp;

Anyway, you shouldn't need to worry about this startup time, right? Just don't load it more than once.

dlib-issue-bot commented 4 months ago

Warning: this issue has been inactive for 35 days and will be automatically closed on 2024-04-14 if there is no further activity.

If you are waiting for a response but haven't received one it's possible your question is somehow inappropriate. E.g. it is off topic, you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's official compilation instructions, dlib's API documentation, or a Google search.

mcourteaux commented 4 months ago

I indeed do it once, but this is a very expensive wait time of 1100ms. My computer can read more than 1GB/s (sequential reading) from SSD. The thing we are loading is 70MB, which should take less than 70ms, not 1142ms. Of course, I'm aware of the base64-decode happening for the FFD. Overall, what I'm trying to say is that this way of making it user-friendly (read: programmer-friendly) is actually making it unsuitable for production code. It's a bad user experience if this thing takes 1.1s to load 70MB of coefficients.

arrufat commented 4 months ago

I was never inconvenienced by the loading time of the shape predictor model. Out of curiosity, I just timed how long it takes on my machine, and it's about 350 ms.

Here's what I did, using the webcam face pose example program.

Add this at the top:

#include <chrono>
using fms = std::chrono::duration<float, std::milli>;

Time the loading:

const auto t0 = std::chrono::steady_clock::now();
deserialize("shape_predictor_68_face_landmarks.dat") >> pose_model;
const auto t1 = std::chrono::steady_clock::now();
std::cout << "shape predictor loaded in " << std::chrono::duration_cast<fms>(t1 - t0).count() << " ms\n";
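
To compare timings across machines and builds, the same measurement can be wrapped in a small reusable helper. This is only a sketch, and `elapsed_ms` is a made-up name, not part of dlib:

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Runs `work` once and returns the wall-clock time it took, in milliseconds.
template <class F>
double elapsed_ms(F&& work) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    std::forward<F>(work)();
    const auto t1 = clock::now();
    assert(t1 >= t0);  // steady_clock never goes backwards
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

It could be used as `elapsed_ms([&] { dlib::deserialize("shape_predictor_68_face_landmarks.dat") >> pose_model; })`. Running it a few times and taking the fastest result would reduce noise from the file cache and from other processes.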

dlib-issue-bot commented 3 months ago

Warning: this issue has been inactive for 35 days and will be automatically closed on 2024-05-22 if there is no further activity.

If you are waiting for a response but haven't received one it's possible your question is somehow inappropriate. E.g. it is off topic, you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's official compilation instructions, dlib's API documentation, or a Google search.

dlib-issue-bot commented 3 months ago

Notice: this issue has been closed because it has been inactive for 45 days. You may reopen this issue if it has been closed in error.

mcourteaux commented 3 months ago

Solved by ignoring it long enough. Nice.
