Hi, and thank you for your valuable suggestions. And of course I'm keen to optimize Clipper2 performance (within reason 😜), so I'll certainly look at using `std::deque`.
> In addition, you can simply clear the deque at the end rather than having to iterate over the entire linked list in `DeleteEdges()`.
Actually, `DeleteEdges()` is really only a fail-safe cleanup in case something went wrong. Otherwise `Actives` are always deleted in `DeleteFromAEL()`. But I've taken on board your point about improving performance with block memory allocations and how `std::deque` can facilitate that. (No doubt you've noticed I'm not overly experienced in C++ 😁).
> Some other `new` calls that show up near the top of the profile are the `new OutPt` lines.

Perhaps I can use another `std::deque` for these too.
Thanks. Let me know if you have any changes you want me to test. I can't really share my test cases, but it's easy for me to modify the code and run it through the profiling flow.
> `activealloc.push_back(Active()); left_bound = &activealloc.back();`
AFAICS, this will only work (without modification as per below) as long as Actives aren't deleted until they're all cleared together at the very end of the clipping op. And this would consume a lot of memory, especially with the sorts of data sets you've described above.
However, I could consider keeping a secondary `queue<int>` that contains indexes to "inactive" Actives in `queue<Active>`, and reusing them. This way `queue<Active>` would only need to be as large as the scanline with the most Actives.
I don't entirely understand what InsertLocalMinimaIntoAEL() does. The while(succeeded_) loop at the top of ClipperBase::ExecuteInternal() calls this. How many times does the while() loop normally execute? I was thinking the Active objects were all added at the beginning and removed at the end. If that's not the case and you think it's a problem, then maybe a different solution is needed.
Memory usage isn't a problem in my flow. I'm making up to billions of calls to the various BooleanOp() functions with a small number of polygons per call. I believe all of the state with the Active objects is recreated each time, and this is what makes it slow. But maybe I'm just using it wrong. Is it legal to create the Clipper64 object once and reuse it for many Add...() and Execute() calls? Is there something I need to call between uses, or does the CleanUp() at the end of Execute() do all the clean up that's needed?
In that situation, it could be better to use a free list for the Active (and other similar) objects that are new'ed and deleted repeatedly. Instead of deleting them, add them to the free list, and then take the next item on the free list (if nonempty) when you need a new object. This would work for Active, OutPt, OutRec, etc.
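For illustration, here's a minimal sketch of that free-list idea (the `FreePool` class and `Node` type are hypothetical stand-ins for Active/OutPt/etc., not Clipper2 code):

```cpp
#include <vector>

// Hypothetical pooled type; in Clipper2 this role would be played by
// Active, OutPt, OutRec, etc.
struct Node { /* ... fields ... */ };

class FreePool {
  std::vector<Node*> free_;  // "deleted" nodes waiting to be reused
public:
  Node* Acquire() {
    if (!free_.empty()) {    // reuse instead of calling new
      Node* n = free_.back();
      free_.pop_back();
      *n = Node();           // reset to a freshly constructed state
      return n;
    }
    return new Node();       // free list empty: fall back to the heap
  }
  void Release(Node* n) { free_.push_back(n); }  // defer the delete
  ~FreePool() { for (Node* n : free_) delete n; }
};
```

With one pool per clipper (or per thread), `new`/`delete` and the allocator lock are only hit when the pool runs dry, which is the contention described above.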
Note that Clipper2 is already ~15x faster than boost::polygon on this type of flow, so it reduces some of these runs from hours to tens of minutes! But it seems like boost::polygon is faster when there are millions of polygons, so I have to select between the two libraries based on polygon/vertex count.
> I don't entirely understand what InsertLocalMinimaIntoAEL() does. The while(succeeded_) loop at the top of ClipperBase::ExecuteInternal() calls this. How many times does the while() loop normally execute?
All the vertices in the clipping dataset are ordered on their y members, and the number of iterations of the while loop equals the number of different values for y.
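As a schematic illustration of that sweep (not Clipper2's actual scanbeam code), the loop runs once per distinct y value:

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Schematic only: the number of iterations equals the number of
// distinct y values among the input vertices.
void Sweep(const std::vector<int64_t>& vertex_ys)
{
  std::set<int64_t> scanlines(vertex_ys.begin(), vertex_ys.end());
  for (int64_t y : scanlines) {
    // insert local minima starting at y, process edges crossing y, etc.
  }
}
```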
> I was thinking the Active objects were all added at the beginning and removed at the end. If that's not the case and you think it's a problem, then maybe a different solution is needed.
Nope, and yep 😜.
> Memory usage isn't a problem in my flow. I'm making up to billions of calls to the various BooleanOp() functions with a small number of polygons per call.
OK, if you're only adding a small number of polygons per call, then memory certainly won't be a problem for you.
> I believe all of the state with the Active objects is recreated each time, and this is what makes it slow.
I'm sure this does slow things down, even with smallish datasets.
> But maybe I'm just using it wrong. Is it legal to create the Clipper64 object once and reuse it for many Add...() and Execute() calls?
Absolutely. Just call `Clipper64.Clear()` after each op to remove the existing data.
> Is there something I need to call between uses, or does the CleanUp() at the end of Execute() do all the clean up that's needed?
`CleanUp()` only removes the "work data", not the subject and clip polygon data. So it's possible to add additional polygons, or perform a different clipping operation on the same polygons, with a second call to `Execute()`.
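So reusing a single `Clipper64` across many small ops should look something like this (a sketch; the XOR/NonZero parameters just mirror the pairwise-XOR flow described above and may need adjusting):

```cpp
#include <utility>
#include <vector>
#include "clipper2/clipper.h"
using namespace Clipper2Lib;

// Sketch: one Clipper64 instance reused across many small Boolean ops.
void RunManyOps(const std::vector<std::pair<Paths64, Paths64>>& jobs)
{
  Clipper64 clipper;  // constructed once, reused for every op
  for (const auto& [subject, clip] : jobs) {
    clipper.AddSubject(subject);
    clipper.AddClip(clip);
    Paths64 solution;
    clipper.Execute(ClipType::Xor, FillRule::NonZero, solution);
    // ... consume solution ...
    clipper.Clear();  // remove subject/clip data before the next op
  }
}
```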
> In that situation, it could be better to use a free list for the Active (and other similar) objects that are new'ed and deleted repeatedly. Instead of deleting them, add them to the free list, and then take the next item on the free list (if nonempty) when you need a new object. This would work for Active, OutPt, OutRec, etc.
I'm not sure what you mean by a "free list" unless it's pretty much what I proposed in my previous post.
> Note that Clipper2 is already ~15x faster than boost::polygon on this type of flow, so it reduces some of these runs from hours to tens of minutes! But it seems like boost::polygon is faster when there are millions of polygons, so I have to select between the two libraries based on polygon/vertex count.
I'm delighted to hear that the hundreds of hours I've put into Clipper and Clipper2 are proving worthwhile. However, I am aware that Clipper2 has roughly O(n³) performance, though that's with a very long and flat start to the performance curve. I'm certainly keen to address the not-so-flat part of the curve 🤣.
Thanks for the answers, that clears up a lot of the questions I had. I'll see if I can get better performance from reusing the Clipper64 class when I get back to working on this tomorrow.
Yes, what you describe sounds like a "free list". We just used different terminologies.
I actually wrote my own polygon processing library, but it only works for Manhattan and 45 degree geometry. All of the operations are linear in runtime with the dataset and it scales to billions of shapes. Unfortunately, the approach I used doesn't generalize to all angle polygons. This week I've been looking for a better solution to this particular problem. I originally used boost::polygon, partially because it was written by one of my coworkers. But it doesn't seem like the right solution for many small calls. All-angle polygons are very complex to work with, especially when they contain holes.
> However, I could consider keeping a secondary `queue<int>` that contains indexes to "inactive" Actives in `queue<Active>`, and reusing them. This way `queue<Active>` would only need to be as large as the scanline with the most Actives.
I've just tried this now and sadly it made no measurable difference in my benchmark tests.

Clipper2Lib.zip
Edit: But performance is no worse, and perhaps this will work better on PCs that can use more than 4 cores.
Thanks for the effort! It's a bit hard to tell what you changed though, because the diff is very long. Your zip file appears to have different line endings, so I get many ^M characters on Linux. I can work around this, but there are also other changes you made this week that I haven't picked up, such as changes to exception handling. Do you have the diff associated with just this change?
I ran some perf tests, but these are smaller than the very large test I ran yesterday:

| Test | Threads | Original | My deque hack | Your new code |
| --- | --- | --- | --- | --- |
| 1a - Manhattan layout | 1 | 11.15s | 10.83s | 10.82s |
| 1b - Manhattan layout | 8 | 22.61s CPU / 2.79s elapsed | 20.74s CPU / 2.64s elapsed | 21.16s CPU / 2.7s elapsed |
| 2a - All-angle layout | 1 | 107.1s | 105.45s | 112.91s |
| 2b - All-angle layout | 8 | 340.81s CPU / 42.76s elapsed | 325.95s CPU / 40.8s elapsed | 362.86s CPU / 45.52s elapsed |
So I guess you're right, it doesn't make a big difference. It did show up as ~10% of the runtime in the profiler though, which is odd. Maybe part of the problem is that the host I'm using has other running processes and this is causing some random variation in the 8 thread runtimes. I really need to rerun that multi-hour job over the weekend.
Okay, I ran the profiler again. Now the time that was spent in new/malloc() is spent in deque::emplace_back(). So maybe the fact that you're doing one larger allocation rather than many smaller ones doesn't help. Then why did it seem to work in my solution?
Your changes aren't consistently better. In fact they seem to make the second case slower. I need to run more tests next week, do some experimental code changes, and get back to you. Maybe what I'm looking for is a way to use a custom per-thread allocator in there. However, that seems like a lot of work to add, so it's not something I'm asking for at this time.
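For what it's worth, a per-thread pool along these lines is one way that idea could look (purely a sketch, under the assumption that each object is acquired and released on the same thread; nothing like this exists in Clipper2):

```cpp
#include <new>
#include <utility>
#include <vector>

// Sketch of a thread-local object pool; T would stand in for Active etc.
template <class T>
class ThreadLocalPool {
  std::vector<void*> free_;  // raw blocks awaiting reuse
public:
  template <class... Args>
  T* Acquire(Args&&... args) {
    void* p;
    if (!free_.empty()) { p = free_.back(); free_.pop_back(); }
    else p = ::operator new(sizeof(T));  // allocator lock only on this path
    return new (p) T(std::forward<Args>(args)...);  // construct in place
  }
  void Release(T* obj) {
    obj->~T();               // destroy, but keep the memory for reuse
    free_.push_back(obj);
  }
  ~ThreadLocalPool() { for (void* p : free_) ::operator delete(p); }

  static ThreadLocalPool& Instance() {
    thread_local ThreadLocalPool pool;  // one independent pool per thread
    return pool;
  }
};
```

Since each OpenMP thread would own its own pool, the global allocator lock that limits scaling beyond ~4 cores would only be taken when a pool grows.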
Thanks again!
In `clipper.engine.h`:

```cpp
struct Active {
  ...
  // added ...
  size_t container_idx = 0;
  Active(size_t idx) : container_idx(idx) {}
};

typedef std::deque<Active> ActivesList;

class ClipperBase {
private:
  // added ...
  std::queue<size_t> inactive_indexes_;
  ActivesList active_container_;
  inline Active& GetNewActive();
  inline void MakeActiveInactive(Active& e);
};
```

In `clipper.engine.cpp`:

Added ...

```cpp
inline Active& ClipperBase::GetNewActive()
{
  // Reuse a recycled slot when one is available; otherwise grow the deque.
  if (inactive_indexes_.empty())
    return active_container_.emplace_back(
      Active(active_container_.size()));
  size_t idx = inactive_indexes_.front();
  inactive_indexes_.pop();
  return active_container_[idx];
}

inline void ClipperBase::MakeActiveInactive(Active& e)
{
  inactive_indexes_.push(e.container_idx);
  // Reset state so the slot can be safely reused.
  e.join_with = JoinWith::None;
  e.wind_cnt = 0;  //
  e.wind_cnt2 = 0; // important!!
}
```

Modified ...

```cpp
void ClipperBase::CleanUp()
{
  //DeleteEdges(actives_);
  active_container_.clear();
  std::queue<size_t> empty;
  std::swap(inactive_indexes_, empty); // std::queue has no clear()
  ...
}

void ClipperBase::InsertLocalMinimaIntoAEL(int64_t bot_y)
{
  ...
  //left_bound = new Active();
  left_bound = &GetNewActive();
  ...
  //right_bound = new Active();
  right_bound = &GetNewActive();
}

inline void ClipperBase::DeleteFromAEL(Active& e)
{
  ...
  //delete &e;
  MakeActiveInactive(e);
}
```
Thanks. That all looks good, but I'm surprised it doesn't help in my tests. I guess this could make it easier to use a custom allocator with the deque. Are you planning to commit this change, or only if we can prove it helps?
> Are you planning to commit this change, or only if we can prove it helps?
The latter, since it does add another layer of complexity.
Okay. I'll see if I can come up with another solution to the malloc() problem. It's not a huge issue though, and not something I want to spend an unreasonable amount of time on. You can close this issue if you'd like. Thanks.
I'm using the C++ code for Clipper2 on linux. In general it works very well for my use model of computing pairwise XORs of many small polygon clips. However, when I run with OpenMP multithreading I see it doesn't scale beyond ~4 cores due to the thread lock in the memory allocator.
There are many small memory allocations of individual items. For example, in ClipperBase::InsertLocalMinimaIntoAEL(), there are two calls like this: `left_bound = new Active();`. It seems like these Active objects are constructed at the beginning and deleted at the end. Would it be possible to block allocate a group of Active objects rather than allocating them individually? I tried replacing this with a std::deque and using it like this:

```cpp
activealloc.push_back(Active());
left_bound = &activealloc.back();
```
This appears to work and reduces runtime by about 10%. In addition, you can simply clear the deque at the end rather than having to iterate over the entire linked list in DeleteEdges().
Some other `new` calls that show up near the top of the profile are the `new OutPt` lines in AddPaths() and AddOutPt(). I'm not sure what the best fix for this is; maybe a custom allocator with a free list.
Fixing all of these could be a 2x improvement, or even more when using more threads. For reference, my flow is computing ~4 billion small Boolean operations and these calls take about 75% of the ~2 hours total runtime with 4 threads.
Thank you.