Use Abseil hash maps instead of std unordered containers for a huge speedup

lrineau commented 5 years ago

Issue Details

Inspired by one of the cppcon talks I listen to during my free time, I have quickly tried the hash maps from Abseil instead of std::unordered_map.

My try was a modification of copy_face_graph:

diff --git a/BGL/include/CGAL/boost/graph/copy_face_graph.h b/BGL/include/CGAL/boost/graph/copy_face_graph.h
index 9a2cb628825..fd0b602f284 100644
--- a/BGL/include/CGAL/boost/graph/copy_face_graph.h
+++ b/BGL/include/CGAL/boost/graph/copy_face_graph.h
@@ -35,6 +35,7 @@
 #include <boost/unordered_map.hpp>
 #include <boost/utility/enable_if.hpp>
 #include <boost/function_output_iterator.hpp>
+#include <absl/container/flat_hash_map.h>

 namespace CGAL {

@@ -189,7 +190,7 @@ void copy_face_graph(const SourceMesh& sm, TargetMesh& tm,
   typedef typename boost::graph_traits<SourceMesh>::halfedge_descriptor sm_halfedge_descriptor;
   typedef typename boost::graph_traits<TargetMesh>::halfedge_descriptor tm_halfedge_descriptor;

-  boost::unordered_map<sm_halfedge_descriptor,
+  absl::flat_hash_map<sm_halfedge_descriptor,
                        tm_halfedge_descriptor> hash_map(num_halfedges(sm));
   copy_face_graph_impl(sm, tm,
                        boost::make_assoc_property_map(hash_map),

The results are stuning: now 3.2 seconds instead of 7.1 seconds

Source Code

The .cpp code: https://gist.github.com/2e9ea602a7b1e46cd34773611acac3c6

Here is the local patch to CMakeLists.txt (a non-clean hack):

diff --git a/BGL/examples/BGL_polyhedron_3/CMakeLists.txt b/BGL/examples/BGL_polyhedron_3/CMakeLists.txt
index ac3bd5256f2..a821195ff9c 100644
--- a/BGL/examples/BGL_polyhedron_3/CMakeLists.txt
+++ b/BGL/examples/BGL_polyhedron_3/CMakeLists.txt
@@ -67,6 +67,11 @@ create_single_source_cgal_program( "transform_iterator.cpp" )

 create_single_source_cgal_program( "copy_polyhedron.cpp" )

+add_subdirectory(googletest)
+add_subdirectory(abseil-cpp)
+
+target_link_libraries(copy_polyhedron PRIVATE absl::flat_hash_map)
+
 if(OpenMesh_FOUND)
   target_link_libraries( copy_polyhedron PRIVATE ${OPENMESH_LIBRARIES} )
 endif()

I had to checkout abseil and gtest:

$ git clone git@github.com:abseil/abseil-cpp.git
$ git clone git@github.com:google/googletest.git

I ran that modified copy_polyhedron.cpp on demo/Polyhedron/data/man.off, and as I said the speedup is huge: 3.2s after the patch to <CGAL/boost/graph/copy_face_graph.h>, instead of 7.1s.

Environment

Operating system (Windows/Mac/Linux, 32/64 bits):
Compiler: g++ 9.1.1 with -O3
Release or debug mode: Release
CGAL version: CGAL-5.0-dev, master version 44e4d1526c506a9de2566be1eb0dfdd44b04d8e5
Boost version: 1.69.0
Other libraries versions if used (Eigen, TBB, etc.):

lrineau commented 5 years ago

https://github.com/abseil/abseil-cpp license is Apache License 2.0. It is very permissive.

MaelRL commented 5 years ago

https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-01-overview/ had some rather detailed comparisons recently, which seem to yield similar observations.

sloriot commented 5 years ago

Shall @maxGimeno propose a global plan with the introduction of CGAL::unordered_map + the cmake machinery to get it everywhere?

mglisse commented 5 years ago

abseil is rather inconvenient to use (without bazel or cmake): you need to link about 30 libraries to get a trivial example working :disappointed: But yes, anything that spends a lot of time in hashtable operations should be benchmarked with other implementations. Note that in one extreme case (unordered_set was taking > 64G in this application) I had better results with the standard container than with abseil.

lrineau commented 5 years ago

@mglisse commented on Jul 4, 2019, 9:14 AM GMT+2:

abseil is rather inconvenient to use (without bazel or cmake): you need to link about 30 libraries to get a trivial example working :disappointed:

Hopefully CGAL uses CMake.

add_subdirectory(googletest)
add_subdirectory(abseil-cpp)

target_link_libraries(copy_polyhedron PRIVATE absl::flat_hash_map)

But yes, anything that spends a lot of time in hashtable operations should be benchmarked with other implementations. Note that in one extreme case (unordered_set was taking > 64G in this application) I had better results with the standard container than with abseil.

Our opting for Abseil will be conditional to a macro. With it we will easily be able to build benchmarks. Even users should be made aware of that, so that they can tune.

mglisse commented 5 years ago

Hopefully CGAL uses CMake.

As a header-only library, it doesn't matter what CGAL uses (if it uses anything at all). Providing CGALConfig.cmake does help. But users of CGAL don't all use cmake.

lrineau commented 5 years ago

@mglisse commented on Jul 4, 2019, 9:38 AM GMT+2:

Hopefully CGAL uses CMake.

As a header-only library, it doesn't matter what CGAL uses (if it uses anything at all). Providing CGALConfig.cmake does help. But users of CGAL don't all use cmake.

We do not care. Abseil will be an optional dependency of CGAL, and our CMake users will get it almost for free. As for the others... they can do experiments and benchmarks using CMake, and them make the effort in their build system if the speedup is compelling.

Actually, I do not get your point. Do you have a suggestion about an alternative?

afabri commented 5 years ago

Just my two cents, it would be good to see a higher level CGAL algorithm than copy_face_graph() as a motivation.

lrineau commented 5 years ago

@mglisse commented on Jul 4, 2019, 9:14 AM GMT+2:

abseil is rather inconvenient to use (without bazel or cmake): you need to link about 30 libraries to get a trivial example working :disappointed:

Actually, can you elaborate? Because on my machine abseil seems to require only the pthread support, and no other dependency.

maxGimeno commented 5 years ago

@afabri it concerns everything related to unordered maps, not only copy_face_graph().

afabri commented 5 years ago

The questions is if it has measurable impact on Mesh_3, Constrained_triangulation_2, Arrangement_2......

maxGimeno commented 5 years ago

We will probably have to bench to be sure. But I have seen traces of unordered maps at least in Mesh_3 and Arrangements_on_surface_2, so there will probbaly be an impact.

lrineau commented 5 years ago

For at least one user of us, CGAL::PMP::polygon_soup_to_polygon_mesh was a big consumer of time of their process. And it is all about maps.

mglisse commented 5 years ago

Actually, can you elaborate? Because on my machine abseil seems to require only the pthread support, and no other dependency.

I didn't mean external dependencies. Using a trivial program:

#include <absl/container/flat_hash_set.h>
typedef absl::flat_hash_set<int> S;
int main(){ S s; }

Complete the command line with a minimal number of -l flags: g++ test.cc -I .../include -L .../lib Here I need -labsl_hashtablez_sampler -labsl_synchronization -labsl_time -labsl_base -labsl_spinlock_wait -labsl_time_zone -labsl_int128 -labsl_symbolize -labsl_stacktrace -labsl_malloc_internal -labsl_demangle_internal -labsl_debugging_internal -labsl_dynamic_annotations -pthread ... Why they didn't at least create a thin archive or linker script so that -labsl would work is an interesting question.

mglisse commented 5 years ago

Actually, I do not get your point.

Just that abseil is a bit inconvenient to use, not as easy as using more Boost stuff for instance (even the parts that require linking with a library).

I don't know yet if you plan to test for the presence of abseil or to include a copy of it.

Do you have a suggestion about an alternative?

Not particularly. Mael posted a link to a comparison between many implementations, some of which are header-only. They don't all have Google maintaining them though, and you may wish to use more stuff from abseil later.

lrineau commented 5 years ago

I would love to use the single-header implementation robin_hood.h, that is said to be faster by those benchmarks, but it is not compatible, for the moment, with our use of -frounding-math:

robin_hood.h:1901:59: error: ‘(8.0e+1 / 1.0e+2)’ is not a constant expression
 1901 |         static constexpr double factor = MaxLoadFactor100 / 100.0;
      |                                          ~~~~~~~~~~~~~~~~~^~~~~~~

lrineau commented 5 years ago

@lrineau commented on Jul 4, 2019, 3:05 PM GMT+2:

I would love to use the single-header implementation robin_hood.h, that is said to be faster by those benchmarks, but it is not compatible, for the moment, with our use of -frounding-math:
robin_hood.h:1901:59: error: ‘(8.0e+1 / 1.0e+2)’ is not a constant expression
 1901 |         static constexpr double factor = MaxLoadFactor100 / 100.0;
      |                                          ~~~~~~~~~~~~~~~~~^~~~~~~

I have artificially removed -frounding-math, to compile the test, and it is slower than using abseil: 3.8 seconds instead of 3.2 using abseil.

mglisse commented 5 years ago

our use of -frounding-math

I don't remember if you tried recently to remove it, to see what breaks. But that's not the right place to discuss it.

I guess constexpr is a context where gcc should ignore rounding modes...

it is slower than using abseil

The list also includes phmap, which is described as essentially a headerified version of abseil.

GilesBathgate commented 3 years ago

Hi @lrineau, I've been experimenting with replacing the map in Unique_hash_map with std::unordered_map but initial results showed this to be slower than the chained_map implementation. Probably I did something wrong or I am not understanding the use case of Unique_hash_map correctly:

https://gist.github.com/GilesBathgate/3fb78d3949ca52f4d1a0d14dcc66ae23

Would this benefit from the abseil version of map?

mglisse commented 3 years ago

I've been experimenting with replacing the map in Unique_hash_map with std::unordered_map but initial results showed this to be slower than the chained_map implementation.

The whole point of this issue is that std::unordered_map is slow, what did you expect?

GilesBathgate commented 3 years ago

@mglisse :joy: Yes, But my reason for investigating it (prior to coming across this PR) was that in some cases, I've found anything other than chained_map to be faster, especially when the number of elements is small because chained_map doesn't lazily initialise minimum of 512 buckets.

For example here: https://github.com/CGAL/cgal/blob/74af3e7d33e7354e4090a1b17941a3f0c372ca8d/Nef_S2/include/CGAL/Nef_S2/SM_const_decorator.h#L392

where visited only contains approximately 6 to 10 elements. The chained_map construction dominates the CPU usage, making O(1) lookup irrelevant probably a std::vector with O(n) would be faster in this scenario. Another example where Unique_hash_map construction is slow is #5143

lrineau commented 3 years ago

Hi @lrineau, I've been experimenting with replacing the map in Unique_hash_map with std::unordered_map but initial results showed this to be slower than the chained_map implementation. Probably I did something wrong or I am not understanding the use case of Unique_hash_map correctly:

gist.github.com/GilesBathgate/3fb78d3949ca52f4d1a0d14dcc66ae23 (too large to embed)

Would this benefit from the abseil version of map?

Hi Giles, I do not have the time to investigate on your issue. I hope others will, though.

The best way to know if the Abseil hash maps can benefit your code is to try, anyway.

GilesBathgate commented 3 years ago

@lrineau Thanks I've given it a try

https://gist.github.com/GilesBathgate/3fb78d3949ca52f4d1a0d14dcc66ae23

@mglisse Abseil was 20% slower than std:unordered_map, for me though so I'm confused now.

GilesBathgate commented 3 years ago

@lrineau @mglisse I am getting good results with robin_hood.h I use robin_hood::unordered_node_map (The issue with -frounding math was resolved https://github.com/martinus/robin-hood-hashing/issues/51)

Updated gist: https://gist.github.com/GilesBathgate/3fb78d3949ca52f4d1a0d14dcc66ae23

mglisse commented 3 years ago

where visited only contains approximately 6 to 10 elements

Did you try a flatmap for this one? (I know nothing about that code, and in particular not if the size of this container can ever be large) Such small maps are not a typical use case for hash maps, so many are probably not optimized for that case. The code you point to seems suboptimal, it calls visited[v] twice in a row, changing the code a bit, a (hash)set would make more sense than a map (it probably used a map because there was no set in cgal), checking visited.insert(v).second.

Nice if robin hood works well, although if you want to change Unique_hash_map, which is used a lot in CGAL, and not just use a different map in one specific place, reviewers are likely to ask for various benchmarks, to make sure it doesn't regress some other use cases (bigger table, more expensive key, etc). But it seems quite possible that it is indeed faster all around.

GilesBathgate commented 3 years ago

@mglisse Yes, I tried a boost::flat_set and it was much faster I'll add a comment about that to #5379

Regarding Unique_hash_map if I understand correctly it's optimised for when the keys are known to be unique and therefore no collisions. That said robin_hood does seem to out-perform chained_map in my tests, especially in the object construction. Either way, I've found some that some cases require robin_hood::unordered_node_map instead of robin_hood::unordered_flat_map So it doesn't get the cache friendliness benefits. I was thinking of asking @martinus if there is a way to use robin_hood::* in a way that is optimal for unique hash maps.

martinus commented 3 years ago

If you know the maximum size of the hashmap, it is beneficial to call reserve() with that amount. That way you only get a single allocation. Not much else you can do as far as I see on a quick glance.

Also, as @mglisse said, the linked code looks suboptimal. E.g. unordered_flat_set might help, and I think the std::list could use a custom allocator so it doesn't malloc & free each time you put an element into it. A while ago I've created an allocator for that: https://gist.github.com/martinus/ee4cc9d82d4610647180301c47b030a2 see e.g. this benchmark: https://quick-bench.com/q/BLW3OTlodJ2c3TmQGJUPT_TAdGA

mglisse commented 3 years ago

the std::list could

... be wrapped in a std::queue, which could then use its default backend deque :wink: (I didn't profile that code, no idea what part is relevant to optimize)

GilesBathgate commented 3 years ago

If you know the maximum size of the hashmap, it is beneficial to call reserve() with that amount.

So this could be done:

-  CGAL::Unique_hash_map<SVertex_const_iterator,bool> visited(false);
+  robin_hood::unordered_flat_set<SVertex_const_iterator,Handle_hash_function> visited;
+  visited.reserve(number_of_svertices());

but unfortunately:

https://github.com/CGAL/cgal/blob/74af3e7d33e7354e4090a1b17941a3f0c372ca8d/Nef_3/include/CGAL/Nef_3/Vertex.h#L214-L219

:disappointed:

GilesBathgate commented 3 years ago

@mglisse Presumably one can use return CGAL::iterator_distance(svertices_begin(), svertices_end()); (or just std::distance) in number_of_svertices?

mglisse commented 3 years ago

@mglisse Presumably one can use return CGAL::iterator_distance(svertices_begin(), svertices_end()); (or just std::distance) in number_of_svertices?

Probably, but why? It would perform exactly the same iteration.

GilesBathgate commented 3 years ago

Probably, but why? It would perform exactly the same iteration.

Because I thought it was a random access iterator, so last-first, it's not of course, so you are right.

GilesBathgate commented 3 years ago

I know nothing about that code, and in particular not if the size of this container can ever be large

@mglisse Yes it can be large if you have a cone with 10,000 sides, the sphere_map of the apex will have 10,000 svertices. so visited would need that many elements. robin_hood::unordered_flat_map with .reserve does seem to be a good option though.

CGAL / cgal