andreesteve / voronoice

A nice and fast way to construct 2D Voronoi Diagrams
MIT License

Multi threading support #16

Open tirithen opened 1 year ago

tirithen commented 1 year ago

Thanks for a really easy-to-use package.

I have been experimenting with it to generate large 2D maps for games, and I have noticed that generation soon starts taking a long time while 15 of my 16 cores sit idle.

Have there been any experiments with making this package multi-threaded?

andreesteve commented 1 year ago

Hi - could you please share some more details of your use case?

Did you consider running the Voronoi generation on a separate thread from your main loop, or ahead of time on a separate thread?
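
For example, something along these lines (an untested sketch with placeholder sites and names of my own choosing) keeps the main loop free while the diagram is built on a worker thread:

    use std::thread;
    use voronoice::{BoundingBox, Point, VoronoiBuilder};

    fn main() {
        // Placeholder sites: a lightly jittered grid centered on the origin.
        let sites: Vec<Point> = (0..10_000)
            .map(|i| Point {
                x: (i % 100) as f64 - 50.0 + (i % 7) as f64 * 0.01,
                y: (i / 100) as f64 - 50.0 + (i % 5) as f64 * 0.01,
            })
            .collect();

        // Run the (single-threaded) Voronoi build on a worker thread so the
        // main/game loop stays responsive.
        let handle = thread::spawn(move || {
            VoronoiBuilder::default()
                .set_sites(sites)
                .set_bounding_box(BoundingBox::new_centered_square(120.0))
                .build()
        });

        // ... main loop keeps running here ...

        // Join (or poll) when the result is actually needed.
        let voronoi = handle.join().expect("voronoi thread panicked");
        println!("built: {}", voronoi.is_some());
    }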

1 million sites takes about 1 second on my 10-year-old CPU. I benchmark from 1k to 2M sites, and the time consumption is linear with the input size. I haven't tried beyond 2M, and I suspect optimizations are likely needed for larger inputs.

Do you see time consumption increasing non-linearly at some input?

tirithen commented 1 year ago

Hi! I'm using the library to generate maps like the smaller 10 000 points map below, where I use voronoice as an early step and then mix it with other noise and threshold functions, but my goal is to generate much larger maps.

[Image: 10 000 points map]

    // Assumed imports for this snippet; VoronoiPoint is taken to be an alias for voronoice::Point.
    use anyhow::anyhow;
    use rand::{RngCore, SeedableRng};
    use rand_chacha::ChaCha8Rng;
    use std::time::SystemTime;
    use voronoice::{BoundingBox, Point as VoronoiPoint, VoronoiBuilder};

    let mut random_generator = ChaCha8Rng::seed_from_u64(seed as u64);
    let mut voronoi_points = vec![];

    // Scatter `points_count` random sites uniformly over a `size` x `size` area.
    for _ in 0..points_count {
        let x = (random_generator.next_u32() as f32 / u32::MAX as f32) * size as f32;
        let y = (random_generator.next_u32() as f32 / u32::MAX as f32) * size as f32;
        voronoi_points.push(VoronoiPoint {
            x: x as f64,
            y: y as f64,
        });
    }

    println!("points count: {}", voronoi_points.len());
    let start = SystemTime::now();
    let voronoi = VoronoiBuilder::default()
        .set_sites(voronoi_points)
        .set_bounding_box(BoundingBox::new_centered_square(size as f64 * 2.0))
        .set_lloyd_relaxation_iterations(10)
        .build()
        .ok_or(anyhow!("Failed to generate voronoi"))?;
    println!("voronoi took: {:?}", start.elapsed());

In the case above it is fast, as you mention: it takes 142.384325ms.

I then benchmarked the intended goal of 4 000 000 points, and the Voronoi generation took 122.498159366s, which is slower than I had in mind, especially as only one of 16 CPU threads is utilized.

These measurements were made on a laptop with an 8-core (16-thread) AMD Ryzen 7 5800U CPU and 16 GB of RAM.

That said, I do have other parts of my code generating the above image that are also way too slow at the moment, also due to lack of parallelism.

Maybe splitting the Voronoi diagram into sections and then stitching them together, similar to this blog post on parallel Poisson disk sampling, is a strategy that could work for utilizing more of the CPU threads?

andreesteve commented 1 year ago

Cool picture! I love maps. The reason I wrote this library was to generate Voronoi-based maps!

Can you confirm you are getting these numbers compiling for release (i.e. not debug, with optimizations enabled)?

10 Lloyd iterations is essentially generating the Voronoi graph 10 times. The Lloyd iterations are not parallelizable, I believe, since the output of one iteration is the input of the next. This means that whatever time it takes to process 4M points once, multiplied by 10, is roughly the total time I would expect.

So as a quick workaround, it might be interesting to compare the results of 2 Lloyd iterations vs 10. For example, 100 points with 2 iterations on the left and 10 on the right:

I think the more points you have (for the same area), the faster it converges. A higher number of iterations usually helps smooth the edges, but in your case you will likely make all of them ocean/water, so if they are bigger/less detailed it shouldn't matter (at least that is what I do on my maps).
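
Something like this rough, untested sketch (reusing the ChaCha8Rng setup from your snippet; the sizes and seed are arbitrary) would let you compare the timings of the two settings on identical sites:

    use std::time::Instant;
    use rand::{RngCore, SeedableRng};
    use rand_chacha::ChaCha8Rng;
    use voronoice::{BoundingBox, Point, VoronoiBuilder};

    fn main() {
        let size = 1000.0_f64;

        for iterations in [2, 10] {
            // Re-create the exact same sites for each run so the comparison is fair.
            let mut rng = ChaCha8Rng::seed_from_u64(42);
            let sites: Vec<Point> = (0..10_000)
                .map(|_| Point {
                    x: rng.next_u32() as f64 / u32::MAX as f64 * size,
                    y: rng.next_u32() as f64 / u32::MAX as f64 * size,
                })
                .collect();

            let start = Instant::now();
            let voronoi = VoronoiBuilder::default()
                .set_sites(sites)
                .set_bounding_box(BoundingBox::new_centered_square(size * 2.0))
                .set_lloyd_relaxation_iterations(iterations)
                .build();
            // Each iteration rebuilds the full diagram, so the time should scale
            // roughly linearly with the iteration count.
            println!(
                "{iterations} Lloyd iterations: {:?} (built: {})",
                start.elapsed(),
                voronoi.is_some()
            );
        }
    }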

To help me compare numbers with my local results, would you mind running cargo bench on this library's repo and sharing the results here? For reference, this is what I get for 5M and 10M points (the benchmark runs without Lloyd iterations):

very_large/5,000,000 random sites                        
                        time:   [5.9112 s 5.9975 s 6.0951 s]

very_large/10,000,000 random sites                        
                        time:   [13.000 s 13.331 s 13.640 s]

Adding multi-thread support is an interesting project. It might be easier to do it on the application side, though. For example, for maps, you could generate separate Voronoi graphs in parallel and stitch them together by making all sites on the hull/edge be water (and the bounding box size would be proportional to the size of the island/continent you want to generate).
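
A very rough, untested sketch of that idea (tile placement and the actual stitching are left out, and tile_sites is a made-up helper):

    use std::thread;
    use rand::{RngCore, SeedableRng};
    use rand_chacha::ChaCha8Rng;
    use voronoice::{BoundingBox, Point, Voronoi, VoronoiBuilder};

    // Made-up per-tile generator: random sites in tile-local coordinates
    // centered on the origin; each tile gets its own seed.
    fn tile_sites(seed: u64, count: usize, tile_size: f64) -> Vec<Point> {
        let mut rng = ChaCha8Rng::seed_from_u64(seed);
        (0..count)
            .map(|_| Point {
                x: (rng.next_u32() as f64 / u32::MAX as f64 - 0.5) * tile_size,
                y: (rng.next_u32() as f64 / u32::MAX as f64 - 0.5) * tile_size,
            })
            .collect()
    }

    fn main() {
        let tile_size = 1000.0;
        let tiles = 4;

        // Each tile is an independent Voronoi build, so the tiles can run on
        // separate threads; stitching (e.g. forcing hull cells to be water)
        // would happen afterwards on the application side.
        let handles: Vec<_> = (0..tiles)
            .map(|t| {
                thread::spawn(move || -> Option<Voronoi> {
                    VoronoiBuilder::default()
                        .set_sites(tile_sites(t as u64, 1_000_000, tile_size))
                        .set_bounding_box(BoundingBox::new_centered_square(tile_size))
                        .set_lloyd_relaxation_iterations(2)
                        .build()
                })
            })
            .collect();

        for (t, handle) in handles.into_iter().enumerate() {
            let voronoi = handle.join().expect("tile thread panicked");
            println!("tile {t} built: {}", voronoi.is_some());
        }
    }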

I will profile the library with 5M sites and see where the time goes. If I recall correctly, more than half of it is spent in the Delaunay triangulation, so we may end up having to ask for multi-thread support in the upstream library as well.

tirithen commented 1 year ago

Thanks, yes, maps are nice. I find it interesting how Dwarf Fortress and similar games use procedural generation for their maps. Then I found the beautiful Mapgen4 with the related articles from Red Blob Games, which impressed me a lot, especially as the articles explain quite a lot of good tricks for generating the various elements of a map.

I did run my measurements with cargo run --release. When I re-ran them today I got twice the speed of last time, though; I think I found a CPU setting that was in power save mode during the earlier runs. I now got 60 milliseconds for 10 000 points and 62 seconds for the intended 4 000 000 points, which is still too slow, as I also need to run other noise functions and eventually some pathfinding logic for rivers etc.

Thank you for the images and the explanation of the Lloyd iterations. The thing is that the one on the right looks better, as I want somewhat even spacing between the points, which seems to reduce the amount of jagged edges. I have thought about using Poisson disk sampling instead of purely random points as the initial sites, to possibly reduce the need for Lloyd relaxation. Whether that is actually faster, I'm not sure yet.

Another idea that came to mind is to reduce the density of the sea points, since I'm not currently interested in the variation there.
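
Something like this rough, untested sketch is what I have in mind: a jittered grid as a cheap stand-in for Poisson disk sampling, plus dropping most of the points far from the center, which I treat as sea (the function name, the 35% radius, and the 20% keep rate are just made-up values):

    use rand::{RngCore, SeedableRng};
    use rand_chacha::ChaCha8Rng;
    use voronoice::Point;

    // Roughly evenly spaced sites from a jittered grid (a cheap stand-in for
    // Poisson disk sampling), keeping only a fraction of the points far from
    // the center, where I expect only sea.
    fn generate_sites(seed: u64, size: f64, cell: f64) -> Vec<Point> {
        let mut rng = ChaCha8Rng::seed_from_u64(seed);
        let mut rand01 = move || rng.next_u32() as f64 / u32::MAX as f64;

        let cells_per_side = (size / cell) as usize;
        let mut sites = Vec::new();

        for gy in 0..cells_per_side {
            for gx in 0..cells_per_side {
                // One jittered point per grid cell keeps the spacing fairly even.
                let x = (gx as f64 + rand01()) * cell - size / 2.0;
                let y = (gy as f64 + rand01()) * cell - size / 2.0;

                // Thin out the outer (sea) region: keep only ~20% of its points.
                let is_outer = (x * x + y * y).sqrt() > size * 0.35;
                if !is_outer || rand01() < 0.2 {
                    sites.push(Point { x, y });
                }
            }
        }
        sites
    }

    fn main() {
        let sites = generate_sites(7, 1000.0, 10.0);
        println!("kept {} sites", sites.len());
    }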

Stitching by island as you suggested might also be an option. I'm currently aiming for a single large landmass, but even that should be possible to stitch, I think.

While I have done web development for many years, this kind of coding is still a bit new to me, so there are lots of new things to orient myself around: algorithm efficiency, when to use which approach, and so on.

My cargo bench results seem to be very similar to yours:

very_large/5,000,000 random sites                                        
                        time:   [6.0275 s 6.0991 s 6.1641 s]
very_large/10,000,000 random sites
                        time:   [13.544 s 13.719 s 13.887 s]

Here is the full log in case it is handy: bench.log

By the way, which tools do you use for profiling Rust code? flamegraph, or are there better ones?

andreesteve commented 1 year ago

By the way, which tools do you use for profiling Rust code? flamegraph, or are there better ones?

I haven't used flamegraph, but skimming through it, it looks like a GUI for the report from perf.

Here's how I am profiling the library (run the command below once without perf first, so you don't end up profiling the compiler):

RUSTFLAGS=-g perf record --call-graph=dwarf cargo run --release --example svg -- -s 5000000

The RUSTFLAGS=-g is there to make the compiler emit debug symbols, since I am compiling for release and don't want to change the Cargo file.

You can then get the call graph with associated CPU samples by running:

perf report

For more, see https://nnethercote.github.io/perf-book/profiling.html

So, as I alluded to before, most of the cost is in the library that generates the Delaunay graph. We might need to go there to figure out whether it makes sense to add multi-thread support, or perhaps come up with a new library.

In voronoice itself, as shown below, some time is spent in memory allocations and some in EdgesAroundSiteIterator. I guess some of that is parallelizable, but realistically we won't see much gain in total time, since that is not where the bulk of it is spent.

-   82.38%     0.18%  svg      svg                   [.] voronoice::Voronoi::new
   - 82.20% voronoice::Voronoi::new
      - 53.18% delaunator::triangulate
         + 27.60% delaunator::Triangulation::legalize
         + 7.41% delaunator::Hull::find_visible_edge (inlined)
         + 3.54% delaunator::sortf (inlined)
         + 2.11% delaunator::Triangulation::add_triangle
         + 1.43% delaunator::Point::orient (inlined)
         + 1.37% delaunator::Hull::hash_edge (inlined)
           0.95% delaunator::find_seed_triangle (inlined)
      - 20.05% voronoice::cell_builder::CellBuilder::build
         - 20.05% voronoice::cell_builder::CellBuilder::build_cells (inlined)
            - 7.45% <alloc::vec::Vec<T,A> as core::iter::traits::collect::Extend<T>>::extend (inlined)
                 <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend (inlined)
               + alloc::vec::Vec<T,A>::extend_desugared (inlined)
            - 6.62% voronoice::cell_builder::CellBuilder::clip_cell
               + 4.61% alloc::vec::Vec<T,A>::push (inlined)
               + 0.76% <core::iter::adapters::take::Take<I> as core::iter::traits::iterator::Iterator>::next (inlined)
                 0.51% voronoice::cell_builder::CellBuilder::is_vertex_inside_bounding_box (inlined)
              1.85% voronoice::utils::site_of_incoming (inlined)
      + 5.41% voronoice::cell_builder::CellBuilder::new
      + 3.56% core::iter::traits::iterator::Iterator::collect (inlined)