amethyst / hibitset

Hierarchical bit set container

Improve parallel iterators splitting behaviour #19

Closed: WaDelma closed this 6 years ago

WaDelma commented 6 years ago

Improves the parallel iterator's splitting behaviour and refactors it.

The improvement:
Before: `1000000011111111` was split into `10000000` and `11111111`.
After: `1000000011111111` is split into `10000000111` and `11111`.
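
Roughly, the difference between the two strategies looks like this (an illustrative sketch only, not the PR's code; the helper names are made up and the real implementation works on the hierarchical layers, so its exact boundaries can differ slightly):

```rust
// Illustrative sketch only, not the crate's code: a toy model of the two
// splitting strategies, operating on the indices of the set bits.

/// Old behaviour: split the remaining index range at its midpoint,
/// ignoring where the set bits actually are.
fn split_midpoint(range: std::ops::Range<usize>) -> usize {
    range.start + (range.end - range.start) / 2
}

/// New behaviour, roughly: split near the average index of the set bits,
/// so both halves receive a comparable number of ones.
fn split_near_average_ones(ones: &[usize]) -> usize {
    ones.iter().sum::<usize>() / ones.len()
}

fn main() {
    // `1000000011111111` read left to right: ones at these indices.
    let ones = [0, 8, 9, 10, 11, 12, 13, 14, 15];

    // Old: boundary 8, leaving 1 one on the left and 8 on the right.
    println!("midpoint boundary: {}", split_midpoint(0..16));

    // New (toy version): boundary 10. The real average_ones works on the
    // hierarchical layers and lands at 11 in the example above, but either
    // way the ones end up divided roughly in half between the two sides.
    println!("average-ones boundary: {}", split_near_average_ones(&ones));
}
```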



torkleyy commented 6 years ago

This looks really interesting! I can't say when I'll have time to review it; in case I forget, feel free to ping me.

torkleyy commented 6 years ago

@csherratt I would really appreciate it if you could also take a look at this PR. @WaDelma Given that I'm very busy at the moment, it's possible that I won't be able to review this until Friday or Saturday.

WaDelma commented 6 years ago

Np, I am not in a hurry!

WaDelma commented 6 years ago

I would do some benchmarking, but my main PC died and I am using a crummy laptop.

torkleyy commented 6 years ago

This PR looks good to me, thank you! Would still be nice if @kvark or @csherratt could take a look.



kvark commented 6 years ago

Does the benchmark show a difference? If not, we should have a benchmark that does.

torkleyy commented 6 years ago

Ah right, we need a benchmark!

torkleyy commented 6 years ago

I asked for one at first, but then forgot about it.

WaDelma commented 6 years ago

Benchmark results on an AMD Ryzen 5 1600 (6 cores / 12 SMT threads @ 3.2 GHz), 64-bit Fedora 27.

par_n is the parallel iterator with splitting of the n highest layers.
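
For intuition, here is a back-of-the-envelope sketch of what splitting only the n highest layers means (my reading of the four-layer structure, assuming 64-bit usize blocks as on the benchmark machine; the constants and the par_n mapping are my interpretation, not code from the crate):

```rust
// Back-of-the-envelope sketch, not crate code: with 64-bit usize blocks,
// one set bit on a given layer summarises this many layer-0 indices.
// If splitting stops at that layer, that is the smallest index range a
// single rayon task can end up with.
const BITS: usize = 64; // bits per usize block

fn main() {
    let per_layer3_bit = BITS * BITS * BITS; // 262_144 indices
    let per_layer2_bit = BITS * BITS;        // 4_096 indices
    let per_layer1_bit = BITS;               // 64 indices

    println!("split 1 layer  (par_1): units of {} indices", per_layer3_bit);
    println!("split 2 layers (par_2): units of {} indices", per_layer2_bit);
    println!("split 3 layers (par_3): units of {} indices", per_layer1_bit);
}
```

Coarser splitting (par_1) creates fewer rayon tasks, which fits the overhead pattern for small bitsets in the numbers below.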

There seems to be quite a bit of overhead from parallelism, but once there are enough bits it's more efficient.

I didn't notice any change between the old splitting strategy and the new one, but the new one should work better if the bitset is unevenly distributed. (I should probably figure out a benchmark for that.)

|         | seq     | Par_3  | Par_2  | Par_1  |
|--------:|--------:|-------:|-------:|-------:|
| 100     | 690     | 19850  | 19045  | 9721   |
| 1000    | 5611    | 34946  | 27457  | 12587  |
| 10000   | 48782   | 56505  | 34214  | 32321  |
| 100000  | 331948  | 101915 | 66843  | 159194 |
| 1000000 | 1162002 | 234203 | 201314 | 614811 |

(Charts: iter benchmarks on log-log and linear scales.)

WaDelma commented 6 years ago

After sleeping on it, I also realised that the benchmark is flawed: the per-iteration computation is minimal, which means it's mostly measuring the overhead of rayon.

I need to make a benchmark with more realistic per-iteration computation, as that should make splitting on all layers more efficient. I should probably also benchmark with unbalanced per-iteration computation.
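
By "payload" I mean something like the sketch below (illustrative only, not the actual benchmark code; the payload helper, the cost knob, and the plain index range standing in for the bitset are all made up):

```rust
// Sketch only: a stand-in per-item payload, so the measurement isn't
// dominated by rayon's splitting/stealing overhead. A plain index range
// stands in for iterating the hierarchical bitset.
use rayon::prelude::*;
use std::hint::black_box;

/// Burn roughly `cost` units of CPU work per item (made-up helper).
fn payload(index: usize, cost: usize) -> usize {
    let mut acc = index;
    for _ in 0..cost {
        acc = acc.wrapping_mul(31).wrapping_add(7);
    }
    acc
}

fn main() {
    let n = 1_000_000;
    let cost = 1_000;
    (0..n).into_par_iter().for_each(|i| {
        black_box(payload(i, cost));
    });
}
```

With `cost = 0` this degenerates back into measuring mostly rayon's task-management overhead.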

WaDelma commented 6 years ago

I benchmarked different levels of splitting with different payloads. It seems that if there is per-entity work, there is less difference between the numbers of layers split on. I reverted it to splitting all 3 layers, as that seems more stable and efficient when you have lots of entities (and I think some rayon blog post mentioned that you should split as much as you can... not sure about this, though).

Benchmarking is annoying. I just realised I should probably also benchmark with different numbers of cores.

|         | Par_2  | Par_3  | Par_2_payload_100 | Par_3_payload_100 | Par_2_payload_1000 | Par_3_payload_1000 | Par_2_payload_10000 | Par_3_payload_10000 |
|--------:|-------:|-------:|------------------:|------------------:|-------------------:|-------------------:|--------------------:|--------------------:|
| 100     | 19048  | 20035  | 20771             | 21419             | 25701              | 30633              | 87458               | 126330              |
| 1000    | 27744  | 35203  | 34103             | 43724             | 130098             | 100184             | 1063586             | 647348              |
| 10000   | 34135  | 57760  | 142638            | 135950            | 1050928            | 684282             | 6240367             | 6146004             |
| 100000  | 67404  | 100416 | 800828            | 818030            | 6189107            | 6077761            | 59545648            | 58536681            |
| 1000000 | 200724 | 235456 | 7727451           | 5184305           | 41417368           | 40561180           | 403868928           | 393994208           |

(Chart: parallel payload iteration benchmark.)

torkleyy commented 6 years ago

@WaDelma what's the status of this PR?



WaDelma commented 6 years ago

As long as you are happy with my benchmarking and don't find anything else wrong with the PR, I think this should be ready.

torkleyy commented 6 years ago

@kvark Are you happy with this PR? Or do we still need some more diagrams and comments?

WaDelma commented 6 years ago

Whoops... I tried to clear a comment and accidentally clicked "Close and comment".

torkleyy commented 6 years ago

As @kvark said, it would be nice to have the commits squashed or reorganized. Otherwise this is really great, so please r=me once you've cleaned up the commits.

bors delegate=WaDelma

bors[bot] commented 6 years ago

:v: WaDelma can now approve this pull request

WaDelma commented 6 years ago

Should I remove some of the benchmarks? They seem to slow CI down.

Also, I am not that happy with average_ones, as its results are a bit unintuitive: 01001 is split into 01, 001 and not 0100, 1. It doesn't affect its use, but it makes for a weird API and explanation. Dunno if it's worth changing.
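
For reference, here is a toy calculation that reproduces the example (illustrative only, not the crate's code; it assumes the boundary lands near the integer mean of the set-bit indices, whereas the real average_ones works per layer):

```rust
// Illustrative only: why `01001` splits into `01` and `001`.
fn main() {
    // `01001` read left to right: ones at indices 1 and 4.
    let ones = [1usize, 4];
    let boundary = ones.iter().sum::<usize>() / ones.len(); // (1 + 4) / 2 = 2
    // Cutting before index 2 gives `01` | `001`, not the visually
    // "even-looking" cut `0100` | `1`.
    println!("boundary: {}", boundary);
}
```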

kvark commented 6 years ago

A follow-up is fine. As for the benches: I don't think we should rely on CI for them anyway, so we can turn the par iter benches off there.


WaDelma commented 6 years ago

bors r=me

torkleyy commented 6 years ago

@WaDelma You can approve a PR for me by writing "bors r=torkleyy". (That's what I meant by saying "r=me").

WaDelma commented 6 years ago

ah, sorry. Don't really know how bors works :P

WaDelma commented 6 years ago

bors r-

bors[bot] commented 6 years ago

Canceled

WaDelma commented 6 years ago

bors r=torkleyy

bors[bot] commented 6 years ago

Build succeeded