Bears-R-Us / arkouda

Arkouda (αρκούδα): Interactive Data Analytics at Supercomputing Scale :bear:

Regressions from #1604 (#1694)

Open ronawho opened 2 years ago

ronawho commented 2 years ago

1604 ("arkouda_server_refactor with metrics integration") has caused some correctness and performance issues.

16-node-cs-hdr performance is down significantly, with argsort dropping from 15 GiB/s to 10 GiB/s.

16-node-xc hasn't completed in a few days. It looks like the server isn't actually shutting down, so we're left with orphaned servers taking up all the compute nodes.

hokiegeek2 commented 2 years ago

@ronawho determined the 16-node-xc test failures are due to a Chapel bug. I am submitting a patch soon that will prevent invocation of exit(0) unless there are 2..n ServerDaemons, meaning that the default arkouda_server configuration will not call exit(0). Fixing the root cause will be the subject of another issue.
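Roughly, the guard will look something like the sketch below (a minimal sketch only; `numServerDaemons` and `stopAllDaemons` are illustrative placeholder names, not the actual ServerDaemon API):

```chapel
// Illustrative sketch only -- numServerDaemons and stopAllDaemons are
// placeholder names, not the real arkouda ServerDaemon interface.
config const numServerDaemons = 1;   // default arkouda_server configuration

proc stopAllDaemons() {
  // placeholder for the orderly per-daemon shutdown logic
}

proc shutdownServer() {
  stopAllDaemons();
  // Only invoke exit(0) when 2..n ServerDaemons are configured, so the
  // default single-daemon configuration never reaches the buggy exit(0) path.
  if numServerDaemons >= 2 then
    exit(0);
}

shutdownServer();
```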

ronawho commented 2 years ago

Thanks to some digging from @bmcdonald3, we believe the perf regression is caused by an existing false-sharing issue, and this PR just happens to change allocations in a way that exposed it.

We've seen things like this before in https://github.com/Bears-R-Us/arkouda/issues/982 and https://github.com/Bears-R-Us/arkouda/pull/783/commits/c668ef5f331e170c87a844fc4124f8424fc301a3.

I'll start digging into the root cause of the false-sharing. In the meantime I can probably come up with a patch that gives us the current functionality but changes allocation patterns to avoid it in nightly.

Ethan-DeBandi99 commented 1 year ago

@ronawho, @hokiegeek2 - is this issue still needed, or has it been addressed through other work?

ronawho commented 1 year ago

Still needed. This is likely caused by false-sharing and requires Chapel changes to address (still on my TODO list, but pretty far back).

hokiegeek2 commented 1 year ago

@ronawho the performance regression you reported is pretty significant, isn't it? Will the false-sharing fix resolve it?

ronawho commented 1 year ago

Yeah, I believe so

Not captured in this issue, but we discussed this a bit at the in-person meetup. The basic problem is that small changes in memory layout can cause unrelated variables/bytes on the same cacheline to "conflict" with each other, which can lead to false-sharing and this type of performance variability. Note that in nightly we're hovering around 14-16 GiB/s and not the 10 we saw right after this issue was opened, so thanks to subsequent layout changes the perf hit is not that large. In general I'm also not worried about this for production workflows; it mostly impacts very simple benchmarks.
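To make the mechanism concrete, here's a small standalone Chapel sketch (not arkouda code; the record names, padding size, and iteration count are just illustrative) of two tasks incrementing logically-unrelated counters, first adjacent on the same cacheline and then padded apart:

```chapel
// Standalone Chapel sketch (not arkouda code) of false sharing: two tasks
// increment logically-unrelated counters.  When the counters sit on the
// same cacheline, the line ping-pongs between cores; padding them apart
// removes the conflict.
use Time;

config const iters = 100000000;

record adjacentCounters {
  var a, b: int;                 // a and b very likely share a cacheline
}

record paddedCounters {
  var a: int;
  var pad: 7*int;                // ~56 bytes of padding pushes b onto another cacheline
  var b: int;
}

proc bench(ref p): real {
  var t: stopwatch;
  t.start();
  cobegin with (ref p) {
    for 1..iters do p.a += 1;    // task 1 hammers p.a
    for 1..iters do p.b += 1;    // task 2 hammers p.b
  }
  t.stop();
  return t.elapsed();
}

var adj: adjacentCounters;
var padded: paddedCounters;
writeln("adjacent counters: ", bench(adj), " s");
writeln("padded counters:   ", bench(padded), " s");
```

On typical hardware the padded version tends to run noticeably faster even though both variants do the same amount of work, which is exactly the kind of layout sensitivity described above.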

hokiegeek2 commented 1 year ago

@ronawho gotcha, cool, thanks for the explanation. If you're not worried about it, that works for me.

hokiegeek2 commented 1 year ago

@ronawho did you ever fix the Chapel false-sharing bug that is the root cause of these regressions you reported?

ronawho commented 1 year ago

No, still on the queue