chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

User interface for distributed matrix multiplication #15784

Open rahulghangas opened 4 years ago

rahulghangas commented 4 years ago

Just creating the issue for now; I'll put up specific questions subsequently.

rahulghangas commented 4 years ago

@ben-albrecht need label: gsoc: linearalgebra

rahulghangas commented 4 years ago

  • Should window size be fixed or a parameter with a default value?
  • Should domains of both input matrices have the same distribution?
  • Should the return matrix of a distributed dot be distributed or DefaultRectangular?
  • What should be the communication method, naive Forall or Bulk Transfer?
  • What happens when domains of input matrices are distributed on a single locale?
ben-albrecht commented 4 years ago

@rahulghangas - I would suggest checking https://github.com/chapel-lang/chapel/blob/master/doc/rst/developer/bestPractices/ContributorInfo.rst#design for tips about driving design discussions forward.

One exercise that may help here is to propose answers to the questions you are listing as a starting point for others to discuss.

e-kayrakli commented 4 years ago

@rahulghangas -- the questions asked here should be answered in order for us to move forward with https://github.com/chapel-lang/chapel/pull/15976. As @ben-albrecht also recommended, I'd suggest proposing some answers/options for those questions that you asked above.

rahulghangas commented 4 years ago
Should window size be fixed or a parameter with a default value?

I think window size should be a parameter rather than a fixed value, so the user can use a smaller window if they face memory restrictions. However, if we want to have a default window size, I am inclined towards using the maximum value (for performance reasons), in which case the window logic can be removed completely.


Should domains of both input matrices have the same distribution?

I'll address this one before the next question, because my answer there depends on this. I say yes, both domains should have the exact same distribution. If not, the current distributed MM logic needs to be changed to account for it (which should be a minor fix).


Should the return matrix of a distributed dot be distributed or DefaultRectangular?

Yes, it should be distributed. The answer above gives rise to two branches:

  • If input matrices have the same distribution, then the output would simply have the same distribution.
  • If they don't, maybe use the same distribution as the first matrix? I'll leave this one open-ended.


What should be the communication method, naive Forall or Bulk Transfer?

We're already going forward with the naive communication (it currently outperforms bulk transfer); maybe we can also add performance tests for bulk to keep track of it in the nightly perf graphs. See the sketch at the end of this comment for what the naive pattern looks like.


What happens when domains of input matrices are distributed on a single locale?

We can add a small check and simply call the local dot instead of creating the redundant localDomain(s) that the distributed dot would.
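To make the naive pattern concrete, here is a minimal sketch (assuming a Block distribution purely as an example; `localDot` in the comment is a hypothetical helper name, not an existing LinearAlgebra routine):

```chapel
use BlockDist;

config const n = 8;

const Space = {1..n, 1..n};
const D = Space dmapped Block(boundingBox=Space);

var A, B, C: [D] real;
A = 1.0;
B = 2.0;

// Single-locale fast path (sketched as a comment): if both inputs map to
// one common locale, a hypothetical localDot(A, B) could be called there
// instead of going through the distributed machinery:
//   if A.targetLocales().size == 1 && B.targetLocales().size == 1 then ...

// Naive communication: each locale iterates over its part of C and pulls
// the remote elements of A and B it needs one at a time (fine-grained
// GETs), with no windowed copies or bulk transfers.
forall (i, j) in D do
  for k in 1..n do
    C[i, j] += A[i, k] * B[k, j];

writeln(C[1, 1]); // 2.0 * n: a row of ones dotted with a column of twos
```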

rahulghangas commented 4 years ago

@ben-albrecht @e-kayrakli @LouisJenkinsCS @dgarvit thoughts?

e-kayrakli commented 4 years ago

I think window size should be a parameter rather than a fixed value, so the user can use a smaller window if they face memory restrictions. However, if we want to have a default window size, I am inclined towards using the maximum value (for performance reasons), in which case the window logic can be removed completely.

I disagree. I want the user to be able to use dot with distributed arrays without thinking about windowSize. I view this argument as power-user-facing, and expect it not to be used much. I think this is a matter of documentation. We can explain this in the documentation of the specific argument, or as a separate note along the lines of: "the distributed implementation creates copies of remote chunks and as such can consume memory; in memory-limited use cases, a smaller windowSize can be used to reduce the memory overhead at the expense of performance."

I'll address this one before the next question, because my answer there depends on this. I say yes, both domains should have the exact same distribution. If not, the current distributed MM logic needs to be changed to account for it (which should be a minor fix).

Sounds reasonable. We can always fall back to a naive matrix multiplication where we don't know how to specialize for a specific distribution. This should also probably be documented as a note. Another alternative is to emit a warning at runtime about this, but left to my own devices I'd just document it and not bother creating a warning message.

Yes, it should be distributed. The answer above gives rise to two branches:

  • If input matrices have the same distribution, then the output would simply have the same distribution.
  • If they don't, maybe use the same distribution as the first matrix? I'll leave this one open-ended.

No objection from me. No strong opinions on the second case either. If things are not perfect w.r.t. distribution alignment, I think we can do whatever as long as we document it. One option (maybe going forward, and not as an initial step) is to have an overload of dot that takes the output array as an argument. This way we can allow the users to choose exactly how to distribute the output array.
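For concreteness, a minimal sketch of what such an overload might look like (the signature is hypothetical, not the module's current API; the body is just the naive multiplication as a placeholder):

```chapel
use BlockDist;

config const n = 4;

// hypothetical overload: the caller allocates C with whatever
// distribution they want, and dot writes the product into it
proc dot(const ref A: [] real, const ref B: [] real, ref C: [] real) {
  forall (i, j) in C.domain do
    for k in A.domain.dim(1) do
      C[i, j] += A[i, k] * B[k, j];
}

const Space = {1..n, 1..n};
const D = Space dmapped Block(boundingBox=Space);

var A, B: [D] real = 1.0;
var C: [D] real;  // the caller chooses the output distribution up front
dot(A, B, C);
writeln(C[1, 1]); // n.0
```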

We're already going forward with the naive communication (it currently outperforms bulk transfer); maybe we can also add performance tests for bulk to keep track of it in the nightly perf graphs.

forall is fine. Adding performance tests would be good, but it is a side project and can happen at a different point in time.

What happens when domains of input matrices are distributed on a single locale?

Do you mean they are DefaultRectangular? Or Block-distributed, but with targetLocales set to use only a single locale? There may be some differences in implementation between those cases. In either case, you can (and probably should) check whether the two arrays use the same locale; i.e., it may be two non-distributed arrays that sit on different locales.

Regardless, I think this is related to the question of whether the inputs should have the same distribution, and I believe these cases can also fall into the naive support, at least as a first step. Combinations of distributions are limitless, after all. An ultimate implementation of matrix multiplication would look like a bunch of if/else ifs that check for the different distribution types we can optimize for, followed by an else that does the naive multiplication. I think having a single if that checks for perfect alignment and an else that does the naive multiplication is a great step towards that goal; we can add else ifs in between as we discover more opportunities.
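A rough sketch of that dispatch skeleton, under assumed helper names (`alignedDot`, `naiveDot`, and `naiveKernel` are invented for illustration; a real alignment check would compare the arrays' distributions, not just their index sets):

```chapel
use BlockDist;

config const n = 8;

// shared placeholder kernel: the naive fine-grained multiplication
proc naiveKernel(const ref A: [] real, const ref B: [] real,
                 ref C: [] real) {
  forall (i, j) in C.domain do
    for k in A.domain.dim(1) do
      C[i, j] += A[i, k] * B[k, j];
}

// specialized path: a real version would exploit the known alignment;
// here it just stands in with the shared kernel
proc alignedDot(const ref A: [] real, const ref B: [] real, ref C: [] real) {
  naiveKernel(A, B, C);
}

// fallback path: correct for any combination of distributions
proc naiveDot(const ref A: [] real, const ref B: [] real, ref C: [] real) {
  naiveKernel(A, B, C);
}

proc dotDispatch(const ref A: [] real, const ref B: [] real,
                 ref C: [] real) {
  // NOTE: `==` on domains compares index sets only; a real check would
  // also verify that the distributions line up
  if A.domain == B.domain && A.domain == C.domain then
    alignedDot(A, B, C);
  // else if <another combination we know how to optimize> ...
  else
    naiveDot(A, B, C);
}

const Space = {1..n, 1..n};
const D = Space dmapped Block(boundingBox=Space);

var A, B: [D] real = 1.0;
var C: [D] real;
dotDispatch(A, B, C);
writeln(C[1, 1]); // n.0
```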

rahulghangas commented 4 years ago

I disagree. I want the user to be able to use dot with distributed arrays without thinking about windowSize. I view this argument as power-user-facing, and expect it not to be used much. I think this is a matter of documentation. We can explain this in the documentation of the specific argument, or as a separate note along the lines of: "the distributed implementation creates copies of remote chunks and as such can consume memory; in memory-limited use cases, a smaller windowSize can be used to reduce the memory overhead at the expense of performance."

Oh right, I forgot we have discussed this in the meetings a couple of times already. Yes, the distributed dot should follow the same interface as the local dot, and windowSize can be an argument to a helper function.
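A small sketch of that split, with hypothetical names (`dotHelper` and the default-window policy are invented for illustration): the user-facing dot mirrors the local interface, and only the internal helper takes a windowSize.

```chapel
use BlockDist;

config const n = 4;

// user-facing: same signature as the local dot, no windowSize
proc dot(const ref A: [?Dom] real, const ref B: [Dom] real) {
  // default to the largest window for performance, per the discussion
  return dotHelper(A, B, windowSize=n);
}

// internal helper: power users (or tests) can pass a smaller window to
// cap the memory spent on copies of remote chunks
proc dotHelper(const ref A: [?Dom] real, const ref B: [Dom] real,
               windowSize: int) {
  var C: [Dom] real;
  // the windowed copy logic would live here; this placeholder ignores
  // windowSize and just does the naive multiplication
  forall (i, j) in Dom do
    for k in Dom.dim(1) do
      C[i, j] += A[i, k] * B[k, j];
  return C;
}

const Space = {1..n, 1..n};
const D = Space dmapped Block(boundingBox=Space);

var A, B: [D] real = 1.0;
const C = dot(A, B);
writeln(C[1, 1]); // n.0
```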

No objection from me. No strong opinions on the second case either. If things are not perfect w.r.t. distribution alignment, I think we can do whatever as long as we document it. One option (maybe going forward, and not as an initial step) is to have an overload of dot that takes the output array as an argument. This way we can allow the users to choose exactly how to distribute the output array.

I was thinking maybe leave them distributed as is, and provide a procedure called gather that gathers a matrix up into a specific distribution, or just onto the root locale if needed. This is somewhat influenced by what @dgarvit proposed in the meeting. But again, I'm not sure if this is the best approach.
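For illustration, a minimal sketch of one way to read that gather proposal (the procedure and its semantics are hypothetical): copy a distributed matrix into a non-distributed one on the calling locale, letting Chapel's whole-array assignment do the communication.

```chapel
use BlockDist;

config const n = 4;

// hypothetical gather: copy a (possibly distributed) 2D matrix into a
// DefaultRectangular array allocated on the locale where gather runs
proc gather(const ref A: [?D] ?t) {
  const localD = {D.dim(0), D.dim(1)}; // same indices, no distribution
  var B: [localD] t = A;               // whole-array assignment moves data
  return B;
}

const Space = {1..n, 1..n};
const Dom = Space dmapped Block(boundingBox=Space);
var A: [Dom] real = 1.0;

// gather onto the root locale by calling from there
on Locales[0] {
  const localA = gather(A);
  writeln(localA.locale); // LOCALE0: the copy lives on the root locale
}
```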