Distributed training - Githubissues

pradhyumna85 commented 1 year ago

Is it possible to do distributed training on multiple GPUs and machines using SciANN? For example, can something like Horovod or TensorFlow's distributed training be used readily?

ehsanhaghighat commented 1 year ago

It should be possible, since the backend is all Keras, but I have never worked on it.

pradhyumna85 commented 1 year ago

Hi @ehsanhaghighat, I tried to run SciANN in a distributed setting on a g4dn.12xlarge instance with 4 NVIDIA T4 GPUs.

I tried:

TensorFlow's default distributed training, which fails for some reason, probably because of SciANN's custom training routines.
Horovod, which works seamlessly.

I can confirm that Horovod works very well with SciANN, so people looking for distributed training with SciANN should be able to use it.
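
For readers looking for a starting point, here is a minimal sketch of the usual Horovod + Keras data-parallel pattern. It is not the script used for the run described in this comment, which was never posted in this thread. The network is a plain Keras placeholder rather than a SciANN model, and the helper names (`shard_for_rank`, `scaled_learning_rate`, `train_with_horovod`) are made up for illustration. How the wrapped optimizer and callbacks are best passed into SciANN's own training routine is not confirmed here and would need to be checked against the SciANN source.

```python
def shard_for_rank(array, rank, size):
    """Return the samples that worker `rank` (0-based) out of `size` trains on.

    The remainder is dropped so every worker gets the same number of samples
    and therefore runs the same number of steps per epoch.
    """
    usable = len(array) - len(array) % size
    return array[rank:usable:size]


def scaled_learning_rate(base_lr, size):
    """Linear scaling heuristic: grow the learning rate with the worker count."""
    return base_lr * size


def train_with_horovod(x, y, base_lr=1e-3, epochs=10, batch_size=64):
    """Train a placeholder Keras model on NumPy arrays x (n, d) and y (n, 1)."""
    # Imported lazily so the helpers above work without TensorFlow or Horovod.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each process to a single GPU, chosen by its local rank.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Placeholder network (assumption): stands in for the actual model.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="tanh", input_shape=(x.shape[1],)),
        tf.keras.layers.Dense(1),
    ])

    # Wrap the optimizer so gradients are averaged across workers.
    optimizer = tf.keras.optimizers.Adam(scaled_learning_rate(base_lr, hvd.size()))
    optimizer = hvd.DistributedOptimizer(optimizer)
    model.compile(optimizer=optimizer, loss="mse")

    callbacks = [
        # Start every worker from the weights of rank 0.
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
        # Average logged metrics across workers at the end of each epoch.
        hvd.callbacks.MetricAverageCallback(),
    ]

    model.fit(
        shard_for_rank(x, hvd.rank(), hvd.size()),
        shard_for_rank(y, hvd.rank(), hvd.size()),
        batch_size=batch_size,
        epochs=epochs,
        callbacks=callbacks,
        verbose=1 if hvd.rank() == 0 else 0,  # only rank 0 prints progress
    )
    return model
```

A script that ends with a call to `train_with_horovod(x, y)` would typically be launched with one process per GPU, for example `horovodrun -np 4 python train.py` on a 4-GPU machine (`train.py` is a hypothetical file name). Each process trains on its own shard of the data, and the gradients are averaged across processes at every step.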

ehsanhaghighat commented 1 year ago

Wow, this is awesome news! Thanks for checking and for your update.

ehsanhaghighat commented 1 year ago

Would you be interested in sharing a simple example, with details on how to use Horovod, in the SciANN repo?

pradhyumna85 commented 1 year ago

@ehsanhaghighat, one more question outside the scope of this issue: is it possible to have Neumann BCs in SciANN? If yes, could you please share how this can be achieved?

ehsanhaghighat commented 1 year ago

There is really no difference between how you implement Neumann and Dirichlet BCs in strong-form PINNs. In our examples, we usually have both types. Note that in the strong form you need to add all BCs, even the natural ones that are automatically satisfied in the weak form. Is that clear?
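
As a sketch of what this means for the loss, under assumed notation (none of these symbols come from SciANN or from this thread): let $u_\theta$ be the network, $\mathcal{N}[u] = f$ the PDE in the interior, $u = g_D$ the Dirichlet condition, and $\partial_n u = g_N$ the Neumann condition, with $N_r$, $N_D$, and $N_N$ collocation points on the respective sets. A typical strong-form PINN loss then penalizes all three residuals in the same way:

```latex
\mathcal{L}(\theta) =
    \frac{1}{N_r} \sum_{i=1}^{N_r} \bigl| \mathcal{N}[u_\theta](x_i) - f(x_i) \bigr|^2
  + \frac{1}{N_D} \sum_{j=1}^{N_D} \bigl| u_\theta(x_j) - g_D(x_j) \bigr|^2
  + \frac{1}{N_N} \sum_{k=1}^{N_N} \bigl| \partial_n u_\theta(x_k) - g_N(x_k) \bigr|^2
```

The only difference between the second and third terms is that the Neumann residual is built from a derivative of the network output instead of the output itself.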

ehsanhaghighat commented 1 year ago

Check this example: https://github.com/sciann/sciann-applications/blob/master/SciANN-Elasticity/Elasticity-Forward.ipynb

BC_left_2, BC_right_2, BC_top_2 are all Neumann type.

pradhyumna85 commented 1 year ago

Yeah, sure, that will be great. Where would you like me to raise the pull request: in the SciANN repo or the sciann-applications repo?

I think we can add a section on distributed training support to the main repo's README, linking to the relevant folder in the sciann-applications repo.

Let me know your thoughts.

pradhyumna85 commented 1 year ago

Thanks a lot @ehsanhaghighat for sharing this.

ehsanhaghighat commented 1 year ago

sciann-applications is where I usually upload all examples.

YunhaoYuan commented 8 months ago

Dear Pradhyumna,

Recently I have been working with a large dataset and a complex model that require multiple GPUs for distributed training. Could you please share an example or a simple demonstration of how to train a SciANN model with Horovod? I was unable to get it working and could not find any examples in either the SciANN or the sciann-applications repo.

Thanks in advance for sharing!

ehsanhaghighat / sciann

Distributed training #85