IBM / terratorch

A Python toolkit for fine-tuning Geospatial Foundation Models (GFMs).
Apache License 2.0

Allow/test usage in PBS/Slurm #144

Open romeokienzler opened 2 months ago

romeokienzler commented 2 months ago

Is your feature request related to a problem? Please describe.
TT (terratorch) is not usable via SLURM/PBS.

Describe the solution you'd like
Allow/test usage in PBS/Slurm.

reported by @biancazadrozny

Joao-L-S-Almeida commented 2 months ago

I can test it, but I need to have access to a SLURM-based resource.

Foxigod commented 1 month ago

I believe I've almost exclusively run terratorch through SLURM. Can you @romeokienzler elaborate on this issue?

biancazadrozny commented 1 month ago

@Foxigod Have you used multiple GPUs?

Foxigod commented 1 month ago

@biancazadrozny Ahh, yes I have, but I did need to modify my submission script. In my .yaml config I kept trainer.devices: auto as in the examples, but I needed to explicitly export CUDA_VISIBLE_DEVICES=0,1,2,3 before calling terratorch fit ... to run on the 4 GPUs I have per node (or at least that seemed to work, whether or not it was strictly necessary). I never tried multi-node experiments with terratorch, however, so I can't speak for that part.
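For reference, a minimal sketch of the kind of single-node submission script this amounts to (job name, resource requests, and config path are placeholders, not anything terratorch prescribes):

#!/bin/bash
#SBATCH --job-name=terratorch-fit
#SBATCH --nodes=1
#SBATCH --gres=gpu:4              # request the 4 GPUs on the node
#SBATCH --time=04:00:00

# make all 4 GPUs visible before launching; trainer.devices stays "auto" in the config
export CUDA_VISIBLE_DEVICES=0,1,2,3

terratorch fit --config my_config.yaml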

Foxigod commented 1 month ago

I just experimented with 2 nodes, and it seems to have worked. I have 4 GPUs per node, and this was amongst the printouts:

0: ----------------------------------------------------------------------------------------------------
0: distributed_backend=nccl
0: All distributed processes registered. Starting with 8 processes
0: ----------------------------------------------------------------------------------------------------
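Roughly, the 2-node launch looked like this under Slurm (a sketch only; it assumes Lightning picks up the ranks from the Slurm environment via srun, with trainer.num_nodes set to 2 in the config to match the allocation, and the same placeholder config path as above):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4       # one task per GPU
#SBATCH --gres=gpu:4

export CUDA_VISIBLE_DEVICES=0,1,2,3

# one terratorch process per task; 2 nodes x 4 GPUs = 8 distributed processes (nccl)
srun terratorch fit --config my_config.yaml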

romeokienzler commented 1 month ago

@takaomoriyama can you please verify and repeat the scale-out and scale-up tests you did on CCC?

takaomoriyama commented 1 month ago

@romeokienzler Sure!

romeokienzler commented 1 month ago

Samy from FZJ managed to run TT on JUWELS using Slurm.

romeokienzler commented 1 month ago

@takaomoriyama has access to JUWELS now, implementing...

romeokienzler commented 2 weeks ago

related to #146

takaomoriyama commented 1 week ago

Created a batch script for Slurm: https://github.com/IBM/terratorch/pull/234. Current results for the sen1floods11_vit workload:

<num_nodes> x <num_gpus> - Execution time / error
------------------------------------------------------
1x1 - 175m32.861s
1x2 - 87m59.467s
1x4 - 46m28.186s
2x1 - Network error: TCPStore() (1/2 clients has joined)
2x2 - 48m45.038s
4x4 - Network error: TCPStore() (4/16 clients has joined)
8x4 - Error: No training batch
16x4 - Error: No training batch
32x4 - Network error: TCPStore() (80/128 clients has joined)

So far, scaling up to 4 GPUs is OK, but we are suffering from two issues: an intermittent network error and a "no training batch" error.

romeokienzler commented 1 week ago

@MinasMayth is helping to get a contact at FZJ to re-run with more data...

romeokienzler commented 1 week ago

@takaomoriyama to re-run (because of outage)

takaomoriyama commented 4 days ago

Here are the reasons for the errors in the table above.

[No data error]

8x4 - Error: No training batch
16x4 - Error: No training batch

These errors occurred because there were not enough batches for all tasks. The Sen1floods11 data contains 252 files for training, and the default batch_size is 16, which gives 252 / 16 ≈ 16 batches. So if we have more than 16 tasks, some tasks receive no training batch. I adjusted the batch size so that every task receives at least one batch; the results are shown in the next comment.
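As a rough check against the table in the next comment: at 8 nodes x 4 GPUs = 32 tasks, 16 batches cannot cover every task, while with batch_size 1 there are 252 batches, which is enough even for 32 x 4 = 128 tasks (about 2 batches per task).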

[TCPStore() error] This still occurs intermittently.

MinasMayth commented 4 days ago

I have not found anyone else who has been running terratorch at JSC except for @Foxigod. We are unsure how exactly the nodes communicate, i.e. whether it is based on some node being recognized as a sort of "root node" or not. If the underlying method used here doesn't account for the InfiniBand islands, then those could explain the TCPStore() error (suggested by Eli).
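One thing worth trying (a sketch only, not verified on JUWELS): pin the rendezvous explicitly to the first allocated node in the batch script, since the TCPStore() rendezvous in torch.distributed connects all ranks to the host given by MASTER_ADDR/MASTER_PORT. The port and config path below are placeholders:

# export a common rendezvous host/port on all nodes before launching
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500          # any free port, identical on all nodes

srun terratorch fit --config my_config.yaml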

takaomoriyama commented 4 days ago

Results for the sen1floods11_vit workload with batch_size 1 (terratorch --data.init_args.batch_size 1), run on the JUWELS cluster at FZJ:

<num_nodes>x<num_gpus> #batch/task  Execn time  Speed up
--------------------------------------------------------
          1x1                  252  87m17.542s     1.00x
          1x2                  126  50m49.549s     1.72x
          1x4                   63  29m34.590s     2.95x
          2x1                  126  59m43.249s     1.46x
          2x2                   63  30m38.232s     2.85x
          2x4                   32  19m43.904s     4.42x
          4x1                   63  35m14.097s     2.48x
          4x4                   16  13m25.957s     6.50x
          8x4                    8  11m11.514s     7.80x
         16x4                    4   9m27.094s     9.24x
         32x4                    2   9m53.797s     8.82x

romeokienzler commented 4 days ago

@takaomoriyama to test with a larger dataset other than sen1floods11_vit.

Please ask @blumenstiel or @paolofraccaro - they most probably have other datasets, e.g., Major TOM or similar.