romeokienzler opened this issue 2 months ago
I can test it, but I need to have access to a SLURM-based resource.
I believe I've almost exclusively run terratorch through SLURM. Can you @romeokienzler elaborate on this issue?
@Foxigod Have you used multiple GPUs?
@biancazadrozny Ahh, yes I have, but I did need to modify my submission script.
In my .yaml config I kept `trainer.devices: auto` as the examples did, but I needed to explicitly `export CUDA_VISIBLE_DEVICES=0,1,2,3` before calling `terratorch fit ...` to run on the 4 GPUs I have per node (or at least that seemed to work, irrespective of its necessity).
However, I never gave multi-node experiments a try with terratorch, so I can't speak for that part.
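For reference, a minimal sketch of a submission script along those lines (the job name, resources, and config path are placeholders, not taken from this thread; `--gres` syntax varies by site):

```bash
#!/bin/bash
#SBATCH --job-name=terratorch-fit   # placeholder values throughout
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --time=02:00:00

# Make all four GPUs visible before calling terratorch, as described above;
# trainer.devices stays "auto" in the yaml config.
export CUDA_VISIBLE_DEVICES=0,1,2,3

terratorch fit --config config.yaml
```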
I just experimented with 2 nodes, and it seems to have worked. I have 4 GPUs per node, and this was amongst the printouts:

    0: ----------------------------------------------------------------------------------------------------
    0: distributed_backend=nccl
    0: All distributed processes registered. Starting with 8 processes
    0: ----------------------------------------------------------------------------------------------------
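A hedged sketch of what the corresponding multi-node submission could look like (2 nodes x 4 GPUs = 8 processes, as in the printout above; `--trainer.num_nodes` is the usual LightningCLI way to pass the node count and is assumed, not confirmed, to be accepted by terratorch here):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4   # one task per GPU
#SBATCH --gres=gpu:4
#SBATCH --time=02:00:00

export CUDA_VISIBLE_DEVICES=0,1,2,3

# srun starts one process per task; Lightning's SLURM detection should then
# register all 2 x 4 = 8 distributed processes over NCCL.
srun terratorch fit --config config.yaml --trainer.num_nodes 2
```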
@takaomoriyama can you please verify and repeat the scale-out and scale-up tests you did on CCC?
@romeokienzler Sure!
Samy from FZJ managed to run TT on JUWELS using Slurm.
@takaomoriyama has access to JUWELS now, implementing...
related to #146
Created a batch script for Slurm: https://github.com/IBM/terratorch/pull/234. The current results of the sen1floods11_vit workload:
| Nodes x GPUs per node | Execution time / error |
|---|---|
| 1x1 | 175m32.861s |
| 1x2 | 87m59.467s |
| 1x4 | 46m28.186s |
| 2x1 | Network error: TCPStore() (1/2 clients has joined) |
| 2x2 | 48m45.038s |
| 4x4 | Network error: TCPStore() (4/16 clients has joined) |
| 8x4 | Error: No training batch |
| 16x4 | Error: No training batch |
| 32x4 | Network error: TCPStore() (80/128 clients has joined) |
So far, scaling up to 4 GPUs works, but we are hitting two issues: an intermittent network error (TCPStore()) and a "no training batch" error.
@MinasMayth is helping to get a contact at FZJ to re-run with more data...
@takaomoriyama to re-run (because of outage)
Here are the reasons for the errors in the table above.

[No training batch error]

    8x4  - Error: No training batch
    16x4 - Error: No training batch

These errors occurred because there were not enough batches for all the tasks. The Sen1floods11 dataset contains 252 files for training, and the default batch_size is 16, so we get 252 / 16 ≈ 16 batches. With more than 16 tasks, some tasks receive no batch at all. I adjusted the batch size so that every task receives at least one batch; the results are shown in the next comment.
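As a quick sanity check of that arithmetic (a throwaway sketch; the 8x4 task count is just an example):

```bash
FILES=252          # Sen1floods11 training files
BATCH_SIZE=16      # default batch_size
TASKS=$((8 * 4))   # e.g. 8 nodes x 4 GPUs = 32 ranks

BATCHES=$(( (FILES + BATCH_SIZE - 1) / BATCH_SIZE ))   # ceil(252/16) = 16
echo "batches=${BATCHES} tasks=${TASKS}"
if [ "${TASKS}" -gt "${BATCHES}" ]; then
  echo "some tasks get no batch -> reduce batch_size or use fewer tasks"
fi
```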
[TCPStore() error] This still occurs intermittently.
I have not found anyone else who has been running terratorch at JSC except for @Foxigod. We are unsure how exactly the nodes communicate, i.e. whether it is based on some node being recognized as a sort of "root node" or not. If the underlying method does not account for the InfiniBand islands, that could explain the TCPStore() error (suggested by Eli).
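One common workaround for this kind of rendezvous failure on SLURM, offered here only as an assumption and not something verified on JUWELS in this thread, is to pin the rendezvous endpoint to the first node of the allocation before launching, so every rank connects to the same TCPStore (the port is arbitrary; on systems with separate InfiniBand hostnames the address may need further adjustment):

```bash
# Hedged sketch: pin the rendezvous endpoint explicitly.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500   # arbitrary free port

srun terratorch fit --config config.yaml
```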
Results for the sen1floods11_vit workload with batch_size 1 (`terratorch --data.init_args.batch_size 1`), run on the JUWELS cluster at FZJ:
| Nodes x GPUs per node | #batches/task | Execution time | Speed-up |
|---|---|---|---|
| 1x1 | 252 | 87m17.542s | 1.00x |
| 1x2 | 126 | 50m49.549s | 1.72x |
| 1x4 | 63 | 29m34.590s | 2.95x |
| 2x1 | 126 | 59m43.249s | 1.46x |
| 2x2 | 63 | 30m38.232s | 2.85x |
| 2x4 | 32 | 19m43.904s | 4.42x |
| 4x1 | 63 | 35m14.097s | 2.48x |
| 4x4 | 16 | 13m25.957s | 6.50x |
| 8x4 | 8 | 11m11.514s | 7.80x |
| 16x4 | 4 | 9m27.094s | 9.24x |
| 32x4 | 2 | 9m53.797s | 8.82x |
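For reference, the speed-up column is the 1x1 execution time divided by the run's execution time; a quick check against the 4x4 row, using only numbers from the table above:

```bash
base=$(echo "87*60 + 17.542" | bc -l)   # 1x1 baseline: 87m17.542s
t=$(echo "13*60 + 25.957" | bc -l)      # 4x4 run: 13m25.957s
echo "scale=3; ${base} / ${t}" | bc -l  # prints 6.498, i.e. the 6.50x in the table
```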
@takaomoriyama to test with a larger dataset, other than sen1floods11_vit.
Please ask @blumenstiel or @paolofraccaro - they most probably have other datasets, e.g., Major TOM or similar.
**Is your feature request related to a problem? Please describe.**
TT is not usable via SLURM/PBS.

**Describe the solution you'd like**
Allow/test usage in PBS/Slurm.

Reported by @biancazadrozny.