[Question]: Am I under-utilizing my A100 box at basecalling?

ymcki commented 1 year ago

What happened?

I have an A100x4 DGX box at work. I am running workflow 1.5.2, passing "-b 256" to dorado for a fixed batch size and setting basecaller_chunk_size to one fourth of the number of fast5/pod5 files. I found that while the GPUs are running at 100% utilization, their temperature is only around 51C to 57C and their power draw is only 194W to 252W.

I was expecting that if a GPU is fully utilized, the temperature should be around 70-80C and power draw should be very close to 300W. Am I under-utilizing my GPU? Is there anything I can do to make it run faster? Thanks a lot in advance.
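
For readers who want to reproduce the chunk size described above, here is a minimal sketch of the arithmetic. It assumes the goal is one chunk per GPU and rounds up so no file is left over; the rounding choice and the helper names `count_signal_files` and `chunk_size_for` are illustrative and not part of the workflow.

```python
import math
from pathlib import Path


def count_signal_files(directory):
    """Count .fast5 and .pod5 files under a directory (recursively)."""
    root = Path(directory)
    return sum(1 for p in root.rglob("*") if p.suffix in {".fast5", ".pod5"})


def chunk_size_for(n_files, n_chunks=4):
    """Files per chunk so that n_files splits into at most n_chunks chunks.

    Rounding up is an assumption made here; the post only says
    "one fourth of the number of fast5/pod5".
    """
    if n_files <= 0 or n_chunks <= 0:
        raise ValueError("n_files and n_chunks must be positive")
    return math.ceil(n_files / n_chunks)


# Example with a made-up file count: 1001 files over 4 chunks.
print(chunk_size_for(1001, 4))
```

The resulting number would then be passed as basecaller_chunk_size.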

Operating System

ubuntu 20.04

Workflow Execution

Command line

Workflow Execution - EPI2ME Labs Versions

No response

Workflow Execution - CLI Execution Profile

None

Workflow Version

1.5.2

Relevant log output

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:01:00.0 Off |                    0 |
| N/A   57C    P0   194W / 500W |  71501MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:41:00.0 Off |                    0 |
| N/A   51C    P0   252W / 500W |  71501MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  Off  | 00000000:81:00.0 Off |                    0 |
| N/A   57C    P0   209W / 500W |  71501MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   52C    P0   246W / 500W |  71501MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   4051786      C   dorado                          17874MiB |
|    0   N/A  N/A   4051803      C   dorado                          17874MiB |
|    0   N/A  N/A   4051891      C   dorado                          17874MiB |
|    0   N/A  N/A   4052005      C   dorado                          17874MiB |
|    1   N/A  N/A   4051786      C   dorado                          17874MiB |
|    1   N/A  N/A   4051803      C   dorado                          17874MiB |
|    1   N/A  N/A   4051891      C   dorado                          17874MiB |
|    1   N/A  N/A   4052005      C   dorado                          17874MiB |
|    2   N/A  N/A   4051786      C   dorado                          17874MiB |
|    2   N/A  N/A   4051803      C   dorado                          17874MiB |
|    2   N/A  N/A   4051891      C   dorado                          17874MiB |
|    2   N/A  N/A   4052005      C   dorado                          17874MiB |
|    3   N/A  N/A   4051786      C   dorado                          17874MiB |
|    3   N/A  N/A   4051803      C   dorado                          17874MiB |
|    3   N/A  N/A   4051891      C   dorado                          17874MiB |
|    3   N/A  N/A   4052005      C   dorado                          17874MiB |
+-----------------------------------------------------------------------------+
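
The figures in the log above can be cross-checked with a little arithmetic. The sketch below hard-codes values copied from that nvidia-smi output (it does not query any GPU), and the variable names are illustrative. It shows that four dorado processes share each GPU, that their memory accounts for nearly all of the reported usage, and how far each GPU's power draw sits below the 500W cap shown in the table.

```python
# Values copied by hand from the nvidia-smi output above.
POWER_CAP_W = 500
power_draw_w = {0: 194, 1: 252, 2: 209, 3: 246}
reported_mem_used_mib = 71501  # same on every GPU in the log
dorado_pids = [4051786, 4051803, 4051891, 4052005]
per_process_mib = 17874  # each PID's usage on each GPU


def power_fraction(draw_w, cap_w=POWER_CAP_W):
    """Power draw as a fraction of the reported cap."""
    return draw_w / cap_w


# Memory held by the four dorado processes on one GPU.
process_mem_mib = len(dorado_pids) * per_process_mib

for gpu, draw in sorted(power_draw_w.items()):
    print(f"GPU {gpu}: {draw} W is {power_fraction(draw):.0%} of the {POWER_CAP_W} W cap")

print(f"dorado processes per GPU: {len(dorado_pids)}")
print(f"memory held by dorado: {process_mem_mib} MiB of {reported_mem_used_mib} MiB used")
```

So each GPU is time-shared by four separate dorado processes, and the power draw is between roughly 39% and 50% of the cap even though GPU-Util reads 100%.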

SamStudio8 commented 1 year ago

Hi @ymcki! It's possible that your basecalling processes are bottlenecked by IO depending on where the data is coming from. There is also an overhead for modbase calling. There are some ongoing discussions about throughput over on the Dorado repository that you may find useful: https://github.com/nanoporetech/dorado/issues?q=performance.

ymcki commented 1 year ago

Thanks for your reply. I added the following to nextflow.config

process {
    withLabel:gpu {
        maxForks = 1
    }
}

and then increased basecaller_chunk_size and the batch size passed to dorado until it filled up my VRAM. Now it is running at around 70C and about 30% faster.
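
To compare temperature and power draw before and after a change like this, the readings can be logged rather than read off the nvidia-smi table. The following is a minimal sketch, not part of the workflow: it relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi, and the function names `parse_gpu_csv` and `query_gpus` are made up for this example. It assumes every field is numeric; a GPU that reports `[N/A]` for a field would need extra handling.

```python
import csv
import io
import subprocess

# Query fields understood by `nvidia-smi --query-gpu`.
FIELDS = ["index", "temperature.gpu", "power.draw", "power.limit", "utilization.gpu"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        values = dict(zip(FIELDS, record))
        rows.append({
            "index": int(values["index"]),
            "temp_c": float(values["temperature.gpu"]),
            "power_w": float(values["power.draw"]),
            "cap_w": float(values["power.limit"]),
            "util_pct": float(values["utilization.gpu"]),
        })
    return rows


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU (needs NVIDIA drivers)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Demonstration on a hand-written sample line mirroring GPU 0 in the log above.
sample = "0, 57, 194.00, 500.00, 100\n"
for gpu in parse_gpu_csv(sample):
    share = gpu["power_w"] / gpu["cap_w"]
    print(f"GPU {gpu['index']}: {gpu['temp_c']:.0f}C, {gpu['power_w']:.0f}W ({share:.0%} of cap)")
```

On the basecalling machine, calling `query_gpus()` in a loop with a sleep between calls would give a simple time series to compare across runs.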

epi2me-labs / wf-human-variation