AmitMY opened this issue 6 months ago
Hi Amit,
That's a good question. I don't know that anyone has tested Sockeye with that many factors.
One hypothesis would be that the factor code contains cases of switching between Python/C++/GPU execution, and that looping over more factors leads to a greater slowdown. @fhieber may have more information about decoding with factors.
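As a rough illustration of that hypothesis (the sizes and names below are made up, not taken from Sockeye), each additional target factor adds another small output-layer call inside the per-step decoding loop, and every call pays its own Python/dispatch overhead even though the matmul itself is tiny:

import torch

# Hypothetical sizes: a 512-dim decoder output and 8 small factor vocabularies.
decoder_out = torch.randn(1, 512)
factor_output_layers = [torch.nn.Linear(512, 100) for _ in range(8)]

# Each iteration is a separate Python call into the PyTorch dispatcher (plus a
# separate kernel launch on GPU), so per-call overhead can dominate when the
# individual matrices are this small and the loop runs once per decoding step.
factor_logits = [layer(decoder_out) for layer in factor_output_layers]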
For profiling, you could take a look at the PyTorch Profiler.
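A minimal sketch of what that could look like (translate_fn below is a placeholder for whatever call runs the decoding you want to measure, not an actual Sockeye function):

from torch.profiler import profile, ProfilerActivity

# Record both CPU-side and CUDA activity around the decoding call.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    translate_fn()  # placeholder for the decoding call being measured

# Print a summary table and export a trace viewable in chrome://tracing.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")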
Best, Michael
Thanks!
One possible improvement I see is, instead of this: https://github.com/awslabs/sockeye/blob/main/sockeye/model.py#L665C13-L665C79
to run the multiplications in parallel:
# Launch each factor output layer as an asynchronous task, then gather the results
# (torch.jit.fork may only run tasks in parallel when executed under TorchScript).
futures = [torch.jit.fork(fol, decoder_out) for fol in self.factor_output_layers]
outputs += [torch.jit.wait(fut) for fut in futures]
Also, as a side note: in decoding, it seems like target factors are not embedded (https://github.com/awslabs/sockeye/blob/main/sockeye/model.py#L654). Am I missing something?
With the --use-cpu flag, we get:
[INFO:main] Processed 1 lines. Total time: 1.6748, sec/sent: 1.6748, sent/sec: 0.5971
Compared to an A100 GPU:
[INFO:main] Processed 1 lines. Total time: 29.1466, sec/sent: 29.1466, sent/sec: 0.0343
Since it seems like the CPU time is huge, here is the CPU timing:
Self CPU time total: 18.575s
Self CUDA time total: 27.326ms
Profile output:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
forward 96.79% 17.978s 97.03% 18.023s 667.534ms 0.000us 0.00% 12.625ms 467.593us 27
aten::linear 0.06% 11.700ms 2.21% 410.557ms 896.413us 0.000us 0.00% 13.302ms 29.044us 458
aten::matmul 0.01% 1.582ms 2.09% 388.146ms 1.578ms 0.000us 0.00% 7.126ms 28.967us 246
aten::mm 0.05% 9.657ms 2.08% 385.787ms 1.568ms 6.661ms 24.38% 7.126ms 28.967us 246
cudaFree 2.01% 373.293ms 2.01% 373.293ms 186.647ms 112.000us 0.41% 112.000us 56.000us 2
aten::repeat_interleave 0.03% 5.207ms 0.17% 32.192ms 185.011us 398.000us 1.46% 6.604ms 37.954us 174
cudaLaunchKernel 0.13% 24.834ms 0.13% 24.834ms 8.611us 1.284ms 4.70% 1.284ms 0.445us 2884
aten::dropout 0.12% 22.784ms 0.12% 22.784ms 69.463us 0.000us 0.00% 0.000us 0.000us 328
aten::layer_norm 0.08% 15.266ms 0.11% 20.798ms 127.595us 0.000us 0.00% 1.614ms 9.902us 163
aten::slice 0.08% 14.190ms 0.08% 14.244ms 11.840us 0.000us 0.00% 0.000us 0.000us 1203
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Here is a profile file, to be opened in chrome://tracing
trace.json
With torch 2.3.0, on GPU:
[INFO:main] Processed 1 lines. Total time: 2.8488, sec/sent: 2.8488, sent/sec: 0.3510
On CPU:
[INFO:main] Processed 1 lines. Total time: 1.6967, sec/sent: 1.6967, sent/sec: 0.5894
Why is Sockeye restricted to torch 1?
The torch version in Sockeye's requirements.txt (currently torch>=1.10.0,<1.14.0) indicates the latest version of PyTorch that Sockeye is officially tested with. If you change the line to just torch, you can test Sockeye with the current version of PyTorch.
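For example, a sketch of the relaxed entry in requirements.txt (everything else left unchanged):

# replace the pinned range (torch>=1.10.0,<1.14.0) with an unpinned entry
torch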
Best, Michael
My use case calls for splitting my input tokens into 5 parts and my output tokens into 8. That means the input has a token + 4 factors (SignWriting), and the output has a token + 7 factors (VQ model).
I created factored files for an example sentence:
M|c0|r0|p518|p518 S2ff|c0|r0|p482|p483
I then attempt to translate with:
And the output is:
Why would translating a single sentence, with an A100 GPU, on a small model, without beam search, be this slow? Is there a way to profile the decoding step function?
The full output is:
Besides the fact that the output repeats the same token over and over, it is in the expected format.