Open thistlillo opened 2 years ago
The Python version of `save_net_to_onnx_file` just calls the corresponding C++ function. The only difference is that in PyEDDL 1.3.0 the `seq_len` argument was not exposed to the Python interface, but that has no effect on the function's behavior when the argument is not used. You should report this to the EDDL team. You can also try again with PyEDDL 1.3.1 (note it's just been released, so Docker images and Conda packages are not available yet) and see if setting `seq_len` makes any difference.
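As a rough sketch of that suggestion, the helper below passes `seq_len` to an export function only when the installed version exposes it, and fails loudly otherwise. `export_onnx` is a hypothetical helper, not part of PyEDDL; `save_fn` stands in for `eddl.save_net_to_onnx_file`, and the assumption that 1.3.1 accepts a `seq_len` keyword comes from the comment above and is not verified here.

```python
import inspect


def export_onnx(save_fn, net, path, seq_len=0):
    """Export via save_fn, forwarding seq_len only if save_fn accepts it.

    Hypothetical helper: save_fn is expected to look like
    eddl.save_net_to_onnx_file(net, path) or
    eddl.save_net_to_onnx_file(net, path, seq_len=...).
    """
    accepts_seq_len = "seq_len" in inspect.signature(save_fn).parameters
    if seq_len > 0:
        if not accepts_seq_len:
            # Older interface (e.g. the 1.3.0 wrapper shown in the traceback).
            raise RuntimeError(
                "export function does not accept seq_len; "
                "a newer PyEDDL may be needed"
            )
        return save_fn(net, path, seq_len=seq_len)
    # seq_len not requested: same call as in the original script.
    return save_fn(net, path)


# Possible usage (assumption: 20 matches the reported output sequence length):
# from pyeddl import eddl
# export_onnx(eddl.save_net_to_onnx_file, net, "img2text.onnx", seq_len=20)
```

Raising instead of silently dropping `seq_len` makes it obvious when the export ran through the older interface.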
Thanks, @simleo. I will open an issue on the EDDL site.
I am experiencing a strange issue when I export to ONNX networks that include a recurrent layer: sometimes the export fails. I was partially able to replicate the issue by modifying the text_generation example on the PyEDDL website. You may find the script at the end of this message. In all cases (at least with the actual UC5 code), if the first export to ONNX succeeds, then it never fails again until the end of the training.
Why partially able? With my actual code (UC5) I do not get any segmentation fault when I change the `eddl_cs_mem` parameter to `low_mem`. With the modified `text_generation.py`:
- with `full_mem` or `mid_mem`, the export fails only on some runs (see the logs below);
- with `low_mem`, I get a `segmentation fault` with messages that may differ between two consecutive runs:
[...] Recurrent net output sequence length=20 Segmentation fault (core dumped)
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
model
[CUT DUE TO GITHUB LIMITS]
Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600
Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.315 metric=0.271] 1.8908 secs/batch 3.7816 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
model
[CUT DUE TO GITHUB LIMITS]
Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600
Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.347 metric=0.257] 1.7958 secs/batch 3.5917 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
model
[CUT DUE TO GITHUB LIMITS]
Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600
Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.336 metric=0.242] 1.7833 secs/batch 3.5667 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
model
[CUT DUE TO GITHUB LIMITS]
Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600
Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.357 metric=0.243] 1.9022 secs/batch 3.8044 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
model
[CUT DUE TO GITHUB LIMITS]
Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600
Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.343 metric=0.254] 1.7141 secs/batch 3.4281 secs/epoch about to export
⚠️ Error exporting the merge layer concat1. To export this model you need to provide the 'seq_len' argument with a value higher than 0 in the export function. (ONNX::ExportNet) ⚠️
Traceback (most recent call last):
  File "text_generation.py", line 84, in <module>
    main(parser.parse_args(sys.argv[1:]))
  File "text_generation.py", line 71, in main
    eddl.save_net_to_onnx_file(net, "img2text.onnx")
  File "/root/miniconda3/envs/eddl2/lib/python3.8/site-packages/pyeddl/eddl.py", line 2894, in save_net_to_onnx_file
    return _eddl.save_net_to_onnx_file(net, path)
RuntimeError: RuntimeError: ONNX::ExportNet
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with mid memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
model
[CUT DUE TO GITHUB LIMITS]
Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600
Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with mid memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.347 metric=0.246] 1.7054 secs/batch 3.4108 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with mid memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
model
[CUT DUE TO GITHUB LIMITS]
Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600
Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with mid memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.345 metric=0.272] 1.8341 secs/batch 3.6682 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with mid memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
model
[CUT DUE TO GITHUB LIMITS]
Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600
Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with mid memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.327 metric=0.240] 1.9775 secs/batch 3.9549 secs/epoch about to export
⚠️ Error exporting the merge layer concat1. To export this model you need to provide the 'seq_len' argument with a value higher than 0 in the export function. (ONNX::ExportNet) ⚠️
Traceback (most recent call last):
  File "text_generation.py", line 84, in <module>
    main(parser.parse_args(sys.argv[1:]))
  File "text_generation.py", line 71, in main
    eddl.save_net_to_onnx_file(net, "img2text.onnx")
  File "/root/miniconda3/envs/eddl2/lib/python3.8/site-packages/pyeddl/eddl.py", line 2894, in save_net_to_onnx_file
    return _eddl.save_net_to_onnx_file(net, path)
RuntimeError: RuntimeError: ONNX::ExportNet
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api#
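Since the failure is intermittent, it may help to run the script repeatedly and count the outcomes rather than read the logs by eye. The following is a minimal sketch under stated assumptions: `classify` and `tally` are hypothetical helpers, the `ONNX::ExportNet` marker is taken from the traceback above, and the segmentation fault is detected through the negative return code that `subprocess` reports on POSIX when a child is killed by a signal.

```python
import signal
import subprocess
from collections import Counter


def classify(returncode, output):
    """Map one run of the script to a coarse outcome label."""
    if returncode == 0:
        return "ok"
    if returncode == -signal.SIGSEGV:
        # On POSIX, subprocess reports death by signal N as returncode -N.
        return "segfault"
    if "ONNX::ExportNet" in output:
        return "export_error"
    return "other_error"


def tally(cmd, runs=10):
    """Run cmd several times and count how often each outcome occurs."""
    counts = Counter()
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # the traceback goes to stderr
            text=True,
        )
        counts[classify(proc.returncode, proc.stdout)] += 1
    return counts


# Possible usage, with the command line taken from the logs above:
# print(tally(["python", "text_generation.py", "--gpu"], runs=10))
```

The resulting counts per memory setting would give the EDDL team a failure rate to compare against when they try to reproduce the problem.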
Python:
nVidia/CUDA:
Libraries:
Test run on a Linux pod running on the OpenDeepHealth platform: