I am running into a strange issue when exporting networks that include a recurrent layer to ONNX: the export sometimes fails. I was able to partially reproduce it by modifying the text_generation example from the PyEDDL website; the script is at the end of this message. In all cases (at least with the actual UC5 code), if the first export to ONNX succeeds, it never fails for the rest of the training.

Why only partially? With my actual code (UC5) I do not get any segmentation fault when I change the `eddl_cs_mem` parameter to `low_mem`. With the modified text_generation.py, instead:

- the export error occurs with `full_mem` or `mid_mem`;
- with `low_mem`, the process crashes, with messages that may differ between two consecutive runs:
```
[...]
Recurrent net output sequence length=20
munmap_chunk(): invalid pointer
Aborted (core dumped)
```

or `[...] Recurrent net output sequence length=20 Segmentation fault (core dumped)`
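Since the `low_mem` crash produces no Python traceback, one way to see which Python call was active is the standard-library `faulthandler` module, which prints the Python stack when the process receives a fatal signal such as SIGSEGV or SIGABRT. A minimal sketch, assuming it is added near the top of text_generation.py (the helper name `enable_crash_traceback` is made up for illustration):

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Print the Python stack of all threads if the process gets a fatal signal."""
    # faulthandler writes through a raw file descriptor; default to stderr.
    faulthandler.enable(
        file=stream if stream is not None else sys.stderr,
        all_threads=True,
    )


if __name__ == "__main__":
    enable_crash_traceback()
    # ... the rest of the script (argument parsing, main(...)) would follow here
```

The same effect is available without editing the script by running `python -X faulthandler text_generation.py --gpu --mem low_mem`.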

Have a look at the logs below. They are the output of five consecutive runs with `eddl_cs_mem=full_mem` and three consecutive runs with `eddl_cs_mem=mid_mem`, all launched with the `--gpu` flag and without touching the Python code in between. Once the export fails, it keeps failing for a while, then it works again.

### FIVE FOR "FULL MEM"

**FULL MEM, FIRST:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.315 metric=0.271] 1.8908 secs/batch 3.7816 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done

**FULL MEM, SECOND:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.347 metric=0.257] 1.7958 secs/batch 3.5917 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done


**FULL MEM, THIRD:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.336 metric=0.242] 1.7833 secs/batch 3.5667 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done

**FULL MEM, FOURTH:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.357 metric=0.243] 1.9022 secs/batch 3.8044 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done


**FULL MEM, FIFTH:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.343 metric=0.254] 1.7141 secs/batch 3.4281 secs/epoch about to export

⚠️ Error exporting the merge layer concat1. To export this model you need to provide the 'seq_len' argument with a value higher than 0 in the export function. (ONNX::ExportNet) ⚠️

Traceback (most recent call last): File "text_generation.py", line 84, in main(parser.parse_args(sys.argv[1:])) File "text_generation.py", line 71, in main eddl.save_net_to_onnx_file(net, "img2text.onnx") File "/root/miniconda3/envs/eddl2/lib/python3.8/site-packages/pyeddl/eddl.py", line 2894, in save_net_to_onnx_file return _eddl.save_net_to_onnx_file(net, path) RuntimeError: RuntimeError: ONNX::ExportNet
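The message asks for a `seq_len` argument, but the traceback shows that the `save_net_to_onnx_file` wrapper in this pyeddl version forwards only `net` and `path`. Whether other pyeddl versions accept `seq_len` is an assumption here, not something the logs confirm. A hedged sketch of a guard that checks the installed wrapper's signature before passing the argument (the helper name `accepts_keyword` is hypothetical):

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    for param in params.values():
        # A **kwargs parameter accepts any keyword.
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            return True
    param = params.get(name)
    return param is not None and param.kind is not inspect.Parameter.POSITIONAL_ONLY


# Hypothetical use in text_generation.py (assumption: a seq_len keyword may
# exist in some pyeddl versions; it does not in the wrapper shown above):
#
# if accepts_keyword(eddl.save_net_to_onnx_file, "seq_len"):
#     eddl.save_net_to_onnx_file(net, "img2text.onnx", seq_len=olength)
# else:
#     eddl.save_net_to_onnx_file(net, "img2text.onnx")
```

In the script this would guard the export call, passing `seq_len=olength` (20) only when the wrapper declares it.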

### THREE FOR "MID MEM"

With `mid_mem` the behaviour is the same: the first two runs are fine, the third fails.

**MID_MEM, FIRST:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with mid memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with mid memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.347 metric=0.246] 1.7054 secs/batch 3.4108 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done


**MID_MEM, SECOND:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with mid memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with mid memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.345 metric=0.272] 1.8341 secs/batch 3.6682 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done

**MID_MEM, THIRD:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with mid memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with mid memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.327 metric=0.240] 1.9775 secs/batch 3.9549 secs/epoch about to export

⚠️ Error exporting the merge layer concat1. To export this model you need to provide the 'seq_len' argument with a value higher than 0 in the export function. (ONNX::ExportNet) ⚠️
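The runs above were launched by hand. To measure how often the export fails, a small driver can repeat the run and tally exit codes. This is only a sketch: the run count of 10 is arbitrary, and the command line assumes the modified script is saved as text_generation.py in the current directory.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs):
    """Run cmd `runs` times and return the list of exit codes."""
    codes = []
    for i in range(runs):
        # Capture the output so that only the per-run summary is printed.
        result = subprocess.run(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        codes.append(result.returncode)
        print(f"run {i + 1}: exit code {result.returncode}")
    return codes


if __name__ == "__main__":
    cmd = [sys.executable, "text_generation.py", "--gpu", "--mem", "full_mem"]
    codes = run_repeatedly(cmd, 10)
    failures = sum(code != 0 for code in codes)
    print(f"{failures} of {len(codes)} runs failed")
```

On POSIX systems a negative exit code means the process was killed by a signal, which distinguishes the `low_mem` crashes from the Python `RuntimeError` raised by the failed export.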


### SCRIPT (MODIFIED TEXT GENERATION)

```python
"""\
Text generation (modified).
"""

import argparse
import sys

import pyeddl.eddl as eddl
from pyeddl.tensor import Tensor
import numpy as np

MEM_CHOICES = ("low_mem", "mid_mem", "full_mem")

def main(args):
    epochs = 1
    olength = 20
    outvs = 2000
    embdim = 32

    # True: remove last layers and set new top = flatten
    # new input_size: [3, 256, 256] (from [224, 224, 3])
    net = eddl.download_resnet18(True, [3, 256, 256])
    lreshape = eddl.getLayer(net, "top")
    dense_layer = eddl.HeUniform(eddl.Dense(lreshape, 20, name="out_dense"))
    cnn_out = eddl.Sigmoid(dense_layer, name="cnn_out")
    concat = eddl.Concat([lreshape, cnn_out], name="cnn_concat")

    # create a new model from input output
    image_in = eddl.getLayer(net, "input")

    # Decoder
    ldecin = eddl.Input([outvs])
    ldec = eddl.ReduceArgMax(ldecin, [0])
    ldec = eddl.RandomUniform(
        eddl.Embedding(ldec, outvs, 1, embdim, True), -0.05, 0.05
    )

    ldec = eddl.Concat([ldec, concat])
    layer = eddl.LSTM(ldec, 512, True)
    out = eddl.Softmax(eddl.Dense(layer, outvs), name="out_cnn")

    eddl.setDecoder(ldecin)
    net = eddl.Model([image_in], [out])

    # Build model
    eddl.build(
        net,
        eddl.adam(0.01),
        ["softmax_cross_entropy"],
        ["accuracy"],
        eddl.CS_GPU(mem=args.mem) if args.gpu else eddl.CS_CPU(mem=args.mem)
    )
    eddl.summary(net)

    # Load dataset
    x_train = Tensor.randn([48, 256, 256, 3])  # Tensor.load("flickr_trX.bin", "bin")
    y_train = Tensor.fromarray( np.random.randint(0,2,(48,20)) )

    xtrain = Tensor.permute(x_train, [0, 3, 1, 2])
    y_train = Tensor.onehot(y_train, outvs)
    # batch x timesteps x input_dim
    y_train.reshape_([y_train.shape[0], olength, outvs])

    eddl.fit(net, [xtrain], [y_train], args.batch_size, epochs)
    # eddl.save(net, "img2text.bin", "bin")
    print("about to export")
    # the export call below is the one that fails intermittently
    eddl.save_net_to_onnx_file(net, "img2text.onnx")

    print("All done")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--batch-size", type=int, metavar="INT", default=24)
    parser.add_argument("--gpu", action="store_true")
    parser.add_argument("--small", action="store_true")
    # crashes with a segfault on low_mem
    parser.add_argument("--mem", metavar="|".join(MEM_CHOICES),
                        choices=MEM_CHOICES, default="full_mem")
    main(parser.parse_args(sys.argv[1:]))
```

Python:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python --version
Python 3.8.6

nVidia/CUDA:

NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6

Libraries:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# conda env export | grep eddl
name: eddl2
  - eddl-cudnn=1.1b0=h476a1fd_0
  - pyeddl-cudnn=1.3.0=py38hf64f055_0
prefix: /root/miniconda3/envs/eddl2
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# conda env export | grep ecvl
  - ecvl-cudnn=1.0.3=py38h65a929d_0
  - pyecvl-cudnn=1.3.0=py38hf64f055_0

The tests were run on a Linux pod on the OpenDeepHealth platform:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# cat /etc/os-release 
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

deephealthproject / pyeddl

Export to onnx "randomly fails" #78
