deephealthproject / pyeddl

Python wrapper for the EDDL library.
MIT License
Export to onnx "randomly fails" #78

Open thistlillo opened 2 years ago

thistlillo commented 2 years ago

I am experiencing a strange issue when exporting networks that include a recurrent layer to ONNX: the export sometimes fails. I was able to partially replicate the issue by modifying the text_generation example from the PyEDDL website; the script is at the end of this message. In all cases (at least with the actual UC5 code), if the first export to ONNX succeeds, it never fails for the rest of the training.

Why only partially? With my actual code (UC5) I do not get a segmentation fault when I change the `eddl_cs_mem` parameter to `low_mem`. With the modified text_generation.py I do:

[...] Recurrent net output sequence length=20 Segmentation fault (core dumped)

Have a look at the following logs. They correspond to the output of five and three consecutive executions with the flag `--gpu` of the script without touching the Python code for, respectively, `eddl_cs_mem=full_mem` and `eddl_cs_mem=mid_mem`. After it fails, it keeps failing for a while, then it runs fine again.

### FIVE FOR "FULL MEM"

**FULL MEM, FIRST:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.315 metric=0.271] 1.8908 secs/batch 3.7816 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done

**FULL MEM, SECOND:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.347 metric=0.257] 1.7958 secs/batch 3.5917 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done


**FULL MEM, THIRD:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.336 metric=0.242] 1.7833 secs/batch 3.5667 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done

**FULL MEM, FOURTH:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.357 metric=0.243] 1.9022 secs/batch 3.8044 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done


**FULL MEM, FIFTH:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with full memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with full memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.343 metric=0.254] 1.7141 secs/batch 3.4281 secs/epoch about to export

⚠️ Error exporting the merge layer concat1. To export this model you need to provide the 'seq_len' argument with a value higher than 0 in the export function. (ONNX::ExportNet) ⚠️

Traceback (most recent call last):
  File "text_generation.py", line 84, in <module>
    main(parser.parse_args(sys.argv[1:]))
  File "text_generation.py", line 71, in main
    eddl.save_net_to_onnx_file(net, "img2text.onnx")
  File "/root/miniconda3/envs/eddl2/lib/python3.8/site-packages/pyeddl/eddl.py", line 2894, in save_net_to_onnx_file
    return _eddl.save_net_to_onnx_file(net, path)
RuntimeError: RuntimeError: ONNX::ExportNet
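
Since this failure surfaces in Python as a `RuntimeError`, a training script could guard the export so that a failed ONNX export does not lose the trained weights. The following is only a sketch: `export_or_fallback` is a hypothetical helper, not part of PyEDDL, and it does nothing for the segmentation fault seen with `low_mem`, which kills the process before Python can react.

```python
def export_or_fallback(export_fn, fallback_fn):
    """Try the ONNX export; on RuntimeError, run the fallback instead.

    Returns True if the export succeeded, False if the fallback was used.
    """
    try:
        export_fn()
        return True
    except RuntimeError as exc:
        print(f"ONNX export failed ({exc}); using fallback save")
        fallback_fn()
        return False


# Hypothetical usage inside main(), after eddl.fit(...):
# export_or_fallback(
#     lambda: eddl.save_net_to_onnx_file(net, "img2text.onnx"),
#     lambda: eddl.save(net, "img2text.bin", "bin"),
# )
```

The fallback shown in the usage comment is the `eddl.save` call that is commented out in the script at the end of this message.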

### THREE FOR "MID MEM"

With MID_MEM the behaviour is similar: the first two runs are OK, the third fails.

**MID_MEM, FIRST:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with mid memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with mid memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.347 metric=0.246] 1.7054 secs/batch 3.4108 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done


**MID_MEM, SECOND:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with mid memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with mid memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.345 metric=0.272] 1.8341 secs/batch 3.6682 secs/epoch about to export [ONNX::Export] Warning: The LSTM layer LSTM1 has mask_zeros=true. This attribute is not supported in ONNX, so the model exported will not have this attribute. All done

**MID_MEM, THIRD:**

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python text_generation.py --gpu Downloading resnet18.onnx resnet18.onnx ✓ Import ONNX... Generating Random Table removing resnetv15_dense0_fwd Warning: output layer has been removed CS with mid memory setup Building model Selecting GPU device 0 EDDL is running on GPU device 0, Tesla V100-SXM2-32GB CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB

model

[CUT DUE TO GITHUB LIMITS]

Total params: 14496868 Trainable params: 14487268 Non-trainable params: 9600

Vec2Seq 1 to 20 Recurrent net output sequence length=20 CS with mid memory setup Building model without initialization Unroll on device Recurrent net output sequence length=20 1 epochs of 2 batches of size 24 Epoch 1 [██████████████████████████████████████████████████] 2 out_cnn[loss=4.327 metric=0.240] 1.9775 secs/batch 3.9549 secs/epoch about to export

⚠️ Error exporting the merge layer concat1. To export this model you need to provide the 'seq_len' argument with a value higher than 0 in the export function. (ONNX::ExportNet) ⚠️

Traceback (most recent call last):
  File "text_generation.py", line 84, in <module>
    main(parser.parse_args(sys.argv[1:]))
  File "text_generation.py", line 71, in main
    eddl.save_net_to_onnx_file(net, "img2text.onnx")
  File "/root/miniconda3/envs/eddl2/lib/python3.8/site-packages/pyeddl/eddl.py", line 2894, in save_net_to_onnx_file
    return _eddl.save_net_to_onnx_file(net, path)
RuntimeError: RuntimeError: ONNX::ExportNet
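
The run-to-run pattern in these logs could also be collected automatically by launching the script several times and tallying how each run ended. This is a minimal sketch: `tally_runs` is a hypothetical helper, not part of PyEDDL. It relies on `subprocess` reporting a negative return code when the child is killed by a signal on POSIX (for example 11 for SIGSEGV).

```python
import subprocess
import sys
from collections import Counter


def tally_runs(cmd, runs):
    """Run `cmd` `runs` times and count how each run ended."""
    outcomes = Counter()
    for _ in range(runs):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            outcomes["ok"] += 1
        elif proc.returncode < 0:
            # On POSIX, a negative code means the process was killed by a signal.
            outcomes[f"signal {-proc.returncode}"] += 1
        else:
            outcomes[f"exit {proc.returncode}"] += 1
    return outcomes


# Hypothetical usage for the script in this issue:
# print(tally_runs([sys.executable, "text_generation.py", "--gpu", "--mem", "full_mem"], 5))
```

A failed export would be counted under a non-zero exit code, while a segmentation fault would be counted under its signal number.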


# SCRIPT (MOD. TEXT GENERATION)
```python

"""\
Text generation (modified).
"""

import argparse
import sys

import pyeddl.eddl as eddl
from pyeddl.tensor import Tensor
import numpy as np

MEM_CHOICES = ("low_mem", "mid_mem", "full_mem")

def main(args):
    epochs = 1
    olength = 20
    outvs = 2000
    embdim = 32

    # True: remove last layers and set new top = flatten
    # new input_size: [3, 256, 256] (from [224, 224, 3])
    net = eddl.download_resnet18(True, [3, 256, 256])
    lreshape = eddl.getLayer(net, "top")
    dense_layer = eddl.HeUniform(eddl.Dense(lreshape, 20, name="out_dense"))
    cnn_out = eddl.Sigmoid(dense_layer, name="cnn_out")
    concat = eddl.Concat([lreshape, cnn_out], name="cnn_concat")

    # create a new model from input output
    image_in = eddl.getLayer(net, "input")

    # Decoder
    ldecin = eddl.Input([outvs])
    ldec = eddl.ReduceArgMax(ldecin, [0])
    ldec = eddl.RandomUniform(
        eddl.Embedding(ldec, outvs, 1, embdim, True), -0.05, 0.05
    )

    ldec = eddl.Concat([ldec, concat])
    layer = eddl.LSTM(ldec, 512, True)
    out = eddl.Softmax(eddl.Dense(layer, outvs), name="out_cnn")

    eddl.setDecoder(ldecin)
    net = eddl.Model([image_in], [out])

    # Build model
    eddl.build(
        net,
        eddl.adam(0.01),
        ["softmax_cross_entropy"],
        ["accuracy"],
        eddl.CS_GPU(mem=args.mem) if args.gpu else eddl.CS_CPU(mem=args.mem)
    )
    eddl.summary(net)

    # Load dataset
    x_train = Tensor.randn([48, 256, 256, 3])  # Tensor.load("flickr_trX.bin", "bin")
    y_train = Tensor.fromarray( np.random.randint(0,2,(48,20)) )

    xtrain = Tensor.permute(x_train, [0, 3, 1, 2])
    y_train = Tensor.onehot(y_train, outvs)
    # batch x timesteps x input_dim
    y_train.reshape_([y_train.shape[0], olength, outvs])

    eddl.fit(net, [xtrain], [y_train], args.batch_size, epochs)
    # eddl.save(net, "img2text.bin", "bin")
    print("about to export")
    # the export below is the call that intermittently fails
    eddl.save_net_to_onnx_file(net, "img2text.onnx")

    print("All done")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--batch-size", type=int, metavar="INT", default=24)
    parser.add_argument("--gpu", action="store_true")
    parser.add_argument("--small", action="store_true")
    # crashes with a segfault on low_mem
    parser.add_argument("--mem", metavar="|".join(MEM_CHOICES),
                        choices=MEM_CHOICES, default="full_mem")
    main(parser.parse_args(sys.argv[1:]))
```

Python:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# python --version
Python 3.8.6

nVidia/CUDA:

NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6 

Libraries:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# conda env export | grep eddl
name: eddl2
  - eddl-cudnn=1.1b0=h476a1fd_0
  - pyeddl-cudnn=1.3.0=py38hf64f055_0
prefix: /root/miniconda3/envs/eddl2
(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# conda env export | grep ecvl
  - ecvl-cudnn=1.0.3=py38h65a929d_0
  - pyecvl-cudnn=1.3.0=py38hf64f055_0

Test run on a linux pod running on the OpenDeepHealth platform:

(eddl2) root@uc5-pipeline-pylibs-cudnn-79586ff6b7-6zh95:/mnt/datasets/uc5/UC5_last/src/example-api# cat /etc/os-release 
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

simleo commented 2 years ago

The Python version of save_net_to_onnx_file just calls the corresponding C++ function. The only difference is that in PyEDDL 1.3.0 the seq_len argument was not exposed to the Python interface, but that has no effect on the function's behavior when the argument is not used. You should report this to the EDDL team. You can also try again with PyEDDL 1.3.1 (note it's just been released, so Docker images and Conda packages are not available yet) and see if setting seq_len makes any difference.
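
To try the suggestion above with a script that has to run against both versions, the export call could pass `seq_len` only when the installed wrapper accepts it. This is a sketch under assumptions: `save_onnx` is a hypothetical helper, and the keyword name `seq_len` is taken from the error message and the comment above, not from a checked 1.3.1 signature.

```python
import inspect


def save_onnx(save_fn, net, path, seq_len=0):
    """Call save_fn, passing seq_len only if its signature accepts it."""
    accepts_seq_len = "seq_len" in inspect.signature(save_fn).parameters
    if accepts_seq_len and seq_len > 0:
        return save_fn(net, path, seq_len=seq_len)
    # Older wrappers (or seq_len <= 0): fall back to the two-argument call.
    return save_fn(net, path)


# Hypothetical usage with the script from this issue (olength = 20):
# save_onnx(eddl.save_net_to_onnx_file, net, "img2text.onnx", seq_len=20)
```

Whether setting `seq_len` actually avoids the intermittent error is exactly what the test would need to establish.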

thistlillo commented 2 years ago

Thanks, @simleo. I will open an issue on the EDDL site.