eembc / mlmark

EEMBC's Machine-Learning Inference Benchmark targeted at edge devices.
https://www.eembc.org/mlmark

TensorRT ResNet50 Segfaults with Tesla T4 #4

Closed by petertorelli 5 years ago

petertorelli commented 5 years ago

A user reports that MLMark abruptly segfaults when running the TensorRT target on an x86 system with a Tesla T4, with no other warning messages given. See the log below.

-INFO- --------------------------------------------------------------------------------
-INFO- Welcome to the EEMBC MLMark(tm) Benchmark!
-INFO- --------------------------------------------------------------------------------
-INFO- MLMark Version       : 1.0.0
-INFO- Python Version       : 3.7
-INFO- CPU Name             : GenuineIntel Intel(R) Xeon(R) Platinum 8176 CPU @ 2.10GHz
-INFO- Total Memory (MiB)   : 127571
-INFO- # of Logical CPUs    : 112
-INFO- Instruction Set      : x86_64
-INFO- OS Platform          : Linux-4.4.0-131-generic-x86_64-with-debian-stretch-sid
-INFO- --------------------------------------------------------------------------------
-INFO- Models in this release:
-INFO-     resnet50       : ResNet-50 v1.0 [ILSVRC2012]
-INFO-     mobilenet      : MobileNet v1.0 [ILSVRC2012]
-INFO-     ssdmobilenet   : SSD-MobileNet v1.0 [COCO2017]
-INFO- --------------------------------------------------------------------------------
-INFO- Parsing config file config/trt-gpu-resnet50-fp32-throughput.json
-INFO- Task: Target 'tensorrt', Workload 'resnet50'
-INFO-     batch                : 1
-INFO-     concurrency          : 1
-INFO-     hardware             : gpu
-INFO-     iterations           : 1024
-INFO-     mode                 : throughput
-INFO-     precision            : fp32
failed to parse uff model
Entered in engine building part
Segmentation fault (core dumped)
petertorelli commented 5 years ago

The recommended configuration is TF1.13.1, TRT5.1.2, CUDA10.0, and version 410 of the driver, although issues are still reported with it.

Deferred until TRT6 target is released in 1.0.x.

petertorelli commented 5 years ago

This appears to be related to these lines in each model's Net.py file, which load the library:

        resnetnet_lib=os.path.join(TRT_DIR,"cpp_environment","libs","libclass_resnet50.so")
        self.lib = cdll.LoadLibrary(resnetnet_lib)
        self.obj = self.lib.return_object()

Adding this line (prior to the self.lib.return_object() call):

        self.lib.return_object.restype = ctypes.c_ulonglong

Fixes the problem on the target system. By default, ctypes assumes a foreign function returns a C int, so the pointer returned by return_object() was being truncated. However, declaring the return type as c_ulonglong might introduce compatibility problems on other platforms; a pointer type that matches the OS/arch still needs to be investigated.
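
To illustrate the truncation and one possible pointer-sized alternative, here is a minimal sketch. It assumes ctypes.c_void_p (which is as wide as a native pointer on the running platform) would be an acceptable restype; whether the trt-restype branch adopted it is not confirmed here. FAKE_ADDRESS and declare_pointer_return are hypothetical names for illustration and are not part of MLMark.

```python
import ctypes

# Hypothetical 64-bit address standing in for the pointer that
# return_object() hands back; it is not taken from the actual crash.
FAKE_ADDRESS = 0x00007F3A12345678

# With no restype set, ctypes treats the return value as a C int.
# ctypes integer types do no overflow checking, so constructing a
# c_int from the address reproduces the loss of the upper bits.
truncated = ctypes.c_int(FAKE_ADDRESS).value

# c_void_p is as wide as a native pointer on the running platform.
pointer_bytes = ctypes.sizeof(ctypes.c_void_p)
preserved = ctypes.c_void_p(FAKE_ADDRESS).value


def declare_pointer_return(lib):
    """Mark lib.return_object as returning a pointer (assumed helper, not in MLMark)."""
    lib.return_object.restype = ctypes.c_void_p
    return lib


print(hex(FAKE_ADDRESS), "->", hex(truncated & 0xFFFFFFFF), "when read as c_int")
print("pointer width:", pointer_bytes, "bytes; c_void_p keeps", hex(preserved))
```

With restype set to c_void_p, ctypes returns the address as a Python int (or None for a NULL pointer). If that handle is later passed back into the library, the receiving functions probably need argtypes declared with c_void_p as well, since plain Python ints are passed as C int by default.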

petertorelli commented 5 years ago

New branch trt-restype in progress.

petertorelli commented 5 years ago

The latest two merges (#7 and #8) solve T4-related problems on non-Jetpack OSes.