A few questions about running TNN on HiSilicon and about bfp16

lq0104 commented 3 years ago

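Has TNN been specifically optimized for the ARM cores on HiSilicon chips such as the hisi3519? To get the best possible speed on the HiSilicon ARM side, should I use the compile options from #73? That is: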

add_definitions( -mfpu=neon-vfpv4 -mfloat-abi=softfp )

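Are any other compile options needed? In my tests so far on the hisi3519, with mobilenetv2 and with models I trained myself, TNN is slower than ncnn, so I wonder whether I have missed some acceleration option.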

Also, does bfp16 in TNN correspond to bf16 in ncnn, or to fp16 in ncnn? I'm not clear on this. I'd appreciate an answer, thanks!

seanxcwang commented 3 years ago

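Both armv8 and armv7 are optimized, but not for any specific platform; for example, big and little cores run the same code.
Being a little slower than ncnn is possible, since performance can differ from model to model, but if it is much slower then something is definitely wrong.
bfp16 corresponds to ncnn's bf16 and may help somewhat on little cores; fp16 likewise corresponds to ncnn's fp16.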
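To make the bf16 / fp16 distinction concrete, here is a minimal Python sketch. It assumes that "bf16" means the common bfloat16 layout (the top 16 bits of an IEEE float32) and that "fp16" means IEEE half precision; the function names are made up for illustration and are not TNN or ncnn APIs, and the truncating conversion is only one possible rounding choice.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of the float32 encoding (truncation, no rounding)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def bf16_bits_to_float(bits16: int) -> float:
    """Expand bfloat16 bits back to a float by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return x


def float_to_fp16_bits(x: float) -> int:
    """Encode as IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    (bits16,) = struct.unpack("<H", struct.pack("<e", x))
    return bits16


def fp16_bits_to_float(bits16: int) -> float:
    """Decode IEEE half-precision bits back to a float."""
    (x,) = struct.unpack("<e", struct.pack("<H", bits16 & 0xFFFF))
    return x


if __name__ == "__main__":
    value = 0.1
    as_bf16 = bf16_bits_to_float(float_to_bf16_bits(value))
    as_fp16 = fp16_bits_to_float(float_to_fp16_bits(value))
    # bf16 keeps float32's exponent range but has fewer mantissa bits than fp16.
    print(f"bf16 error: {abs(as_bf16 - value):.3e}")
    print(f"fp16 error: {abs(as_fp16 - value):.3e}")
```

The trade-off this illustrates: bf16 converts to and from float32 with a plain shift and keeps float32's range, while fp16 is more precise for small values but overflows much earlier.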

lq0104 commented 3 years ago

Thank you for the answer. I benchmarked ncnn and TNN on the hisi3519. The ncnn version was downloaded on 2020-09-30 and the TNN version on 2021-01-20. For ncnn I used benchncnn from https://github.com/Tencent/ncnn/tree/master/benchmark , and for TNN I used TNNTest. Both were built with arm-hisiv500-linux. Part of the build configuration:

SHARED_LIB="ON"
ARM="ON"
OPENMP="ON"
OPENCL="OFF"
CC=/opt/hisi-linux/x86-arm/arm-hisiv500-linux/target/bin/arm-hisiv500-linux-gcc
CXX=/opt/hisi-linux/x86-arm/arm-hisiv500-linux/target/bin/arm-hisiv500-linux-g++
TARGET_ARCH=arm

cmake ${TNN_ROOT_PATH} \
    -DCMAKE_SYSTEM_NAME=Linux \
    -DTNN_TEST_ENABLE=ON \
    -DTNN_ARM_ENABLE=ON \
    -DTNN_BENCHMARK_MODE=ON \
    -DDEBUG=OFF \
    -DCMAKE_C_COMPILER=$CC \
    -DCMAKE_CXX_COMPILER=$CXX \
    -DCMAKE_BUILD_TYPE=Release \
    -DTNN_ARM_ENABLE:BOOL=$ARM \
    -DTNN_OPENMP_ENABLE:BOOL=$OPENMP \
    -DTNN_OPENCL_ENABLE:BOOL=$OPENCL \
    -DCMAKE_SYSTEM_PROCESSOR=$TARGET_ARCH \
    -DTNN_BUILD_SHARED:BOOL=$SHARED_LIB

In addition, I added the statement add_definitions( -mfloat-abi=softfp -mfpu=neon-vfpv4 ) at line 7 of https://github.com/Tencent/TNN/blob/master/source/tnn/device/arm/CMakeLists.txt . The models are some of the ncnn models under https://github.com/Tencent/ncnn/tree/master/benchmark . The results so far:

model	ncnn th=1	ncnn th=1 bf16	ncnn th=4	ncnn th=4 bf16	tnn th=1	tnn th=1 bf16	tnn th=4	tnn th=4 bf16
squeezenet	258.85	246.45	212.63	206.88	315.016	268.05	425.011	376.259
mobilenet	420.75	392.04	338.61	310.34	468.893	367.621	516.631	422.582
mobilenet_v2	266.56	239.28	235.35	207.83	309.186	243.415	407.731	359.327
mobilenet_v3	218.04	196.95	311.14	180.81	305.233	-	415.109	-
shufflenet	146.78	138.04	316.31	303.59	171.7	-	209.58	-
shufflenet_v2	127.81	132.12	246.12	189.06	154.715	-	232.304	-
mobilenet_yolo	1900.78	2031.13	1522.74	1691.88	2123.806	1756.881	2048.268	1680.989
mobilenetv2_yolov3	938.73	856.57	767.25	690.93	1074.311	853.854	1156.992	960.54

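Some operators are not supported by TNN's bf16, so those entries have no result. So far, whether with 1 thread or 4 threads, and with bf16 on or off, ncnn is generally faster.

The ncnn command-line parameters were: loop_count = 4, num_threads = 1(or 4), powersave = 0, gpu_device = -1, cooling_down = 1

The TNN command line was always: ./TNNTest -mp ncnn/xxx.param -wc 8 -ic 8 -mt NCNN -th 1(or 4) -pr LOW(if bf16 enable)

Could you help me work out where the difference might come from? Thanks!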
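For readers who want to quantify the gap, the following Python sketch computes two ratios from the numbers reported in the table: TNN relative to ncnn at one thread, and each framework's 4-thread result relative to its 1-thread result. The values are copied from the post (they appear to be per-inference times, so lower is assumed to be better); the `ratio` helper is hypothetical and only for illustration.

```python
COLUMNS = [
    "ncnn th=1", "ncnn th=1 bf16", "ncnn th=4", "ncnn th=4 bf16",
    "tnn th=1", "tnn th=1 bf16", "tnn th=4", "tnn th=4 bf16",
]

# None marks entries that were not reported (unsupported bf16 operators in TNN).
RESULTS = {
    "squeezenet": [258.85, 246.45, 212.63, 206.88, 315.016, 268.05, 425.011, 376.259],
    "mobilenet": [420.75, 392.04, 338.61, 310.34, 468.893, 367.621, 516.631, 422.582],
    "mobilenet_v2": [266.56, 239.28, 235.35, 207.83, 309.186, 243.415, 407.731, 359.327],
    "mobilenet_v3": [218.04, 196.95, 311.14, 180.81, 305.233, None, 415.109, None],
    "shufflenet": [146.78, 138.04, 316.31, 303.59, 171.7, None, 209.58, None],
    "shufflenet_v2": [127.81, 132.12, 246.12, 189.06, 154.715, None, 232.304, None],
    "mobilenet_yolo": [1900.78, 2031.13, 1522.74, 1691.88, 2123.806, 1756.881, 2048.268, 1680.989],
    "mobilenetv2_yolov3": [938.73, 856.57, 767.25, 690.93, 1074.311, 853.854, 1156.992, 960.54],
}


def ratio(model, numerator, denominator):
    """Return numerator / denominator for one model, or None if either is missing."""
    row = dict(zip(COLUMNS, RESULTS[model]))
    a, b = row[numerator], row[denominator]
    if a is None or b is None:
        return None
    return a / b


def fmt(value):
    """Format a ratio for printing, showing '-' for missing values."""
    return "-" if value is None else f"{value:.2f}"


if __name__ == "__main__":
    print(f"{'model':20s} tnn/ncnn(th=1)  tnn th4/th1  ncnn th4/th1")
    for model in RESULTS:
        print(
            f"{model:20s} "
            f"{fmt(ratio(model, 'tnn th=1', 'ncnn th=1')):>14s}  "
            f"{fmt(ratio(model, 'tnn th=4', 'tnn th=1')):>11s}  "
            f"{fmt(ratio(model, 'ncnn th=4', 'ncnn th=1')):>12s}"
        )
```

A 4-thread / 1-thread ratio above 1 means four threads were slower than one. In the reported numbers that holds for every TNN model except mobilenet_yolo, and also for ncnn on mobilenet_v3, shufflenet and shufflenet_v2.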

zchrissirhcz commented 3 years ago

Have you considered Huawei's bolt framework? https://github.com/huawei-noah/bolt

monkeyking commented 3 years ago

With a gap this large, I'd suggest just going straight to ncnn.

BUG1989 commented 3 years ago

Have you considered Huawei's mindspore framework? https://github.com/mindspore-ai/mindspore

lq0104 commented 3 years ago

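Thanks for the recommendations, everyone. I actually find it very convenient that TNN can use ncnn models directly, which saves a lot of detours and pitfalls later on; it would be even better if it were faster. I took a first look at both Huawei frameworks this afternoon. A question for 圈圈虫: do you know how well mindspore currently supports armv7? Their homepage and GitHub page mention armv8, but v7 hardly seems to be mentioned.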

tpoisonooo commented 3 years ago

Have you considered mge from Face++? It's fast (and if it isn't, I'll fix it for you myself). https://github.com/MegEngine/MegEngine

lq0104 commented 3 years ago

> Have you considered mge from Face++? It's fast (and if it isn't, I'll fix it for you myself). https://github.com/MegEngine/MegEngine

(@ο@) Wow, okay 白座, I'll go study it first.

quinnrong94 commented 3 years ago

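@lq0104 We compared ncnn and TNN on a low-end 32-bit phone (quad-core A7 CPU). Taking mobilenet_v2 as an example, the single-thread times differ by about 5%, and for both frameworks the four-thread time is about 60% of the single-thread time. As for four threads being slower than one thread in your table, a possible cause is that the hisi3519 has a big/little core architecture: a single thread may run on a big core, whereas with multiple threads part of the computation runs on little cores, which lowers overall performance.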
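One way to test the big/little-core hypothesis is to pin the benchmark to the big cores and repeat the single-thread and multi-thread runs. The Python sketch below is an illustration under stated assumptions: it is Linux-only, it guesses that the "big" cores are the ones with the highest `cpuinfo_max_freq` in sysfs (a heuristic that may not hold on every board), and the helper names are made up here; none of this is part of TNN or ncnn.

```python
import glob
import os


def read_max_freqs(sysfs_root="/sys/devices/system/cpu"):
    """Map CPU index -> max frequency in kHz, skipping CPUs without cpufreq info."""
    freqs = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "cpuinfo_max_freq")
    for path in glob.glob(pattern):
        cpu_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        try:
            with open(path) as handle:
                freqs[int(cpu_dir[len("cpu"):])] = int(handle.read().strip())
        except (OSError, ValueError):
            continue
    return freqs


def pick_big_cores(freqs):
    """Heuristic: treat the cores that share the highest max frequency as big cores."""
    if not freqs:
        return set()
    top = max(freqs.values())
    return {cpu for cpu, freq in freqs.items() if freq == top}


def pin_to_big_cores():
    """Restrict the calling thread to the big cores; returns the chosen set."""
    big = pick_big_cores(read_max_freqs())
    if big and hasattr(os, "sched_setaffinity"):
        # pid 0 means the caller; child processes started afterwards inherit the mask.
        os.sched_setaffinity(0, big)
    return big
```

A small wrapper script could call `pin_to_big_cores()` and then launch the benchmark binary with `subprocess.run`. If the multi-thread slowdown disappears once the thread count is limited to the number of big cores, that would support the explanation above.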

Tencent / TNN

A few questions about running TNN on HiSilicon and about bfp16 #761