The design and optimization of API Benchmark

Program = feed + abs + fetch

Profile data


------------------------->     Profiling Report     <-------------------------

Place: All; Time unit: ms; Sorted by total time in descending order in the same thread

| Event | Calls | Total | CPU Time (Ratio) | GPU Time (Ratio) | Min. | Max. | Ave. | Ratio. |
|---|---|---|---|---|---|---|---|---|
| thread0::GpuMemcpySync:GPU->CPU | 10 | 65.3952 | 39.898246 (0.610110) | 25.496945 (0.389890) | 6.46307 | 6.75515 | 6.53952 | 0.434411 |
| thread0::fetch | 10 | 42.5865 | 37.686449 (0.884939) | 4.900031 (0.115061) | 4.15256 | 4.90003 | 4.25865 | 0.282896 |
| thread0::TensorCopySync:GPU->CPU | 10 | 41.8076 | 37.494616 (0.896837) | 4.313012 (0.103163) | 4.13309 | 4.31301 | 4.18076 | 0.277722 |
| thread0::abs | 10 | 0.6688 | 0.450827 (0.674083) | 0.217973 (0.325917) | 0.052712 | 0.134069 | 0.06688 | 0.00444274 |
| thread0::feed | 10 | 0.079468 | 0.064448 (0.810993) | 0.015020 (0.189007) | 0.005744 | 0.01502 | 0.0079468 | 0.000527895 |
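The `Ave.` and `Ratio.` columns of the report can be recomputed from `Calls` and `Total`. The sketch below is only a sanity check on the numbers above. It assumes that `Ratio.` is each event's total divided by the sum of all listed totals, which matches the printed values but is not confirmed by the profiler source here. The names `ROWS` and `recompute` are hypothetical.

```python
# Rows copied from the profiling report: (event, calls, total_ms).
ROWS = [
    ("thread0::GpuMemcpySync:GPU->CPU", 10, 65.3952),
    ("thread0::fetch", 10, 42.5865),
    ("thread0::TensorCopySync:GPU->CPU", 10, 41.8076),
    ("thread0::abs", 10, 0.6688),
    ("thread0::feed", 10, 0.079468),
]


def recompute(rows):
    """Recompute the per-call average and the share of the summed total."""
    # Assumption: "Ratio." = total of one event / sum of all listed totals.
    grand_total = sum(total for _, _, total in rows)
    return {
        event: {"ave_ms": total / calls, "ratio": total / grand_total}
        for event, calls, total in rows
    }


DERIVED = recompute(ROWS)

if __name__ == "__main__":
    for event, values in DERIVED.items():
        print(f"{event:36s} ave={values['ave_ms']:.6f} ms ratio={values['ratio']:.6f}")
```

Under this reading, `abs` itself accounts for well under 1% of the summed time, while the three GPU->CPU copy and fetch events account for almost all of it.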

{ name: "abs", device: "GPU", precision: { stable: "True", diff: 0.00000 }, speed: { repeat: 10, start: 1, end: 9, total: 5.08994, feed: 0.00000, compute: 0.00000, fetch: 0.00000 } }



- The CPU->GPU transfer of the feed data already starts when the feed data is set in the Executor; it does not happen inside the feed op.
![image](https://user-images.githubusercontent.com/12538138/69705220-0ea4b880-1130-11ea-8417-a4bd8661283b.png)

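- The GPU->CPU transfer of the fetch data happens inside the fetch op. After the GPU operation in the bottom row finishes, the cuda_api layer still runs for a long time.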
![image](https://user-images.githubusercontent.com/12538138/69705308-401d8400-1130-11ea-9f67-3eac557acd1d.png)
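The second observation can be cross-checked against the profiling report by splitting the fetch-side events into CPU and GPU time per call. This is a rough sketch: it assumes the profiler's "CPU Time" column covers the host-side work seen in the cuda_api row of the timeline, which the report does not state explicitly. `COPY_EVENTS` and `per_call_split` are hypothetical names.

```python
# (event, calls, cpu_ms, gpu_ms), copied from the profiling report in this issue.
COPY_EVENTS = [
    ("thread0::fetch", 10, 37.686449, 4.900031),
    ("thread0::TensorCopySync:GPU->CPU", 10, 37.494616, 4.313012),
]


def per_call_split(events):
    """Return CPU and GPU time per call, plus the CPU/GPU time ratio."""
    result = {}
    for event, calls, cpu_ms, gpu_ms in events:
        result[event] = {
            "cpu_ms_per_call": cpu_ms / calls,
            "gpu_ms_per_call": gpu_ms / calls,
            "cpu_over_gpu": cpu_ms / gpu_ms,
        }
    return result


SPLIT = per_call_split(COPY_EVENTS)

if __name__ == "__main__":
    for event, values in SPLIT.items():
        print(
            f"{event}: cpu={values['cpu_ms_per_call']:.3f} ms/call, "
            f"gpu={values['gpu_ms_per_call']:.3f} ms/call, "
            f"cpu/gpu={values['cpu_over_gpu']:.2f}"
        )
```

For each fetch call the report attributes several times more CPU time than GPU time, which is consistent with the long cuda_api interval visible after the GPU operation ends.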

PaddlePaddle / benchmark

The design and optimization of API Benchmark #284
