RRZE-HPC / kerncraft

Loop Kernel Analysis and Performance Modeling Toolkit
GNU Affero General Public License v3.0

Wrong data prediction for the code attached in SIM mode #75

Closed christiealappatt closed 6 years ago

christiealappatt commented 6 years ago

The attached code exposes a problem in the calculation of iteration counts. It shows up clearly as a deviation between the ECM and Benchmark mode results. Attachment: wrong_cl.tar.gz

rrzeschorscherl commented 6 years ago

The prediction works for me with Intel 17.0u5, Intel 16.0u4, gcc 6.1.0, and gcc 5.4.0, as long as the compiler can vectorize the code (I have not checked the performance). It fails with clang 4.0 (despite vectorization) because of the weird way this compiler handles the addresses and the loop counter.

christiealappatt commented 6 years ago

In general this works, but it did not work for me with the particular code attached. We analyzed it yesterday and think the problem comes from the cache simulator (it does not happen in LC mode). The predicted cy/CL (bytes/CL) is lower than it should be. The title of this issue was somewhat misleading, so I have edited it.

christiealappatt commented 6 years ago

A similar effect can be seen with this code too. Here the total count of evicts from MEM seems to be a little off. I ran the attached code with the Intel 17.0up03 compiler as: kerncraft -p ECM -m ~/Emmy.yml -D n 2500000 -D s 4 LC_lij_1.c -vvv. At this size all the data for Y and F should come from MEM (also verified in Benchmark mode), so I expect 24 bytes/iteration of memory traffic (2 loads and 1 store), but kerncraft predicts 20 bytes/iteration (2 loads and 0.52 stores). The 0.52 stores figure then increases slowly as n is increased and finally reaches 1 at n ~ 5000000. Attachment: LC_lij_example.tar.gz
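For readers who want to check the arithmetic, here is a minimal sketch of how the per-iteration figures relate to the load and store counts. It assumes 8-byte (double-precision) elements, which is consistent with 24 bytes for three transfers; the function name `bytes_per_iteration` is hypothetical and is not part of kerncraft.

```python
ELEMENT_SIZE = 8  # bytes; assumption: double-precision elements


def bytes_per_iteration(loads, stores, element_size=ELEMENT_SIZE):
    """Memory traffic per loop iteration for the given load/store counts."""
    return (loads + stores) * element_size


# Expected traffic: 2 loads and 1 store per iteration.
expected = bytes_per_iteration(2, 1)
# Traffic implied by the reported prediction: 2 loads and 0.52 stores.
predicted = bytes_per_iteration(2, 0.52)

print(f"expected:  {expected:.2f} bytes/iteration")
print(f"predicted: {predicted:.2f} bytes/iteration")
print(f"missing store traffic: {expected - predicted:.2f} bytes/iteration")
```

Under these assumptions the predicted value works out to about 20.16 bytes/iteration, which matches the roughly 20 bytes/iteration reported, so the whole gap is explained by the undercounted evicts.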

cod3monk commented 6 years ago

This should be resolved with 1721674e0de79f209947118b3511a55e49c28194. I also noticed that a similar issue affected the LC prediction when accesses with constant offsets are present.