Multi2Sim / multi2sim

Multi2Sim source code
GNU General Public License v3.0

Simulation stall with fused architecture #44

Open marcusnchow opened 7 years ago

marcusnchow commented 7 years ago

Multi2sim v5.0 stalls when running the amdsdk benchmarks with the fused config files. This happens when running it with --x86-sim detailed --si-sim detailed.

Sometimes I get the TOO LONG error, but this doesn't always occur.

ttungl commented 7 years ago

Hi mchow24,

What do you mean by "TOO LONG error"? Does it mean the error report is too long, or that the simulation takes too long? If it takes too long, there are two possible reasons. First, a deadlock may have occurred in the interconnection network. Second, the selected benchmark may simply need more time to process, depending on your system's size.

You can check by running m2s for the amdsdk benchmark only (without the architecture); the result should be as below (I ran the BinarySearch benchmark as an example). Command line (run from the BinarySearch benchmark subdirectory):

BinarySearch$ m2s ./BinarySearch --load BinarySearch_Kernels.bin -e

Output:

; Multi2Sim 5.0 - A Simulation Framework for CPU-GPU Heterogeneous Computing
; Please use command 'm2s --help' for a list of command-line options.
; Simulation alpha-numeric ID: rS8un

Platform 0 : Multi2Sim

Sorted Input
0 1 1 1 1 2 2 2 2 2 2 2 3 4 5 5 6 7 8 8 9 9 10 11 11 11 12 13 13 13 13 13 13 14 15 15 15 15 16 17 17 18 19 19 20 21 22 23 23 23 24 24 25 25 26 27 27 28 29 29 29 30 31 31 32 32 33 33 33 33 34 35 35 36 37 38 38 39 39 40 41 42 43 43 44 45 45 45 45 46 47 47 48 48 48 49 50 51 52 52 52 52 52 52 52 52 52 52 53 53 53 53 53 54 55 55 55 56 56 56 57 57 58 59 60 61 61 61 62 63 64 64 64 65 65 65 66 66 67 67 68 69 70 71 71 71 71 72 72 73 73 73 74 75 76 76 77 78 79 79 80 80 81 82 83 84 84 85 85 86 87 87 87 87 87 88 89 89 89 90 91 92 93 94 94 94 94 95 95 96 96 96 96 97 97 97 98 98 99 99 100 101 101 101 101 101 102 103 104 104 104 104 105 106 106 106 106 106 107 108 108 108 109 110 111 112 112 113 114 115 116 117 118 118 118 119 120 120 120 120 121 121 122 123 123 124 125 126 126 127 127 128 129 129 129 130 130 130 130 130 130 130 131 131 131 131 132 133 133 134 135 136 136 137 137 138 139 140 141 142 143 144 144 144 144 145 146 146 147 147 147 147 147 147 147 147 147 147 147 148 149 149 149 149 150 150 150 150 150 151 151 152 153 153 153 154 155 155 156 157 157 157 157 157 157 157 158 159 159 159 159 159 159 160 161 162 162 163 163 163 164 164 164 164 165 166 167 168 169 170 171 172 172 173 173 174 174 174 174 175 175 176 176 176 176 177 177 177 178 178 179 180 181 181 182 182 183 184 184 185 186 187 187 188 188 189 189 190 191 192 192 192 192 193 193 194 195 196 196 196 197 198 199 200 200 201 201 202 202 202 202 203 204 205 206 206 207 208 208 208 209 209 210 210 210 211 212 213 213 214 215 215 216 217 218 219 220 221 222 223 224 224 225 226 226 226 226 227 227 227 227 228 229 230 231 232 233 233 233 234 234 234 234 235 235 236 237 238 239 239 240 241 242 242 243 244 244 245 245 245 246 247 247 248 248 249 249 250 250 250 251 252 253 254 255 256 257 257 258 259 260 260 260 260 260 261 261 262 263 263 263 264 

Selected Platform Vendor : Multi2Sim
Device 0 : Multi2Sim Southern Islands GPU Model
Executing kernel for 1 iterations
-------------------------------------------
l = 14, u = 15, isfound = 1, fm = 5
Passed!

;
; Simulation Statistics Summary
;

[ General ]
RealTime = 0.30 [s]
SimEnd = ContextsFinished

[ x86 ]
RealTime = 0.30 [s]
Instructions = 1028427
InstructionsPerSecond = 3482180

[ SouthernIslands ]
RealTime = 0.01 [s]
Instructions = 128
InstructionsPerSecond = 8775
NDRangeCount = 1
WorkGroupCount = 1
BranchInstructions = 5
LDSInstructions = 0
ScalarALUInstructions = 37
ScalarMemInstructions = 33
VectorALUInstructions = 46
VectorMemInstructions = 7

If you get the result above, the error is probably coming from your architecture configuration, not from the benchmarks themselves. You can also post the exact commands and the specific errors here, so that someone who has dealt with those errors can help you out.

Best,
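
To compare the plain run with a detailed run without watching the terminal for a day, the check above can be wrapped in a small script. This is only a sketch: the helper names `sim_end_reason` and `run_with_timeout` are made up for illustration, and the only thing it relies on from Multi2Sim is the `SimEnd = ...` line of the statistics summary shown above. Standard output and standard error are merged because no assumption is made about which stream the summary is written to.

```python
import re
import subprocess


def sim_end_reason(report):
    """Return the value of the 'SimEnd = ...' summary line, or None if absent."""
    match = re.search(r"^\s*SimEnd\s*=\s*(\S+)", report, flags=re.MULTILINE)
    return match.group(1) if match else None


def run_with_timeout(cmd, seconds):
    """Run cmd and return (finished_in_time, merged stdout/stderr text)."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge both streams into one report
            text=True,
            timeout=seconds,
        )
        return True, done.stdout
    except subprocess.TimeoutExpired as exc:
        # Partial output may be bytes or str depending on the Python version.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return False, partial


# Example use with the command from this thread (run it from the
# BinarySearch benchmark subdirectory; the 600 s limit is arbitrary):
#
#   finished, report = run_with_timeout(
#       ["m2s", "./BinarySearch", "--load", "BinarySearch_Kernels.bin", "-e"],
#       seconds=600,
#   )
#   print(finished, sim_end_reason(report))
```

A run that finishes with `ContextsFinished` completed normally. A run that hits the timeout, or reports no `SimEnd` line at all, is the case worth posting here together with the exact command.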

marcusnchow commented 7 years ago

ttungl,

The TOO LONG error I get occurs when the simulator exits because of a stall in the Southern Islands model. The error comes from src/arch/southern-islands/timing/Timing.cc, and I think it is related to issue #30. This error does not appear every time, and I can't reproduce it at the moment.

The benchmarks work when I run them as you described. My problem appears when I run them in detailed mode, for example:

m2s --si-sim detailed --x86-sim detailed --mem-config mem-config --net-config net-config \
--x86-config x86-config --si-config si-config BinarySearch --load BinarySearch_Kernels.bin

I have run it for a few other benchmarks as well and I don't get any kind of output, the most I get is

; Multi2Sim 5.0 - A Simulation Framework for CPU-GPU Heterogeneous Computing
; Please use command 'm2s --help' for a list of command-line options.
; Simulation alpha-numeric ID: OySmO

So far I have run the benchmarks for about a day. Do I need to give them more time?

Also, the config files I am using all come from samples/fused. Are these files known to work?

Thanks for the help

ttungl commented 7 years ago

First, you can download the latest version for HSA support at this link to make sure your HSA will work properly.

Second, I ran a small test with a 16-core CPU, a 16-compute-unit GPU, and 4 MCs in a 2D-mesh network, as shown below. The result shows that a deadlock occurs during the simulation. This is because I did not specify the routing in the configuration file (net-config). By default, multi2sim uses the Floyd-Warshall algorithm to find the shortest routes from sources to sinks. Those routes are not guaranteed to be deadlock-free, so workloads with heavy concurrent traffic can stall because of deadlock or livelock. In the case below, it is a deadlock. Another possible cause is that the caches or virtual channels are too small; you could try adjusting their sizes.

You can find sample routing settings in Chapter 10. I would recommend XY-routing as a good starting point for fixing this error, since that routing protocol is deadlock-free.
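
To illustrate why the choice of routes matters, here is a minimal sketch of the standard channel dependency graph check: a deterministic routing can deadlock when the links that packets hold and the links they request next form a cycle. Everything in it is an assumption made for illustration (a 2x2 mesh, node id = y * WIDTH + x, the function names); it does not reproduce the routes that multi2sim's Floyd-Warshall step actually computes. It compares XY-routing with another shortest-path routing that breaks ties between equal-length routes clockwise.

```python
from itertools import product

WIDTH = 2  # illustrative 2x2 mesh; node id = y * WIDTH + x


def xy_next_hop(cur, dst):
    """Dimension-order routing: correct the X coordinate first, then Y."""
    cx, cy = cur % WIDTH, cur // WIDTH
    dx, dy = dst % WIDTH, dst // WIDTH
    if cx != dx:
        return cy * WIDTH + (cx + (1 if dx > cx else -1))
    return (cy + (1 if dy > cy else -1)) * WIDTH + cx


# Clockwise order of the four mesh nodes: 0 -> 1 -> 3 -> 2 -> 0.
CLOCKWISE = {0: 1, 1: 3, 3: 2, 2: 0}


def clockwise_tiebreak_next_hop(cur, dst):
    """Shortest-path routing that sends diagonal traffic clockwise."""
    if CLOCKWISE[cur] == dst or CLOCKWISE[CLOCKWISE[cur]] == dst:
        return CLOCKWISE[cur]
    return dst  # dst is the counter-clockwise neighbour: one direct hop


def channel_dependencies(num_nodes, next_hop):
    """Map each link (u, v) to the links a packet holding it may request next."""
    deps = {}
    for src, dst in product(range(num_nodes), repeat=2):
        if src == dst:
            continue
        prev, cur = None, src
        while cur != dst:
            nxt = next_hop(cur, dst)
            link = (cur, nxt)
            deps.setdefault(link, set())
            if prev is not None:
                deps[prev].add(link)
            prev, cur = link, nxt
    return deps


def has_cycle(deps):
    """Depth-first search for a cycle in the dependency graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {link: WHITE for link in deps}

    def visit(link):
        colour[link] = GREY
        for nxt in deps[link]:
            if colour[nxt] == GREY or (colour[nxt] == WHITE and visit(nxt)):
                return True
        colour[link] = BLACK
        return False

    return any(colour[link] == WHITE and visit(link) for link in deps)
```

On this example, `has_cycle(channel_dependencies(4, xy_next_hop))` is false, because packets only ever move from an X link to a Y link. With `clockwise_tiebreak_next_hop` the four diagonal routes chain the links 0->1, 1->3, 3->2 and 2->0 into a cycle, even though every route is still a shortest path.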

Note: when you run the command, make sure to include the paths to the workload files.

Result:

; Multi2Sim 5.0 - A Simulation Framework for CPU-GPU Heterogeneous Computing
            ; Please use command 'm2s --help' for a list of command-line options.
            ; Simulation alpha-numeric ID: fqgiH

            Warning: [x86] Core 0 Thread 0: simulation ended due to a commit stall.

                    The CPU commit stage has not received any instruction for 1M cycles.
                    Most likely, this means that a deadlock condition occurred in the
                    management of some modeled structure (network, cache system, core
                    queues, etc.).

            Core 0 - Thread 0
            =================

            Register file
            -------------

            Integer registers:
                20 occupied, 65 free, 85 total
                Mappings:
                    eax        -> 60
                    ecx        -> 25
                    edx        -> 30
                    ebx        -> 18
                    esp        -> 53
                    ebp        -> 65
                    esi        -> 59
                    edi        -> 44
                    es         -> 76
                    cs         -> 75
                    ss         -> 74
                    ds         -> 73
                    fs         -> 72
                    gs         -> 16
                    zps        -> 20
                    of         -> 20
                    cf         -> 20
                    df         -> 38
                    aux        -> 45
                    aux2       -> 69
                    ea         -> 28
                    data       -> 14
                Free registers: { 48 35 0 56 11 39 63 47 77 41 23 2 78 4 42 50 6 82 3 43 71 8 40 33 84 68 12 24 9 55 7 66 26 62 31 17 58 51 46 49 70 64 32 67 22 10 79 54 5 83 57 81 36 34 13 52 19 61 27 37 29 15 1 80 21 }

            Floating-point registers:
                11 occupied, 32 free, 43 total
                Mappings:
                    st0        -> 42
                    st1        -> 41
                    st2        -> 40
                    st3        -> 39
                    st4        -> 38
                    st5        -> 37
                    st6        -> 36
                    st7        -> 30
                    fpst       -> 34
                    fpcw       -> 33
                    fpaux      -> 32
                Free registers: { 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 35 31 }

            XMM registers:
                9 occupied, 31 free, 40 total
                Mappings:
                    xmm0       -> 39
                    xmm1       -> 38
                    xmm2       -> 37
                    xmm3       -> 36
                    xmm4       -> 35
                    xmm5       -> 34
                    xmm6       -> 33
                    xmm7       -> 32
                    xmm_data   -> 31
                Free registers: { 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 }

            Fetch queue
            -----------

            -Empty-

            Uop queue
            ---------

            -Empty-

            Reorder buffer
            --------------

            -Empty-

            Instruction queue
            -----------------

            -Empty-

            Load queue
            ----------

            -Empty-

            Store queue
            -----------

            -Empty-

            ;
            ; Simulation Statistics Summary
            ;

            [ General ]
            RealTime = 398.06 [s]
            SimEnd = Stall
            SimTime = 791059.78 [ns]
            Frequency = 2400 [MHz]
            Cycles = 1901587

            [ SouthernIslands ]
            RealTime = 0.00 [s]
            Instructions = 0
            InstructionsPerSecond = 0
            NDRangeCount = 0
            WorkGroupCount = 0
            BranchInstructions = 0
            LDSInstructions = 0
            ScalarALUInstructions = 0
            ScalarMemInstructions = 0
            VectorALUInstructions = 0
            VectorMemInstructions = 0
            SimTime = 791060.00 [ns]
            Frequency = 1000 [MHz]
            Cycles = 791060
            CyclesPerSecond = 0

            [ x86 ]
            RealTime = 397.99 [s]
            Instructions = 14991954
            InstructionsPerSecond = 37669
            SimTime = 791060.19 [ns]
            Frequency = 2400 [MHz]
            Cycles = 1901587
            CyclesPerSecond = 4778
            FastForwardInstructions = 0
            CommittedInstructions = 14434642
            CommittedInstructionsPerCycle = 7.591
            CommittedMicroInstructions = 28832908
            CommittedMicroInstructionsPerCycle = 15.16
            BranchPredictionAccuracy = 0.9958

syifan commented 7 years ago

@ttungl It is interesting to know that Floyd-Warshall is not deadlock-free. I do not have enough background on this. Can you point me to a reference on how to implement deadlock-free routing? That may help to solve this bug.

ttungl commented 7 years ago

@syifan Thanks for the discussion. As far as I know, the authors of this paper address the deadlock problem using a dependency graph, especially for customized network topologies. I think this is important because, in the near future, network-on-chip topologies could be irregular, and XY-routing is not a good choice there; instead, multipath routing algorithms will become dominant. The paper may therefore help in figuring out some ways to fix this bug. @mchow24 Sorry for distracting you with this ;)

marcusnchow commented 7 years ago

@ttungl Thanks for the overview. I am new to multi2sim and this helped a lot. From what I can tell from the guide, with XY-routing I will have to manually state the routing path for every node in the config file, correct?

ttungl commented 7 years ago

@mchow24 At this point, the answer to your question is yes. You can do it by hand for a small network; however, as the network scales up, it would be better to write code to do it for you.
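
As a rough idea of what such a script could look like, here is a sketch that computes the XY-routing next hop for every (current node, destination) pair of a 2D mesh. It makes several assumptions: nodes are numbered row by row (id = y * width + x), the names `sw0`, `sw1`, ... are made up, and the `<node>.to.<destination> = <next hop>` line format is assumed rather than confirmed here, so please check it against the routing chapter of the guide. It also covers only the mesh nodes themselves, not any end nodes attached to them.

```python
def xy_route_table(width, height):
    """Return {(cur, dst): next_hop} for XY-routing on a width x height mesh."""

    def node(x, y):
        return y * width + x

    table = {}
    for cy in range(height):
        for cx in range(width):
            for dy in range(height):
                for dx in range(width):
                    if (cx, cy) == (dx, dy):
                        continue
                    if cx != dx:
                        # Move along X until the column matches.
                        nxt = node(cx + (1 if dx > cx else -1), cy)
                    else:
                        # Column already matches: move along Y.
                        nxt = node(cx, cy + (1 if dy > cy else -1))
                    table[(node(cx, cy), node(dx, dy))] = nxt
    return table


def format_routes(table, name="sw{}"):
    """Render entries as '<cur>.to.<dst> = <next>' lines (assumed syntax)."""
    return [
        f"{name.format(cur)}.to.{name.format(dst)} = {name.format(nxt)}"
        for (cur, dst), nxt in sorted(table.items())
    ]


if __name__ == "__main__":
    # 4x4 mesh, i.e. 16 * 15 = 240 routing entries.
    for line in format_routes(xy_route_table(4, 4)):
        print(line)
```

The table is kept separate from the text rendering so that the same routes can be written out in whatever syntax the config file actually requires.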

trinayan commented 7 years ago

Hi guys,

We have noticed this problem as well. First of all, m2s 5.0 does not support a fused architecture yet. Support for it is being added at the moment; the ABI calls need to be modified to get the fused architecture running, along with some other changes. @ttungl: We would be interested in knowing how you are implementing the fused architecture.

We have faced the deadlock issue as well. Currently we work around it by fast-forwarding until the OpenCL ND-range starts running and then switching to the detailed simulator. This works. But as @ttungl suggested, the network protocol might be a problem. I would have to check this issue.

ttungl commented 7 years ago

@trinayan Regarding the implementation, I use the updated version of multi2sim from the hsa_update branch (thanks to @syifan for letting me know). I wrote some scripts to generate the configuration files from customized inputs. As I said, the paper points out the need to avoid deadlock in customized network topologies. Thanks.