ahrefs / ocannl

OCANNL: OCaml Compiles Algorithms for Neural Networks Learning

Make memory reporting for CUDA more meaningful #289

Open lukstafi opened 1 month ago

lukstafi commented 1 month ago

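Using `cu_mem_get_info` to measure memory usage gives results that are not very meaningful: the figure reported for the cuda backend is far larger than for the cc backend, and it shrinks as `parallel` grows, while the cc figure grows.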

For example:

┌─────────────────────────────────────────────────────────────────────────────────────────┬───────────┬───────────────┬───────┬─────────┬───────────────────────────────────────────────────┐
│Benchmarks                                                                               │Time in sec│Memory in bytes│Speedup│Mem gain │init time in sec, min loss, last loss              │
├─────────────────────────────────────────────────────────────────────────────────────────┼───────────┼───────────────┼───────┼─────────┼───────────────────────────────────────────────────┤
│seed 7, inline 0, parallel 1, batch 240, backend cc, val prec single, grad prec single   │0.229846796│187036         │5.306  │18431.763│(0.602457722 62.728876709938049 62.728876709938049)│
│seed 7, inline 0, parallel 1, batch 240, backend cc, val prec half, grad prec half       │0.681410625│93522          │1.790  │36861.950│(0.830183092 62.6259765625 62.6259765625)          │
│seed 7, inline 0, parallel 1, batch 240, backend cuda, val prec single, grad prec single │0.796467672│3447403316     │1.531  │1.000    │(3.596758795 62.728905558586121 62.728905558586121)│
│seed 7, inline 0, parallel 1, batch 240, backend cuda, val prec half, grad prec half     │1.219598776│2061500416     │1.000  │1.672    │(3.974976031 62.93798828125 62.93798828125)        │
│seed 7, inline 3, parallel 1, batch 240, backend cc, val prec single, grad prec single   │0.251448531│187036         │4.850  │18431.763│(0.511715823 62.7288755774498 62.7288755774498)    │
│seed 7, inline 3, parallel 1, batch 240, backend cc, val prec half, grad prec half       │0.63360842 │93522          │1.925  │36861.950│(0.585796587 62.30078125 62.30078125)              │
│seed 7, inline 3, parallel 1, batch 240, backend cuda, val prec single, grad prec single │0.657724256│2210398208     │1.854  │1.560    │(0.996566334 62.728905558586121 62.728905558586121)│
│seed 7, inline 3, parallel 1, batch 240, backend cuda, val prec half, grad prec half     │0.779391164│1088421888     │1.565  │3.167    │(1.305761225 62.2236328125 62.2236328125)          │
│seed 7, inline 0, parallel 3, batch 240, backend cc, val prec single, grad prec single   │0.245330525│571884         │4.971  │6028.151 │(0.808980378 62.153002977371216 62.153002977371216)│
│seed 7, inline 0, parallel 3, batch 240, backend cc, val prec half, grad prec half       │0.459211186│285954         │2.656  │12055.797│(1.063122458 62.41552734375 62.41552734375)        │
│seed 7, inline 0, parallel 3, batch 240, backend cuda, val prec single, grad prec single │0.524303261│1352663040     │2.326  │2.549    │(3.233237763 63.376171588897705 63.376171588897705)│
│seed 7, inline 0, parallel 3, batch 240, backend cuda, val prec half, grad prec half     │0.750559389│612368384      │1.625  │5.630    │(5.178235428 62.83740234375 62.83740234375)        │
│seed 7, inline 3, parallel 3, batch 240, backend cc, val prec single, grad prec single   │0.246047198│571884         │4.957  │6028.151 │(0.72678405 62.152995347976685 62.152995347976685) │
│seed 7, inline 3, parallel 3, batch 240, backend cc, val prec half, grad prec half       │0.446806293│285954         │2.730  │12055.797│(0.838345553 62.47265625 62.47265625)              │
│seed 7, inline 3, parallel 3, batch 240, backend cuda, val prec single, grad prec single │0.558565954│715128832      │2.183  │4.821    │(1.419007865 63.376166462898254 63.376166462898254)│
│seed 7, inline 3, parallel 3, batch 240, backend cuda, val prec half, grad prec half     │0.662616926│341835776      │1.841  │10.085   │(2.182560358 62.17529296875 62.17529296875)        │
│seed 7, inline 0, parallel 6, batch 240, backend cc, val prec single, grad prec single   │0.324366117│1176096        │3.760  │2931.226 │(1.099047585 61.730027139186859 61.730027139186859)│
│seed 7, inline 0, parallel 6, batch 240, backend cc, val prec half, grad prec half       │0.537282895│588072         │2.270  │5862.213 │(1.315531069 62.76953125 62.76953125)              │
│seed 7, inline 0, parallel 6, batch 240, backend cuda, val prec single, grad prec single │0.557164894│580911104      │2.189  │5.934    │(2.652769076 63.376184284687042 63.376184284687042)│
│seed 7, inline 0, parallel 6, batch 240, backend cuda, val prec half, grad prec half     │0.659206927│297795584      │1.850  │11.576   │(4.897720286 62.7421875 62.7421875)                │
│seed 7, inline 3, parallel 6, batch 240, backend cc, val prec single, grad prec single   │0.327492657│1176096        │3.724  │2931.226 │(0.945304816 61.718904912471771 61.718904912471771)│
│seed 7, inline 3, parallel 6, batch 240, backend cc, val prec half, grad prec half       │0.496853717│588072         │2.455  │5862.213 │(1.055382175 60.982421875 60.982421875)            │
│seed 7, inline 3, parallel 6, batch 240, backend cuda, val prec single, grad prec single │0.484854294│337641472      │2.515  │10.210   │(1.661079693 63.376177906990051 63.376177906990051)│
│seed 7, inline 3, parallel 6, batch 240, backend cuda, val prec half, grad prec half     │0.637598667│153092096      │1.913  │22.518   │(2.544604816 62.099609375 62.099609375)            │
│seed 7, inline 0, parallel 12, batch 240, backend cc, val prec single, grad prec single  │0.374894618│2481504        │3.253  │1389.239 │(1.55095354 61.862113118171692 61.862113118171692) │
│seed 7, inline 0, parallel 12, batch 240, backend cc, val prec half, grad prec half      │0.565150972│1240800        │2.158  │2778.371 │(1.795796058 62.04931640625 62.04931640625)        │
│seed 7, inline 0, parallel 12, batch 240, backend cuda, val prec single, grad prec single│0.579294911│276824064      │2.105  │12.453   │(2.876144217 63.376185953617096 63.376185953617096)│
│seed 7, inline 0, parallel 12, batch 240, backend cuda, val prec half, grad prec half    │0.697255179│153092096      │1.749  │22.518   │(4.924106615 62.80078125 62.80078125)              │
│seed 7, inline 3, parallel 12, batch 240, backend cc, val prec single, grad prec single  │0.363463621│2481504        │3.355  │1389.239 │(1.313944785 61.862080454826355 61.862080454826355)│
│seed 7, inline 3, parallel 12, batch 240, backend cc, val prec half, grad prec half      │0.562140134│1240800        │2.170  │2778.371 │(1.499167458 61.90234375 61.90234375)              │
│seed 7, inline 3, parallel 12, batch 240, backend cuda, val prec single, grad prec single│0.596052431│180355072      │2.046  │19.115   │(2.841286029 63.376178562641144 63.376178562641144)│
│seed 7, inline 3, parallel 12, batch 240, backend cuda, val prec half, grad prec half    │0.663990027│67108864       │1.837  │51.370   │(4.696311311 61.94580078125 61.94580078125)        │
│seed 7, inline 0, parallel 16, batch 240, backend cc, val prec single, grad prec single  │0.474191279│3423616        │2.572  │1006.948 │(1.769277872 61.757832944393158 61.757832944393158)│
│seed 7, inline 0, parallel 16, batch 240, backend cc, val prec half, grad prec half      │0.577466576│1711872        │2.112  │2013.821 │(2.305998903 61.90576171875 61.90576171875)        │
│seed 7, inline 0, parallel 16, batch 240, backend cuda, val prec single, grad prec single│0.620073868│186646528      │1.967  │18.470   │(2.423241004 63.376178324222565 63.376178324222565)│
│seed 7, inline 0, parallel 16, batch 240, backend cuda, val prec half, grad prec half    │0.764845615│88080384       │1.595  │39.139   │(4.584916593 62.71826171875 62.71826171875)        │
│seed 7, inline 3, parallel 16, batch 240, backend cc, val prec single, grad prec single  │0.412470958│3423616        │2.957  │1006.948 │(1.665891434 61.757833182811737 61.757833182811737)│
│seed 7, inline 3, parallel 16, batch 240, backend cc, val prec half, grad prec half      │0.59185156 │1711872        │2.061  │2013.821 │(1.89430643 61.8232421875 61.8232421875)           │
│seed 7, inline 3, parallel 16, batch 240, backend cuda, val prec single, grad prec single│0.617704358│109051904      │1.974  │31.613   │(2.188090388 63.376178324222565 63.376178324222565)│
│seed 7, inline 3, parallel 16, batch 240, backend cuda, val prec half, grad prec half    │0.741150168│41943040       │1.646  │82.193   │(3.408914471 61.9169921875 61.9169921875)          │
└─────────────────────────────────────────────────────────────────────────────────────────┴───────────┴───────────────┴───────┴─────────┴───────────────────────────────────────────────────┘
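
One possible direction, sketched here only as an illustration: the free-memory figure behind `cu_mem_get_info` is device-wide, so it also reflects context overhead and whatever else is using the GPU. A backend could instead count the bytes it allocates itself. The `Mem_tracker` module and its functions below are hypothetical names, not part of OCANNL or its CUDA bindings, and the sketch assumes the backend can call a hook around each device allocation and free.

```ocaml
(* Hypothetical allocation tracker; none of these names exist in OCANNL. *)
module Mem_tracker = struct
  type t = { mutable current : int; mutable peak : int }

  let create () = { current = 0; peak = 0 }

  (* To be called right after a successful device allocation. *)
  let on_alloc t ~size_in_bytes =
    t.current <- t.current + size_in_bytes;
    if t.current > t.peak then t.peak <- t.current

  (* To be called right after a device buffer is freed. *)
  let on_free t ~size_in_bytes = t.current <- t.current - size_in_bytes

  let current_bytes t = t.current
  let peak_bytes t = t.peak
end

let () =
  let tracker = Mem_tracker.create () in
  (* Stand-ins for real allocations: 240 single-precision and 240 half-precision values. *)
  Mem_tracker.on_alloc tracker ~size_in_bytes:(240 * 4);
  Mem_tracker.on_alloc tracker ~size_in_bytes:(240 * 2);
  Mem_tracker.on_free tracker ~size_in_bytes:(240 * 4);
  Printf.printf "current: %d bytes, peak: %d bytes\n"
    (Mem_tracker.current_bytes tracker)
    (Mem_tracker.peak_bytes tracker)
```

Reporting `peak_bytes` per benchmark run would make the cuda rows comparable to the cc rows, since both would then count only memory the program itself requested.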