ShifuML / shifu

An end-to-end machine learning and data mining framework on Hadoop
https://github.com/ShifuML/shifu/wiki
Apache License 2.0

Optimize memory usage for distributed Keras training #690

Closed Liu-Delin closed 4 years ago

Liu-Delin commented 4 years ago

Description

  1. Use numpy.float32 instead of the built-in Python float.
    • Each numpy.float32 element stored in an array takes 4 bytes, whereas a Python float object takes 24 bytes (on 64-bit CPython).
  2. Use numpy.array instead of list.
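
A rough way to see the effect of both changes is to compare the footprint of a list of Python floats with a numpy.float32 array holding the same values. The snippet below is only an illustrative sketch, not the code from this change: the helper list_footprint and the sample size n are hypothetical, and the per-object sizes reported by sys.getsizeof assume 64-bit CPython.

```python
import sys

import numpy as np


def list_footprint(values):
    """Approximate bytes held by a list of Python floats.

    Counts the list object itself (header plus one pointer slot per item)
    and every float object it references. Hypothetical helper for illustration.
    """
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


n = 100_000  # arbitrary sample size, chosen only for the demonstration

# Old layout: a Python list where each element is a separate float object.
as_list = [float(i) for i in range(n)]

# New layout: one contiguous buffer of 4-byte floats.
as_array = np.asarray(as_list, dtype=np.float32)

list_bytes = list_footprint(as_list)
array_bytes = as_array.nbytes  # itemsize (4 bytes) times the number of elements

print("list of float   :", list_bytes, "bytes")
print("float32 ndarray :", array_bytes, "bytes")
print("ratio           :", round(list_bytes / array_bytes, 1))
```

Note that numpy.float32 keeps roughly 7 significant decimal digits, so this trade is appropriate only where that precision is enough for the training data.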

Tests

Manually tested shifu train with a 130,000-line file (200+ MB in .gz format).

  1. Use the optimized code with 2GB memory workers.
    2020-01-10 22:26:22 INFO  TensorflowSession:611 - Epoch 0 is finish..
    2020-01-10 22:26:22 INFO  TensorflowSession:536 - Epoch: 0 training error: 4452.271 valid error: 4129.09545 avg training time: 14.126334547999999 avg valid time: 0.06444346904755
    2020-01-10 22:41:59 INFO  TensorflowSession:611 - Epoch 1 is finish..
    2020-01-10 22:41:59 INFO  TensorflowSession:536 - Epoch: 1 training error: 2350.85045 valid error: 2073.3867499999997 avg training time: 12.7803850174 avg valid time: 0.0230430364609
    2020-01-10 22:45:15 WARN  FileTxnLog:334 - fsync-ing the write ahead log in SyncThread:0 took 1005ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
    2020-01-10 22:45:15 WARN  FileTxnLog:334 - fsync-ing the write ahead log in SyncThread:0 took 1005ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
    2020-01-10 22:50:21 INFO  TensorflowSession:611 - Epoch 2 is finish..
    2020-01-10 22:50:21 INFO  TensorflowSession:536 - Epoch: 2 training error: 1320.1459 valid error: 1309.14365 avg training time: 15.283201575300001 avg valid time: 0.01835501194
    2020-01-10 22:58:16 INFO  TensorflowSession:611 - Epoch 3 is finish..
    2020-01-10 22:58:16 INFO  TensorflowSession:536 - Epoch: 3 training error: 804.77055 valid error: 1012.67535 avg training time: 11.38035845758 avg valid time: 0.017961382865900002
    2020-01-10 23:06:41 INFO  TensorflowSession:611 - Epoch 4 is finish..
    2020-01-10 23:06:41 INFO  TensorflowSession:536 - Epoch: 4 training error: 541.38885 valid error: 867.3281549999999 avg training time: 11.648380398745001 avg valid time: 0.0173770189285
    2020-01-10 23:14:48 INFO  TensorflowSession:611 - Epoch 5 is finish..
    2020-01-10 23:14:48 INFO  TensorflowSession:536 - Epoch: 5 training error: 410.887 valid error: 783.0587 avg training time: 14.4866925478 avg valid time: 0.02211344242095
    2020-01-10 23:22:13 INFO  TensorflowSession:611 - Epoch 6 is finish..
    2020-01-10 23:22:13 INFO  TensorflowSession:536 - Epoch: 6 training error: 331.53522499999997 valid error: 726.9208000000001 avg training time: 12.766790509219998 avg valid time: 0.019877552986149998
    2020-01-10 23:30:02 INFO  TensorflowSession:611 - Epoch 7 is finish..
    2020-01-10 23:30:02 INFO  TensorflowSession:536 - Epoch: 7 training error: 276.54769999999996 valid error: 686.0115000000001 avg training time: 11.48632752895 avg valid time: 0.0173213481903
    2020-01-10 23:37:50 INFO  TensorflowSession:611 - Epoch 8 is finish..
    2020-01-10 23:37:50 INFO  TensorflowSession:536 - Epoch: 8 training error: 235.180155 valid error: 653.503015 avg training time: 12.69085681439 avg valid time: 0.01979994773865
  2. Use the old code with 6GB memory workers.
    2020-01-11 00:02:21 INFO  TensorflowSession:611 - Epoch 0 is finish..
    2020-01-11 00:02:21 INFO  TensorflowSession:536 - Epoch: 0 training error: 4446.359 valid error: 5216.47485 avg training time: 11.79280042645 avg valid time: 0.138170003891
    2020-01-11 00:20:10 INFO  TensorflowSession:611 - Epoch 1 is finish..
    2020-01-11 00:20:10 INFO  TensorflowSession:536 - Epoch: 1 training error: 2587.93865 valid error: 2910.67 avg training time: 10.8711990118 avg valid time: 0.07845103740695
    2020-01-11 00:27:17 INFO  TensorflowSession:611 - Epoch 2 is finish..
    2020-01-11 00:27:17 INFO  TensorflowSession:536 - Epoch: 2 training error: 1597.1203 valid error: 1807.56135 avg training time: 13.50664448735 avg valid time: 0.08722949028015001
    2020-01-11 00:35:08 INFO  TensorflowSession:611 - Epoch 3 is finish..
    2020-01-11 00:35:08 INFO  TensorflowSession:536 - Epoch: 3 training error: 1101.06755 valid error: 1289.1664 avg training time: 11.25977563855 avg valid time: 0.08584654331205
    2020-01-11 00:42:24 INFO  TensorflowSession:611 - Epoch 4 is finish..
    2020-01-11 00:42:24 INFO  TensorflowSession:536 - Epoch: 4 training error: 824.0795 valid error: 1015.9642699999999 avg training time: 19.25335097315 avg valid time: 0.08593952655789999
    2020-01-11 00:49:28 INFO  TensorflowSession:611 - Epoch 5 is finish..
    2020-01-11 00:49:28 INFO  TensorflowSession:536 - Epoch: 5 training error: 656.08635 valid error: 854.59115 avg training time: 11.05206656455 avg valid time: 0.08226001262665
    2020-01-11 00:56:39 INFO  TensorflowSession:611 - Epoch 6 is finish..
    2020-01-11 00:56:39 INFO  TensorflowSession:536 - Epoch: 6 training error: 543.03485 valid error: 747.95453 avg training time: 9.91331553457 avg valid time: 0.07909858226774999
    2020-01-11 01:03:33 INFO  TensorflowSession:611 - Epoch 7 is finish..
    2020-01-11 01:03:34 INFO  TensorflowSession:536 - Epoch: 7 training error: 461.09844999999996 valid error: 672.3511 avg training time: 10.20000398161 avg valid time: 0.08544349670415
    2020-01-11 01:12:04 INFO  TensorflowSession:611 - Epoch 8 is finish..
    2020-01-11 01:12:04 INFO  TensorflowSession:536 - Epoch: 8 training error: 399.5869 valid error: 615.57425 avg training time: 14.81870162485 avg valid time: 0.0898219347001
  3. Use the old code with 2GB memory workers.
    2020-01-11 02:58:57 ERROR AMRMCallbackHandler:123 - Container [pid=10630,containerID=container_e285_1577170850959_504332_01_000011] is running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 39.4 GB of 4.2 GB virtual memory used. Killing container.