b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License

add distributed llama on docker container test #11

Open weedge opened 4 months ago

weedge commented 4 months ago
```sh
# 1 worker + inference
make docker-1-worker-inference

# 3 workers + inference, like this:
make docker-3-worker-inference WORKERS="172.18.0.2:9997 172.18.0.3:9997 172.18.0.4:9997"
```
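
The PR's Makefile targets aren't shown here, but as a rough sketch, the 3-worker target presumably does something along these lines. The image name (`dllama`), binary path (`./main`), and all flags below are illustrative assumptions, not necessarily the project's actual CLI:

```sh
# Hypothetical expansion of the docker-3-worker-inference target.
# Create a user-defined bridge network matching the 172.18.0.x addresses above.
docker network create --subnet=172.18.0.0/16 dllama-net

# Start three worker containers, each pinned to a fixed IP and listening on 9997.
for ip in 172.18.0.2 172.18.0.3 172.18.0.4; do
  docker run -d --net dllama-net --ip "$ip" dllama \
    ./main worker --port 9997 --nthreads 1
done

# Run the root/inference node, pointing it at the three workers.
docker run --rm --net dllama-net --ip 172.18.0.5 dllama \
  ./main inference --model stories42M.bin --nthreads 1 \
  --workers 172.18.0.2:9997 172.18.0.3:9997 172.18.0.4:9997
```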

My local test on Docker containers (using the default checkpoint, stories42M.bin):
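
(In these logs, assuming the per-token meter format distributed-llama used at the time: G is total generation time per token, I inference time, T transfer time, and S/R the kilobytes sent/received.)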

1. 1 worker (1 thread) + inference (1 thread)

   ```
   💡 dim: 512
   💡 hiddenDim: 1376
   💡 nLayers: 8
   💡 nHeads: 8
   💡 nKvHeads: 8
   💡 vocabSize: 32000
   💡 seqLen: 1024
   💡 nSlices: 2
   ⏩ Loaded 232556544 bytes
   🔶 G   38 ms I   38 ms T    0 ms S  49477 kB R     61 kB Hello
   🔶 G   42 ms I   39 ms T    2 ms S     69 kB R     61 kB  was
   🔶 G   44 ms I   42 ms T    1 ms S     69 kB R     61 kB  in
   🔶 G   44 ms I   39 ms T    5 ms S     69 kB R     61 kB  the
   🔶 G   42 ms I   42 ms T    0 ms S     69 kB R     61 kB  park
   🔶 G   47 ms I   45 ms T    2 ms S     69 kB R     61 kB .
   🔶 G   44 ms I   41 ms T    2 ms S     69 kB R     61 kB  It
   🔶 G   43 ms I   40 ms T    3 ms S     69 kB R     61 kB  was
   🔶 G   42 ms I   39 ms T    3 ms S     69 kB R     61 kB  a
   🔶 G   40 ms I   39 ms T    1 ms S     69 kB R     61 kB  beautiful
   🔶 G   42 ms I   38 ms T    4 ms S     69 kB R     61 kB  day
   🔶 G   43 ms I   40 ms T    2 ms S     69 kB R     61 kB ,
   🔶 G   43 ms I   39 ms T    3 ms S     69 kB R     61 kB  and
   🔶 G   41 ms I   39 ms T    1 ms S     69 kB R     61 kB  the
   🔶 G   47 ms I   40 ms T    6 ms S     69 kB R     61 kB  sun
   🔶 G   45 ms I   41 ms T    4 ms S     69 kB R     61 kB  was
   Generated tokens:    16
   Avg generation time: 42.94 ms
   Avg inference time:  40.06 ms
   Avg transfer time:   2.44 ms
   ```
2. 3 workers (1 thread each) + inference (1 thread)

   ```
   💡 dim: 512
   💡 hiddenDim: 1376
   💡 nLayers: 8
   💡 nHeads: 8
   💡 nKvHeads: 8
   💡 vocabSize: 32000
   💡 seqLen: 1024
   💡 nSlices: 4
   ⏩ Loaded 232556544 bytes
   🔶 G   41 ms I   34 ms T    7 ms S  74352 kB R     92 kB Hello
   🔶 G   48 ms I   42 ms T    5 ms S    240 kB R     92 kB  was
   🔶 G   65 ms I   45 ms T   18 ms S    240 kB R     92 kB  in
   🔶 G   45 ms I   34 ms T   10 ms S    240 kB R     92 kB  the
   🔶 G   35 ms I   33 ms T    2 ms S    240 kB R     92 kB  park
   🔶 G   38 ms I   34 ms T    3 ms S    240 kB R     92 kB .
   🔶 G   43 ms I   35 ms T    8 ms S    240 kB R     92 kB  It
   🔶 G   47 ms I   38 ms T    8 ms S    240 kB R     92 kB  was
   🔶 G   41 ms I   34 ms T    7 ms S    240 kB R     92 kB  a
   🔶 G   45 ms I   38 ms T    6 ms S    240 kB R     92 kB  beautiful
   🔶 G   37 ms I   35 ms T    2 ms S    240 kB R     92 kB  day
   🔶 G   36 ms I   33 ms T    3 ms S    240 kB R     92 kB .
   🔶 G   40 ms I   35 ms T    5 ms S    240 kB R     92 kB  There
   🔶 G   40 ms I   35 ms T    5 ms S    240 kB R     92 kB  was
   🔶 G   36 ms I   33 ms T    2 ms S    240 kB R     92 kB  a
   🔶 G   41 ms I   33 ms T    8 ms S    240 kB R     92 kB  bird
   Generated tokens:    16
   Avg generation time: 42.38 ms
   Avg inference time:  35.69 ms
   Avg transfer time:   6.19 ms
   ```
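
Comparing the two runs: going from 2 to 4 slices lowers the average inference time (40.06 → 35.69 ms) but raises the average transfer time (2.44 → 6.19 ms), so the overall generation time is nearly unchanged (42.94 vs 42.38 ms) on a model this small.
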
zhengpeirong commented 1 month ago

Cool! This is valuable for research. However, this project has changed a lot since your last commit. Could you please update this for the latest version?