b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License
1.03k stars, 69 forks

(Crashing on Low Memory SBC) main invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0 #59

Closed unclemusclez closed 1 month ago

unclemusclez commented 1 month ago

Is there any way that main and worker could be separated, so that I can use a cluster of 8 RPi 3B+ boards for the compute while the scheduling is offloaded to another device with more memory? I understand this is most likely not a priority. Perhaps a smaller model would work, such as TinyLlama (https://github.com/jzhang38/TinyLlama)?

main:

ubuntu@ubuntu:~/distributed-llama$ sudo main chat --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --model ~/dllama_meta-llama-3-8b_q40.bin --tokenizer ~/dllama-llama3-tokenizer.t --workers 192.168.2.212:9998 192.168.2.213:9998 192.168.2.214:9998 192.168.2.215:
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 8
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
Killed
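
To put the header values above next to the memory of a 1 GB board, here is a rough size estimate in Python. It is only a sketch under stated assumptions: the Q40 block layout (32 weights stored in 18 bytes) is assumed rather than taken from this thread, normalization weights are ignored because they are tiny, and the names `Header`, `count_weights`, and `q40_bytes` are hypothetical. The RAM figure uses the `242688 pages RAM` line from the kernel log in this report, with 4 kB pages (the log shows `active_anon:735` pages as `2940kB`).

```python
from dataclasses import dataclass

Q40_BLOCK_WEIGHTS = 32  # assumption: weights per Q40 block
Q40_BLOCK_BYTES = 18    # assumption: 2-byte scale + 16 bytes of packed 4-bit values
PAGE_BYTES = 4096       # 4 kB pages, consistent with the kernel log


@dataclass
class Header:
    """Hypothetical container for the values printed at startup."""
    dim: int
    hidden_dim: int
    n_layers: int
    n_heads: int
    n_kv_heads: int
    vocab_size: int
    n_slices: int


def count_weights(h: Header) -> dict:
    """Count matrix weights of a Llama-style model (norm weights ignored)."""
    kv_dim = h.dim * h.n_kv_heads // h.n_heads
    per_layer = (
        2 * h.dim * h.dim           # query and output projections
        + 2 * h.dim * kv_dim        # key and value projections
        + 3 * h.dim * h.hidden_dim  # three feed-forward matrices
    )
    return {
        "embedding": h.vocab_size * h.dim,
        "layers": h.n_layers * per_layer,
        "classifier": h.vocab_size * h.dim,
    }


def q40_bytes(n_weights: int) -> int:
    """Bytes needed for n_weights under the assumed Q40 block layout."""
    return n_weights // Q40_BLOCK_WEIGHTS * Q40_BLOCK_BYTES


def mib(n_bytes: int) -> float:
    return n_bytes / 2**20


header = Header(dim=4096, hidden_dim=14336, n_layers=32, n_heads=32,
                n_kv_heads=8, vocab_size=128256, n_slices=8)
weights = count_weights(header)
total_weights = sum(weights.values())
total_q40 = q40_bytes(total_weights)
even_share = total_q40 // header.n_slices   # idealized perfectly even split
embedding_f32 = weights["embedding"] * 4    # embedding table if kept as 32-bit floats
ram_bytes = 242688 * PAGE_BYTES             # "242688 pages RAM" from the kernel log

print(f"total weights:          {total_weights:,}")
print(f"all weights as Q40:     {mib(total_q40):.0f} MiB")
print(f"even share per node:    {mib(even_share):.0f} MiB")
print(f"embedding table as F32: {mib(embedding_f32):.0f} MiB")
print(f"RAM seen by the kernel: {mib(ram_bytes):.0f} MiB")
```

Under these assumptions an even eighth of the quantized weights would fit in RAM, but any node that has to hold an unsplit piece, such as a full-precision embedding table, would not. Whether the main node actually does that is not confirmed anywhere in this thread.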

worker:

ubuntu@ubuntu:~$ sudo nice -n -20 main worker --port 9998 --nthreads 4
Listening on 0.0.0.0:9998...
Client connected
terminate called after throwing an instance of 'ReadSocketException'
  what():  std::exception
Aborted
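
The worker output above is consistent with the other end of the connection disappearing in the middle of a read, which is what happens when the peer process is killed. The Python sketch below illustrates that failure mode in general terms; it is not the project's socket code, and `read_exact` and `demo` are hypothetical names. It uses a local socket pair so that it runs without a network.

```python
import socket


def read_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, or raise ConnectionError if the peer goes away first."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            # recv() returning b"" means the peer closed the connection.
            raise ConnectionError(
                f"peer closed the connection with {remaining} bytes still expected"
            )
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def demo():
    root, worker = socket.socketpair()
    root.sendall(b"abcd")
    root.close()  # stands in for the peer process being killed

    first = read_exact(worker, 4)  # data sent before the close still arrives
    try:
        read_exact(worker, 4)      # nothing more will ever arrive
    except ConnectionError as exc:
        outcome = f"worker stopped cleanly: {exc}"
    else:
        outcome = "unexpected data"
    finally:
        worker.close()
    return first, outcome


if __name__ == "__main__":
    print(demo())
```

Catching the error at the top level and printing a message would turn an abort like the one above into a readable "peer disconnected" line, but that is a design suggestion, not a description of current behavior.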
May 19 08:46:24 ubuntu kernel: [107061.602328] main invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
May 19 08:46:24 ubuntu kernel: [107061.602392] CPU: 0 PID: 4676 Comm: main Tainted: G         C  E     5.15.0-1055-raspi #58-Ubuntu
May 19 08:46:24 ubuntu kernel: [107061.602412] Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
May 19 08:46:24 ubuntu kernel: [107061.602423] Call trace:
May 19 08:46:24 ubuntu kernel: [107061.602430]  dump_backtrace+0x0/0x200
May 19 08:46:24 ubuntu kernel: [107061.602455]  show_stack+0x20/0x30
May 19 08:46:24 ubuntu kernel: [107061.602470]  dump_stack_lvl+0x8c/0xb8
May 19 08:46:24 ubuntu kernel: [107061.602490]  dump_stack+0x18/0x34
May 19 08:46:24 ubuntu kernel: [107061.602506]  dump_header+0x54/0x21c
May 19 08:46:24 ubuntu kernel: [107061.602520]  oom_kill_process+0x22c/0x230
May 19 08:46:24 ubuntu kernel: [107061.602539]  out_of_memory+0xf4/0x370
May 19 08:46:24 ubuntu kernel: [107061.602554]  __alloc_pages_slowpath.constprop.0+0x604/0x8e0
May 19 08:46:24 ubuntu kernel: [107061.602574]  __alloc_pages+0x29c/0x320
May 19 08:46:24 ubuntu kernel: [107061.602590]  alloc_zeroed_user_highpage_movable+0x40/0x50
May 19 08:46:24 ubuntu kernel: [107061.602607]  do_anonymous_page+0x88/0x4ec
May 19 08:46:24 ubuntu kernel: [107061.602628]  handle_pte_fault+0x170/0x1c0
May 19 08:46:24 ubuntu kernel: [107061.602642]  __handle_mm_fault+0x1d0/0x350
May 19 08:46:24 ubuntu kernel: [107061.602655]  handle_mm_fault+0x108/0x294
May 19 08:46:24 ubuntu kernel: [107061.602669]  faultin_page+0x84/0x150
May 19 08:46:24 ubuntu kernel: [107061.602685]  __get_user_pages+0x194/0x2c0
May 19 08:46:24 ubuntu kernel: [107061.602701]  populate_vma_page_range+0x64/0x70
May 19 08:46:24 ubuntu kernel: [107061.602719]  __mm_populate+0xc4/0x1d0
May 19 08:46:24 ubuntu kernel: [107061.602735]  do_mlock+0xdc/0x26c
May 19 08:46:24 ubuntu kernel: [107061.602750]  __arm64_sys_mlock+0x20/0x30
May 19 08:46:24 ubuntu kernel: [107061.602765]  invoke_syscall+0x50/0x120
May 19 08:46:24 ubuntu kernel: [107061.602784]  el0_svc_common.constprop.0+0x6c/0x1a0
May 19 08:46:24 ubuntu kernel: [107061.602803]  do_el0_svc+0x30/0xb0
May 19 08:46:24 ubuntu kernel: [107061.602820]  el0_svc+0x4c/0x170
May 19 08:46:24 ubuntu kernel: [107061.602837]  el0t_64_sync_handler+0xa4/0x130
May 19 08:46:24 ubuntu kernel: [107061.602854]  el0t_64_sync+0x1a4/0x1a8
May 19 08:46:24 ubuntu kernel: [107061.602888] Mem-Info:
May 19 08:46:24 ubuntu kernel: [107061.602905] active_anon:735 inactive_anon:16569 isolated_anon:0
May 19 08:46:24 ubuntu kernel: [107061.602905]  active_file:36 inactive_file:28 isolated_file:0
May 19 08:46:24 ubuntu kernel: [107061.602905]  unevictable:185356 dirty:0 writeback:0
May 19 08:46:24 ubuntu kernel: [107061.602905]  slab_reclaimable:6070 slab_unreclaimable:10550
May 19 08:46:24 ubuntu kernel: [107061.602905]  mapped:1869 shmem:749 pagetables:923 bounce:0
May 19 08:46:24 ubuntu kernel: [107061.602905]  kernel_misc_reclaimable:0
May 19 08:46:24 ubuntu kernel: [107061.602905]  free:5609 free_pcp:0 free_cma:0
May 19 08:46:24 ubuntu kernel: [107061.602949] Node 0 active_anon:2940kB inactive_anon:66276kB active_file:144kB inactive_file:112kB unevictable:741424kB isolated(anon):0kB isolated(file):0kB mapped:7476kB dirty:0kB writeback:0kB shmem:2996kB >
May 19 08:46:24 ubuntu kernel: [107061.602992] DMA free:22436kB min:24576kB low:30208kB high:35840kB reserved_highatomic:0KB active_anon:2940kB inactive_anon:66276kB active_file:196kB inactive_file:292kB unevictable:741332kB writepending:0kB p>
May 19 08:46:24 ubuntu kernel: [107061.603035] lowmem_reserve[]: 0 0 0 0
May 19 08:46:24 ubuntu kernel: [107061.603114] DMA: 1113*4kB (UME) 633*8kB (UME) 296*16kB (UME) 129*32kB (UME) 48*64kB (UME) 11*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22860kB
May 19 08:46:24 ubuntu kernel: [107061.603406] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
May 19 08:46:24 ubuntu kernel: [107061.603428] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
May 19 08:46:24 ubuntu kernel: [107061.603449] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 19 08:46:24 ubuntu kernel: [107061.603469] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
May 19 08:46:24 ubuntu kernel: [107061.603489] 2704 total pagecache pages
May 19 08:46:24 ubuntu kernel: [107061.603504] 0 pages in swap cache
May 19 08:46:24 ubuntu kernel: [107061.603518] Swap cache stats: add 0, delete 0, find 0/0
May 19 08:46:24 ubuntu kernel: [107061.603536] Free swap  = 0kB
May 19 08:46:24 ubuntu kernel: [107061.603550] Total swap = 0kB
May 19 08:46:24 ubuntu kernel: [107061.603565] 242688 pages RAM
May 19 08:46:24 ubuntu kernel: [107061.603580] 0 pages HighMem/MovableOnly
May 19 08:46:24 ubuntu kernel: [107061.603594] 10931 pages reserved
May 19 08:46:24 ubuntu kernel: [107061.603609] 16384 pages cma reserved
May 19 08:46:24 ubuntu kernel: [107061.603624] Tasks state (memory values in pages):
May 19 08:46:24 ubuntu kernel: [107061.603638] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
May 19 08:46:24 ubuntu kernel: [107061.603685] [    379]     0   379    12038      852    94208        0          -250 systemd-journal
May 19 08:46:24 ubuntu kernel: [107061.603716] [    406]     0   406    72414     6415   118784        0         -1000 multipathd
May 19 08:46:24 ubuntu kernel: [107061.603745] [    420]     0   420     5982      942    69632        0         -1000 systemd-udevd
May 19 08:46:24 ubuntu kernel: [107061.603789] [    553]   103   553    22163      732    77824        0             0 systemd-timesyn
May 19 08:46:24 ubuntu kernel: [107061.603819] [    612]   100   612     4068      777    73728        0             0 systemd-network
May 19 08:46:24 ubuntu kernel: [107061.603847] [    614]   101   614     6339     1633    90112        0             0 systemd-resolve
May 19 08:46:24 ubuntu kernel: [107061.603875] [    625]   102   625     2267      838    57344        0          -900 dbus-daemon
May 19 08:46:24 ubuntu kernel: [107061.603904] [    629]     0   629    20487      611    65536        0             0 irqbalance
May 19 08:46:24 ubuntu kernel: [107061.603933] [    634]     0   634     8236     2733   114688        0             0 networkd-dispat
May 19 08:46:24 ubuntu kernel: [107061.603961] [    640]   104   640    55504      826    81920        0             0 rsyslogd
May 19 08:46:24 ubuntu kernel: [107061.603989] [    644]     0   644   366640     2855   249856        0          -900 snapd
May 19 08:46:24 ubuntu kernel: [107061.604017] [    653]     0   653     3887      791    69632        0             0 systemd-logind
May 19 08:46:24 ubuntu kernel: [107061.604045] [    655]     0   655     3809      626    73728        0             0 wpa_supplicant
May 19 08:46:24 ubuntu kernel: [107061.604073] [    683]     0   683     1727      501    45056        0             0 cron
May 19 08:46:24 ubuntu kernel: [107061.604100] [    703]     0   703    27482     2589   110592        0             0 unattended-upgr
May 19 08:46:24 ubuntu kernel: [107061.604128] [    710]     0   710     1408      126    53248        0             0 agetty
May 19 08:46:24 ubuntu kernel: [107061.604155] [    712]     0   712     1397      139    49152        0             0 agetty
May 19 08:46:24 ubuntu kernel: [107061.604183] [    720]     0   720     3788     1039    69632        0         -1000 sshd
May 19 08:46:24 ubuntu kernel: [107061.604211] [    844]     0   844      559       44    36864        0             0 hciattach
May 19 08:46:24 ubuntu kernel: [107061.604239] [    856]     0   856     2384      602    61440        0             0 bluetoothd
May 19 08:46:24 ubuntu kernel: [107061.604266] [   1172]     0  1172    74368     1369   167936        0             0 packagekitd
May 19 08:46:24 ubuntu kernel: [107061.604305] [   1178]     0  1178    58582      814    94208        0             0 polkitd
May 19 08:46:24 ubuntu kernel: [107061.604336] [   4481]     0  4481     4596     1078    81920        0             0 sshd
May 19 08:46:24 ubuntu kernel: [107061.604364] [   4484]  1000  4484     4559     1187    73728        0             0 systemd
May 19 08:46:24 ubuntu kernel: [107061.604391] [   4485]  1000  4485    42829     1235   110592        0             0 (sd-pam)
May 19 08:46:24 ubuntu kernel: [107061.604421] [   4571]  1000  4571     4631      881    81920        0             0 sshd
May 19 08:46:24 ubuntu kernel: [107061.604448] [   4572]  1000  4572     2147      846    53248        0             0 bash
May 19 08:46:24 ubuntu kernel: [107061.604481] [   4674]  1000  4674     3345      616    61440        0             0 sudo
May 19 08:46:24 ubuntu kernel: [107061.604509] [   4675]  1000  4675     3345      172    61440        0             0 sudo
May 19 08:46:24 ubuntu kernel: [107061.604536] [   4676]     0  4676  1725546   180701  1495040        0             0 main
May 19 08:46:24 ubuntu kernel: [107061.604563] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-39.scope,task=main,pid=4676,uid=0
May 19 08:46:24 ubuntu kernel: [107061.604827] Out of memory: Killed process 4676 (main) total-vm:6902184kB, anon-rss:721280kB, file-rss:1524kB, shmem-rss:0kB, UID:0 pgtables:1460kB oom_score_adj:0
May 19 08:46:25 ubuntu systemd[1]: session-39.scope: A process of this unit has been killed by the OOM killer.
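
The "Tasks state" table in the log above reports memory in pages, which makes it hard to compare with the kB figures on the final "Out of memory" line. The Python sketch below parses a few rows copied verbatim from the log and converts the resident set size to kB, assuming 4 kB pages (the log shows `active_anon:735` pages as `2940kB`). The regular expression and the helper name `parse_tasks` are illustrative, not part of any tool.

```python
import re

PAGE_KB = 4  # 4 kB pages, consistent with the Mem-Info section of the log

# Matches one row of the "Tasks state" table. The kernel timestamp is also in
# square brackets, but it contains a dot, so it does not match the pid field.
TASK_ROW = re.compile(
    r"\[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+(?P<total_vm>\d+)\s+"
    r"(?P<rss>\d+)\s+(?P<pgtables_bytes>\d+)\s+(?P<swapents>\d+)\s+"
    r"(?P<oom_score_adj>-?\d+)\s+(?P<name>.+)$"
)

SAMPLE = """\
May 19 08:46:24 ubuntu kernel: [107061.603716] [    406]     0   406    72414     6415   118784        0         -1000 multipathd
May 19 08:46:24 ubuntu kernel: [107061.603989] [    644]     0   644   366640     2855   249856        0          -900 snapd
May 19 08:46:24 ubuntu kernel: [107061.604536] [   4676]     0  4676  1725546   180701  1495040        0             0 main
"""


def parse_tasks(text: str) -> list:
    """Return one dict per task row; numeric columns are converted to int."""
    tasks = []
    for line in text.splitlines():
        match = TASK_ROW.search(line)
        if match:
            row = {
                key: (value if key == "name" else int(value))
                for key, value in match.groupdict().items()
            }
            tasks.append(row)
    return tasks


tasks = parse_tasks(SAMPLE)
largest = max(tasks, key=lambda task: task["rss"])

for task in sorted(tasks, key=lambda task: task["rss"], reverse=True):
    print(f"{task['name']:<12} rss={task['rss'] * PAGE_KB:>8} kB "
          f"total_vm={task['total_vm'] * PAGE_KB:>9} kB")
```

For the `main` row, the converted values agree with the kill line: the resident size equals `anon-rss` plus `file-rss`, and the virtual size equals `total-vm`. Almost all of that resident memory is counted as `unevictable`, and the call trace passes through `do_mlock`, so the pages were locked in RAM and, with no swap configured, could not be reclaimed.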

b4rtaz commented 1 month ago

TinyLlama seems to work now, so I'm closing this issue.