Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License
(Crashing on Low Memory SBC) main invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0 #59
Is there any way that main and worker could be separated, so that I can use a cluster of 8 RPi 3B+ boards for the compute while the scheduling is offloaded to another device with more memory?
I understand this is most likely not a priority.
Alternatively, would a smaller model such as TinyLlama (https://github.com/jzhang38/TinyLlama) work?
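The repository description says the workload and RAM usage are divided across devices. As a rough back-of-the-envelope check of the "smaller model" idea, the sketch below estimates how much weight memory each node would hold if the weights were split evenly across nodes. It assumes a Q4_0-style quantization (blocks of 32 4-bit weights plus a 16-bit scale, i.e. 18 bytes per 32 weights) and ignores the KV cache, activation buffers, and any tensors that may be kept only on the root node, so it is an estimate, not a description of how distributed-llama actually allocates memory. The 1.1B parameter count is TinyLlama's; the RAM figure comes from the `242688 pages RAM` line in the kernel log below (4 KiB pages).

```python
# Rough per-node weight-memory estimate under an even tensor-parallel split.
# Assumption: Q4_0-style blocks (32 weights -> 16 bytes of 4-bit values + 2-byte scale = 18 bytes).
BYTES_PER_BLOCK = 18
WEIGHTS_PER_BLOCK = 32
MIB = 1024 * 1024


def quantized_weight_bytes(n_params: int) -> float:
    """Approximate size of all weights in the assumed 4-bit block format."""
    return n_params * BYTES_PER_BLOCK / WEIGHTS_PER_BLOCK


def per_node_weight_mib(n_params: int, n_nodes: int) -> float:
    """Weight share per node if every weight matrix is split evenly across n_nodes."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be >= 1")
    return quantized_weight_bytes(n_params) / n_nodes / MIB


# RPi 3B+ total RAM as reported by the kernel: 242688 pages of 4 KiB each.
RPI3_RAM_MIB = 242688 * 4 / 1024

if __name__ == "__main__":
    tinyllama = 1_100_000_000  # TinyLlama-1.1B (approximate parameter count)
    for nodes in (1, 2, 4, 8):
        share = per_node_weight_mib(tinyllama, nodes)
        print(f"{nodes} node(s): ~{share:.0f} MiB of weights per node "
              f"(of {RPI3_RAM_MIB:.0f} MiB RAM)")
```

Under these assumptions the weights alone come to roughly 590 MiB on a single node and about 74 MiB per node across 8 nodes; the real footprint will be higher once buffers and the KV cache are included.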
ubuntu@ubuntu:~$ sudo nice -n -20 main worker --port 9998 --nthreads 4
Listening on 0.0.0.0:9998...
Client connected
terminate called after throwing an instance of 'ReadSocketException'
what(): std::exception
Aborted
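The worker aborts with an uncaught `ReadSocketException` right after `Client connected`. The project's socket code is not shown in this issue, so the following is only a generic Python illustration, built around a hypothetical `read_exact` helper, of why a blocking reader fails when the other end of the connection goes away (for example, if the peer process is killed): `recv` returns an empty byte string once the peer has closed the connection, and a reader that needs a fixed number of bytes has to treat that as an error.

```python
import socket


def read_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from sock, or raise if the peer closes first.

    Hypothetical helper for illustration; not distributed-llama's implementation.
    """
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:  # b"" means the peer closed the connection
            raise ConnectionError(f"peer closed after {n - remaining} of {n} bytes")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    a, b = socket.socketpair()
    a.sendall(b"abcd")
    print(read_exact(b, 4))  # b'abcd'
    a.close()  # simulate the peer going away
    try:
        read_exact(b, 4)
    except ConnectionError as exc:
        print("read failed:", exc)
    b.close()
```

In C++, the analogous situation is a `read`/`recv` call returning 0; if the resulting exception is never caught, the runtime calls `std::terminate`, which produces exactly the "terminate called after throwing an instance of ..." / "Aborted" pattern seen above.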
May 19 08:46:24 ubuntu kernel: [107061.602328] main invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
May 19 08:46:24 ubuntu kernel: [107061.602392] CPU: 0 PID: 4676 Comm: main Tainted: G C E 5.15.0-1055-raspi #58-Ubuntu
May 19 08:46:24 ubuntu kernel: [107061.602412] Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
May 19 08:46:24 ubuntu kernel: [107061.602423] Call trace:
May 19 08:46:24 ubuntu kernel: [107061.602430] dump_backtrace+0x0/0x200
May 19 08:46:24 ubuntu kernel: [107061.602455] show_stack+0x20/0x30
May 19 08:46:24 ubuntu kernel: [107061.602470] dump_stack_lvl+0x8c/0xb8
May 19 08:46:24 ubuntu kernel: [107061.602490] dump_stack+0x18/0x34
May 19 08:46:24 ubuntu kernel: [107061.602506] dump_header+0x54/0x21c
May 19 08:46:24 ubuntu kernel: [107061.602520] oom_kill_process+0x22c/0x230
May 19 08:46:24 ubuntu kernel: [107061.602539] out_of_memory+0xf4/0x370
May 19 08:46:24 ubuntu kernel: [107061.602554] __alloc_pages_slowpath.constprop.0+0x604/0x8e0
May 19 08:46:24 ubuntu kernel: [107061.602574] __alloc_pages+0x29c/0x320
May 19 08:46:24 ubuntu kernel: [107061.602590] alloc_zeroed_user_highpage_movable+0x40/0x50
May 19 08:46:24 ubuntu kernel: [107061.602607] do_anonymous_page+0x88/0x4ec
May 19 08:46:24 ubuntu kernel: [107061.602628] handle_pte_fault+0x170/0x1c0
May 19 08:46:24 ubuntu kernel: [107061.602642] __handle_mm_fault+0x1d0/0x350
May 19 08:46:24 ubuntu kernel: [107061.602655] handle_mm_fault+0x108/0x294
May 19 08:46:24 ubuntu kernel: [107061.602669] faultin_page+0x84/0x150
May 19 08:46:24 ubuntu kernel: [107061.602685] __get_user_pages+0x194/0x2c0
May 19 08:46:24 ubuntu kernel: [107061.602701] populate_vma_page_range+0x64/0x70
May 19 08:46:24 ubuntu kernel: [107061.602719] __mm_populate+0xc4/0x1d0
May 19 08:46:24 ubuntu kernel: [107061.602735] do_mlock+0xdc/0x26c
May 19 08:46:24 ubuntu kernel: [107061.602750] __arm64_sys_mlock+0x20/0x30
May 19 08:46:24 ubuntu kernel: [107061.602765] invoke_syscall+0x50/0x120
May 19 08:46:24 ubuntu kernel: [107061.602784] el0_svc_common.constprop.0+0x6c/0x1a0
May 19 08:46:24 ubuntu kernel: [107061.602803] do_el0_svc+0x30/0xb0
May 19 08:46:24 ubuntu kernel: [107061.602820] el0_svc+0x4c/0x170
May 19 08:46:24 ubuntu kernel: [107061.602837] el0t_64_sync_handler+0xa4/0x130
May 19 08:46:24 ubuntu kernel: [107061.602854] el0t_64_sync+0x1a4/0x1a8
May 19 08:46:24 ubuntu kernel: [107061.602888] Mem-Info:
May 19 08:46:24 ubuntu kernel: [107061.602905] active_anon:735 inactive_anon:16569 isolated_anon:0
May 19 08:46:24 ubuntu kernel: [107061.602905] active_file:36 inactive_file:28 isolated_file:0
May 19 08:46:24 ubuntu kernel: [107061.602905] unevictable:185356 dirty:0 writeback:0
May 19 08:46:24 ubuntu kernel: [107061.602905] slab_reclaimable:6070 slab_unreclaimable:10550
May 19 08:46:24 ubuntu kernel: [107061.602905] mapped:1869 shmem:749 pagetables:923 bounce:0
May 19 08:46:24 ubuntu kernel: [107061.602905] kernel_misc_reclaimable:0
May 19 08:46:24 ubuntu kernel: [107061.602905] free:5609 free_pcp:0 free_cma:0
May 19 08:46:24 ubuntu kernel: [107061.602949] Node 0 active_anon:2940kB inactive_anon:66276kB active_file:144kB inactive_file:112kB unevictable:741424kB isolated(anon):0kB isolated(file):0kB mapped:7476kB dirty:0kB writeback:0kB shmem:2996kB >
May 19 08:46:24 ubuntu kernel: [107061.602992] DMA free:22436kB min:24576kB low:30208kB high:35840kB reserved_highatomic:0KB active_anon:2940kB inactive_anon:66276kB active_file:196kB inactive_file:292kB unevictable:741332kB writepending:0kB p>
May 19 08:46:24 ubuntu kernel: [107061.603035] lowmem_reserve[]: 0 0 0 0
May 19 08:46:24 ubuntu kernel: [107061.603114] DMA: 1113*4kB (UME) 633*8kB (UME) 296*16kB (UME) 129*32kB (UME) 48*64kB (UME) 11*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22860kB
May 19 08:46:24 ubuntu kernel: [107061.603406] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
May 19 08:46:24 ubuntu kernel: [107061.603428] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
May 19 08:46:24 ubuntu kernel: [107061.603449] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 19 08:46:24 ubuntu kernel: [107061.603469] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
May 19 08:46:24 ubuntu kernel: [107061.603489] 2704 total pagecache pages
May 19 08:46:24 ubuntu kernel: [107061.603504] 0 pages in swap cache
May 19 08:46:24 ubuntu kernel: [107061.603518] Swap cache stats: add 0, delete 0, find 0/0
May 19 08:46:24 ubuntu kernel: [107061.603536] Free swap = 0kB
May 19 08:46:24 ubuntu kernel: [107061.603550] Total swap = 0kB
May 19 08:46:24 ubuntu kernel: [107061.603565] 242688 pages RAM
May 19 08:46:24 ubuntu kernel: [107061.603580] 0 pages HighMem/MovableOnly
May 19 08:46:24 ubuntu kernel: [107061.603594] 10931 pages reserved
May 19 08:46:24 ubuntu kernel: [107061.603609] 16384 pages cma reserved
May 19 08:46:24 ubuntu kernel: [107061.603624] Tasks state (memory values in pages):
May 19 08:46:24 ubuntu kernel: [107061.603638] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
May 19 08:46:24 ubuntu kernel: [107061.603685] [ 379] 0 379 12038 852 94208 0 -250 systemd-journal
May 19 08:46:24 ubuntu kernel: [107061.603716] [ 406] 0 406 72414 6415 118784 0 -1000 multipathd
May 19 08:46:24 ubuntu kernel: [107061.603745] [ 420] 0 420 5982 942 69632 0 -1000 systemd-udevd
May 19 08:46:24 ubuntu kernel: [107061.603789] [ 553] 103 553 22163 732 77824 0 0 systemd-timesyn
May 19 08:46:24 ubuntu kernel: [107061.603819] [ 612] 100 612 4068 777 73728 0 0 systemd-network
May 19 08:46:24 ubuntu kernel: [107061.603847] [ 614] 101 614 6339 1633 90112 0 0 systemd-resolve
May 19 08:46:24 ubuntu kernel: [107061.603875] [ 625] 102 625 2267 838 57344 0 -900 dbus-daemon
May 19 08:46:24 ubuntu kernel: [107061.603904] [ 629] 0 629 20487 611 65536 0 0 irqbalance
May 19 08:46:24 ubuntu kernel: [107061.603933] [ 634] 0 634 8236 2733 114688 0 0 networkd-dispat
May 19 08:46:24 ubuntu kernel: [107061.603961] [ 640] 104 640 55504 826 81920 0 0 rsyslogd
May 19 08:46:24 ubuntu kernel: [107061.603989] [ 644] 0 644 366640 2855 249856 0 -900 snapd
May 19 08:46:24 ubuntu kernel: [107061.604017] [ 653] 0 653 3887 791 69632 0 0 systemd-logind
May 19 08:46:24 ubuntu kernel: [107061.604045] [ 655] 0 655 3809 626 73728 0 0 wpa_supplicant
May 19 08:46:24 ubuntu kernel: [107061.604073] [ 683] 0 683 1727 501 45056 0 0 cron
May 19 08:46:24 ubuntu kernel: [107061.604100] [ 703] 0 703 27482 2589 110592 0 0 unattended-upgr
May 19 08:46:24 ubuntu kernel: [107061.604128] [ 710] 0 710 1408 126 53248 0 0 agetty
May 19 08:46:24 ubuntu kernel: [107061.604155] [ 712] 0 712 1397 139 49152 0 0 agetty
May 19 08:46:24 ubuntu kernel: [107061.604183] [ 720] 0 720 3788 1039 69632 0 -1000 sshd
May 19 08:46:24 ubuntu kernel: [107061.604211] [ 844] 0 844 559 44 36864 0 0 hciattach
May 19 08:46:24 ubuntu kernel: [107061.604239] [ 856] 0 856 2384 602 61440 0 0 bluetoothd
May 19 08:46:24 ubuntu kernel: [107061.604266] [ 1172] 0 1172 74368 1369 167936 0 0 packagekitd
May 19 08:46:24 ubuntu kernel: [107061.604305] [ 1178] 0 1178 58582 814 94208 0 0 polkitd
May 19 08:46:24 ubuntu kernel: [107061.604336] [ 4481] 0 4481 4596 1078 81920 0 0 sshd
May 19 08:46:24 ubuntu kernel: [107061.604364] [ 4484] 1000 4484 4559 1187 73728 0 0 systemd
May 19 08:46:24 ubuntu kernel: [107061.604391] [ 4485] 1000 4485 42829 1235 110592 0 0 (sd-pam)
May 19 08:46:24 ubuntu kernel: [107061.604421] [ 4571] 1000 4571 4631 881 81920 0 0 sshd
May 19 08:46:24 ubuntu kernel: [107061.604448] [ 4572] 1000 4572 2147 846 53248 0 0 bash
May 19 08:46:24 ubuntu kernel: [107061.604481] [ 4674] 1000 4674 3345 616 61440 0 0 sudo
May 19 08:46:24 ubuntu kernel: [107061.604509] [ 4675] 1000 4675 3345 172 61440 0 0 sudo
May 19 08:46:24 ubuntu kernel: [107061.604536] [ 4676] 0 4676 1725546 180701 1495040 0 0 main
May 19 08:46:24 ubuntu kernel: [107061.604563] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-39.scope,task=main,pid=4676,uid=0
May 19 08:46:24 ubuntu kernel: [107061.604827] Out of memory: Killed process 4676 (main) total-vm:6902184kB, anon-rss:721280kB, file-rss:1524kB, shmem-rss:0kB, UID:0 pgtables:1460kB oom_score_adj:0
May 19 08:46:25 ubuntu systemd[1]: session-39.scope: A process of this unit has been killed by the OOM killer.
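Most of the Mem-Info counters above are in pages; on this system a page is 4 KiB, which is why `unevictable:185356` (pages) matches `unevictable:741424kB`. The sketch below, a hypothetical helper for reading this kind of report rather than a kernel or project tool, converts the figures from the log into MiB: the killed `main` process had about 704 MiB of anonymous memory resident on a board with 948 MiB of RAM and no swap, and about 724 MiB of memory was unevictable. The call trace also shows the failing allocation happening inside `mlock` (`do_mlock` via `__mm_populate`), which faults pages in and pins them in RAM, so they cannot be swapped out or reclaimed.

```python
import re

PAGE_KIB = 4  # page size on this kernel, consistent with the Mem-Info lines

KILL_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)


def parse_kill_line(line: str) -> dict:
    """Extract pid, name and memory figures (in kB) from an OOM 'Killed process' line."""
    m = KILL_RE.search(line)
    if m is None:
        raise ValueError("not an OOM kill line")
    return {k: (v if k == "name" else int(v)) for k, v in m.groupdict().items()}


def pages_to_mib(pages: int, page_kib: int = PAGE_KIB) -> float:
    """Convert a page count to MiB."""
    return pages * page_kib / 1024


if __name__ == "__main__":
    line = ("Out of memory: Killed process 4676 (main) total-vm:6902184kB, "
            "anon-rss:721280kB, file-rss:1524kB, shmem-rss:0kB")
    info = parse_kill_line(line)
    ram_mib = pages_to_mib(242688)     # "242688 pages RAM"
    locked_mib = pages_to_mib(185356)  # "unevictable:185356"
    print(f"RAM: {ram_mib:.0f} MiB, unevictable: {locked_mib:.0f} MiB")
    print(f"{info['name']} anon RSS: {info['anon_rss'] / 1024:.0f} MiB, "
          f"virtual: {info['total_vm'] / 1024 / 1024:.1f} GiB")
```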
main:
Worker