IBM / ibmichroot

A set of scripts to facilitate the use of chroot-based containers for IBM i
MIT License

fatal: Out of memory? mmap failed: No such device #30

Closed abmusse closed 7 years ago

abmusse commented 8 years ago

Original report by Aaron Bartell (Bitbucket: aaronbartell, GitHub: aaronbartell).


I have installed ibmichroot on a customer's machine. I created five custom chroot environments, one for each developer. Each environment is Node.js + Git.

Git worked great for a couple of weeks for all developers, and then two days ago it started throwing the error message below in each and every chroot environment, including an environment that hasn't been in use since we originally created it.

fatal: Out of memory? mmap failed: No such device

I've had this issue before with other customers and it was remedied by creating chroot environments (seclusion and selection of exact binaries and libs), so now I am scratching my head on how five separate chroot environments all started getting the same Git error at the same time.

This occurs for all Git commands (e.g. git init in a new folder, git status for an existing repo, etc.) in all the chroot environments. Further, Git is not installed outside of chroot (shouldn't make a difference, but wanted to note).

Google searches have turned up a couple things** but none of them have resolved the issue.

**

Thoughts on how to further debug this?
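One small thing that can be checked without a debugger is which errno the tail of the message corresponds to. The sketch below assumes git builds this message as "what it was doing" followed by the C library's strerror() text after the last colon; that is an assumption about the message format, not something confirmed in this thread. The helper names are made up for illustration, and the lookup uses the errno table of whatever machine runs the script, which may not match PASE exactly.

```python
import errno
import os


def split_git_fatal(line):
    """Split 'fatal: <what git was doing>: <strerror text>' into its two parts."""
    body = line.split("fatal: ", 1)[1]
    context, _, reason = body.rpartition(": ")
    return context, reason


def errno_names_for_message(message):
    """Return the errno names whose strerror() text equals `message` on this host."""
    names = []
    for number, name in sorted(errno.errorcode.items()):
        if os.strerror(number) == message:
            names.append(name)
    return names


context, reason = split_git_fatal("fatal: Out of memory? mmap failed: No such device")
print(context)   # Out of memory? mmap failed
print(reason)    # No such device
# Which errno name(s) carry that text depends on the host's C library.
print(errno_names_for_message(reason))
```

If the name that comes back is not ENOMEM, the "Out of memory?" part of the message may be git guessing at the cause, and the errno value itself may be the more useful clue.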

abmusse commented 8 years ago

Original comment by Aaron Bartell (Bitbucket: aaronbartell, GitHub: aaronbartell).


Thanks for guiding this adventure. My first hopping of grass:

% mkdir git_dbx && cd git_dbx
% dbx -d 100 git
Type 'help' for help.
cannot read git
enter object file name (default is `a.out', ^D to exit): libc.a  <---- tried libc.a as a wild guess
cannot read libc.a
enter object file name (default is `a.out', ^D to exit):   <---- tried a.out 
cannot read a.out

I am reading through the AIX dbx docs to see what I can try next.

NOTE: I am trying this on my machine, where git works, before trying on customer's machine.

abmusse commented 8 years ago

Original comment by Tony Cairns (Bitbucket: rangercairns, GitHub: rangercairns).


Oh one more ... dbx may have nest depth issue on your open source git project ... use -d 100.

#!shell

bash-4.3$ dbx -d 100 zzmini
Type 'help' for help.
reading symbolic information ...
(dbx) stopi in mmap
[1] stopi in glink.mmap
(dbx) run do this here tex
[1] stopped in glink.mmap at 0x100006a0
0x100006a0 (mmap)    81820058         lwz   r12,0x58(r2)
(dbx) cont
i am not a frog.
i am a toad.
so there.
sniffle.

execution completed
(dbx) quit

abmusse commented 8 years ago

Original comment by Tony Cairns (Bitbucket: rangercairns, GitHub: rangercairns).


Oh, one trick with dbx. If your git has parameters (> git do this here thing tex), then you need to start dbx with just the main program, set the stop, then run with parameters.

#!shell

bash-4.3$ dbx zzmini
Type 'help' for help.
reading symbolic information ...
(dbx) stopi in mmap
[1] stopi in glink.mmap
(dbx) run do this here tex
[1] stopped in glink.mmap at 0x100006a0
0x100006a0 (mmap)    81820058         lwz   r12,0x58(r2)
(dbx) 

abmusse commented 8 years ago

Original comment by Tony Cairns (Bitbucket: rangercairns, GitHub: rangercairns).


Well, let's see if we can make you into a pase object code debug person (should be interesting).

The following zzmini example demonstrates a dbx stop in the libc.a export for syscall mmap. In your case, of course, substitute git as the main program, not zzmini. As you can see, dbx stops at the start of mmap every time it is called (only once in this example). Then I use stepi to instruction step (assembler step) until it reaches the actual branch to the system call mmap in kernel/SLIC (4e800420 bctr). One more stepi and the system call kernel takes over (we do not get to see the kernel), and the return is back into our main program (git or zzmini). We can see the address of the mapped file in $r3=0x30000000, meaning the mmap was successful. In your case, I would expect to see $r3=0xffffffff, or -1, indicating the git mmap fails. After seeing 0xffffffff (-1), we can find the address of errno (0x2ff22ff8 here) and dump some memory via dbx 0x2ff22ff8 / X. The first 4 bytes are the errno in hex (see PASE file /usr/include/errno.h), where we would see our ENOMEM.

So, I am assuming git will open a ton of files, which leads us to the eventual mmap fail. So, debug skills 101: make sure you record (on paper) each stop's mmap return address $r3=0xnnnnnnnn, so we can watch as git slowly fills all the memory of a 32-bit process. This MAY be a loooooooooong process, wherein you may want to fall asleep at the wheel, but stiff upper lip grasshopper, you must suffer for your answer. Or, then again, maybe something happens quickly, and we still have an answer.
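If paper gets tedious, the recording step could be sketched roughly as below, assuming the dbx `registers` output is pasted or captured into a text log. The helper names are hypothetical and this is not a dbx feature, just a way to pull $r3 out of the text and count how many mmap calls came back as 0xffffffff.

```python
import re

MAP_FAILED_32 = 0xFFFFFFFF  # (void *)-1 as it appears in a 32-bit register


def r3_from_registers(text):
    """Pull the $r3 value out of dbx `registers` output, or None if absent."""
    match = re.search(r"\$r3:0x([0-9a-fA-F]+)", text)
    return int(match.group(1), 16) if match else None


def summarize(r3_values):
    """Split recorded mmap return values into successes and a failure count."""
    ok = [v for v in r3_values if v != MAP_FAILED_32]
    failed = len(r3_values) - len(ok)
    return ok, failed


# First line of a `registers` dump like the one in the zzmini session.
sample = "  $r0:0x00003608  $stkp:0x2ff22b80   $toc:0x00415e7d    $r3:0x30000000"
log = [r3_from_registers(sample), 0xFFFFFFFF]  # second entry is a made-up failure
ok, failed = summarize(log)
print([hex(v) for v in ok], failed)  # ['0x30000000'] 1
```

Watching the successful addresses climb from stop to stop is the same "fills all the memory" check described above, only with less handwriting.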

At this point, with all your good data recorded, maybe we are at an ah-ha moment. Or perhaps, with dbx stuck at ENOMEM, we MAY want to look at STRSST and peek around in the kernel. You will need my help for that ...

Good luck new prince of c code.

#!shell

bash-4.3$ dbx zzmini
Type 'help' for help.
reading symbolic information ...
(dbx) stopi in mmap  <--- set my assembler stop (object code libc.a)
[1] stopi in glink.mmap   
(dbx) cont
[1] stopped in glink.mmap at 0x100006a0
0x100006a0 (mmap)    81820058         lwz   r12,0x58(r2)
(dbx) stepi
stopped in glink.mmap at 0x100006a4
0x100006a4 (mmap+0x4) 90410014         stw   r2,0x14(r1)
(dbx) stepi
stopped in glink.mmap at 0x100006a8
0x100006a8 (mmap+0x8) 800c0000         lwz   r0,0x0(r12)
(dbx) stepi
stopped in glink.mmap at 0x100006ac
0x100006ac (mmap+0xc) 804c0004         lwz   r2,0x4(r12)
(dbx) stepi
stopped in glink.mmap at 0x100006b0
0x100006b0 (mmap+0x10) 7c0903a6       mtctr   r0
(dbx) stepi
stopped in glink.mmap at 0x100006b4
0x100006b4 (mmap+0x14) 4e800420        bctr
(dbx) stepi <--- i am going into kernel/slic (and, you do not get to see that, you, you, user)
stopped in main at 0x10000454 <--- i am back in your program (zzmini or git) 
0x10000454 (main+0xd4) 80410014         lwz   r2,0x14(r1)
(dbx) registers
  $r0:0x00003608  $stkp:0x2ff22b80   $toc:0x00415e7d    $r3:0x30000000  <- $r3 where mapped file, 0xffffffff fail
  $r4:0x00000031    $r5:0x00000000    $r6:0x00000000    $r7:0x00000008  
  $r8:0x80556000    $r9:0x2200000a   $r10:0x051f8000   $r11:0x051f8f30  
 $r12:0x0000f032   $r13:0xdeadbeef   $r14:0x00000001   $r15:0x2ff22ce0  
 $r16:0x2ff22ce8   $r17:0xdeadbeef   $r18:0xdeadbeef   $r19:0xf0174f6c  
 $r20:0xdeadbeef   $r21:0xdeadbeef   $r22:0xdeadbeef   $r23:0xdeadbeef  
 $r24:0xdeadbeef   $r25:0xdeadbeef   $r26:0xdeadbeef   $r27:0x0000000a  
 $r28:0xf010cb70   $r29:0xd010cd80   $r30:0x00000003   $r31:0x100007b8  
 $iar:0x10000454   $msr:0x0002f032    $cr:0x2200000a  $link:0x10000454  
 $ctr:0x30000000   $xer:0x34000000    $mq:0x00000000  
          Condition status = 0:e 1:e 7:le 
        [unset $noflregs to view floating point registers]
        [unset $novregs to view vector registers]
in main at 0x10000454
0x10000454 (main+0xd4) 80410014         lwz   r2,0x14(r1)
(dbx) print &errno
0x2ff22ff8 
(dbx) 0x2ff22ff8 / 10X <-- dump errno (first four hex bytes are the error number, see /usr/include/errno.h)
0x2ff22ff8:  00000000 2ff22ff8 00000000 00000000
0x2ff23008:  00000000 00000000 00000000 00000000
0x2ff23018:  00000000 00000000
(dbx) quit
bash-4.3$
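
Reading the errno word out of a dump like the one above can be sketched as follows. This is only an illustration: the function names are invented, and the number-to-name lookup uses the errno table of the machine running the script, so the name should be confirmed against /usr/include/errno.h in PASE.

```python
import errno
import os


def first_word(dump_line):
    """Return the first 32-bit word of a dbx 'addr:  w0 w1 ...' dump line."""
    _, _, words = dump_line.partition(":")
    return int(words.split()[0], 16)


def describe_errno(value):
    """Name an errno value using this host's table (PASE's may differ)."""
    if value == 0:
        return "0 (no error recorded)"
    name = errno.errorcode.get(value, "unknown")
    return f"{value} ({name}: {os.strerror(value)})"


print(describe_errno(first_word("0x2ff22ff8:  00000000 2ff22ff8 00000000 00000000")))
# prints: 0 (no error recorded)
```

In the zzmini run the word is zero because the mmap succeeded; a failing git run would be expected to show a non-zero first word there.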

abmusse commented 8 years ago

Original comment by Aaron Bartell (Bitbucket: aaronbartell, GitHub: aaronbartell).


"ping me Monday."

$ ping ranger.rochester.ibm.com

What would you recommend for next steps?

abmusse commented 8 years ago

Original comment by Tony Cairns (Bitbucket: rangercairns, GitHub: rangercairns).


Ok, something bad in git-y-up requires debug. I am going home for the weekend, ping me Monday.

abmusse commented 8 years ago

Original comment by Aaron Bartell (Bitbucket: aaronbartell, GitHub: aaronbartell).


Same error with export LDR_CNTRL=MAXDATA=0@DSA

abmusse commented 8 years ago

Original comment by Tony Cairns (Bitbucket: rangercairns, GitHub: rangercairns).


Try this setting ...

#!shell

$ export LDR_CNTRL=MAXDATA=0@DSA
$ git (operation that fails)

abmusse commented 8 years ago

Original comment by Aaron Bartell (Bitbucket: aaronbartell, GitHub: aaronbartell).


Tried both heap sizes and neither fixed the issue.

"All above fails, we need to debug, easy way, you need my help, and, access to STRSST to look into the kernel (Muah-ha-ha-ha-ha!)."

How should we proceed if I can only reproduce the error on the customer's machine? Should I follow these autopsy/STRSST steps you've authored on YiPs? This is a production machine.

abmusse commented 8 years ago

Original comment by Tony Cairns (Bitbucket: rangercairns, GitHub: rangercairns).


BTW -- the related link refers to a different version of git out of memory. Therefore, ignoring our message about mmap, maybe git-y-up needs more heap: LDR_CNTRL=MAXDATA=0x60000000 (try less heap first). Anyway, LDR_CNTRL has many settings, including DSA settings to reclaim the shared object segments away from shared libx.a/thing.so (D, F segments). Aka, LDR_CNTRL experimentation may work for you, before jumping to the git 64-bit train.
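
The experimentation could be scripted along these lines, as a rough sketch. The list only contains LDR_CNTRL values already mentioned in this thread, the helper names are hypothetical, and it assumes a Python 3 with `subprocess.run` is available in the same environment as the failing command.

```python
import os
import subprocess
import sys

# Values mentioned in this thread, smallest heap first, then the DSA form.
CANDIDATES = ["MAXDATA=0x10000000", "MAXDATA=0x20000000",
              "MAXDATA=0x60000000", "MAXDATA=0@DSA"]


def run_with_ldr_cntrl(command, setting):
    """Run `command` with LDR_CNTRL set to `setting`; return (exit code, stderr)."""
    env = dict(os.environ, LDR_CNTRL=setting)
    result = subprocess.run(command, env=env, capture_output=True, text=True)
    return result.returncode, result.stderr.strip()


def try_settings(command, settings):
    """Try each setting in order; return the first that exits 0, else None."""
    for setting in settings:
        code, err = run_with_ldr_cntrl(command, setting)
        print(f"LDR_CNTRL={setting}: exit {code} {err}")
        if code == 0:
            return setting
    return None


# Example call, to be run inside the chroot where git fails:
# try_settings(["git", "status"], CANDIDATES)
```

The variable is set only for the child process, so the surrounding shell is left untouched between attempts.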

abmusse commented 8 years ago

Original comment by Tony Cairns (Bitbucket: rangercairns, GitHub: rangercairns).


We risk public debugging by water cooler ... cool ... how about them republicans ...

So, taken at face value, git tripped over itself ... see this related link

Possibly git 32-bit is using the HUGE memory model and has allocated too much to heap space, as mmap/shmat and heap/alloc/new share the same 32-bit address space. You can dial down the HUGE heap by using environment variable LDR_CNTRL=MAXDATA=0x20000000 (or maybe 0x10000000).
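
To put rough numbers on the trade-off: the arithmetic below assumes PASE follows the AIX 32-bit layout of sixteen 256 MB segments, with MAXDATA given in bytes. That layout is an assumption here, not something stated in this thread, and the function name is made up for illustration.

```python
SEGMENT_BYTES = 0x10000000          # 256 MB, one segment
ADDRESS_SPACE_BYTES = 0x100000000   # 4 GB, everything a 32-bit pointer can reach


def heap_segments(maxdata):
    """Number of 256 MB segments a MAXDATA value asks for (rounded up)."""
    return -(-maxdata // SEGMENT_BYTES)


total = ADDRESS_SPACE_BYTES // SEGMENT_BYTES
for value in (0x10000000, 0x20000000, 0x60000000):
    segs = heap_segments(value)
    print(f"MAXDATA={value:#010x}: {segs} of {total} segments ({segs * 256} MB) for heap")
```

Under that assumption, every segment handed to the heap is one fewer left for everything else in the process, mmap and shmat regions included, which is why a smaller MAXDATA is worth trying first.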

So, git is a mmap pig (reference link). Another answer, compile git 64 bit, then use env var PASE_MAXSHR64 to let git64 take over the world with mmap (Muah-ha-ha-ha-ha!)

Git, oh git, maybe leaking ... but ... well ... that would mean a git error ... those guys are never wrong (Muah-ha-ha-ha-ha!).

All above fails, we need to debug, easy way, you need my help, and, access to STRSST to look into the kernel (Muah-ha-ha-ha-ha!).