UPPMAX / irods

Project for implementing an iRODS infrastructure on UPPMAX / SciLifeLab

irodsReServer random? segfault #18

Closed jhagberg closed 12 years ago

jhagberg commented 12 years ago

On the u5 test iRODS server, the rule server (irodsReServer) segfaults, seemingly at random or just after a reload of core.re...

Apr 17 12:26:36 u5 kernel: irodsReServer[24725]: segfault at 10 ip 0000000000520fe9 sp 00007fffbd8282a0 error 4 in irodsReServer[400000+1cb000]
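
For reading the kernel line above: on x86, the "error" value is the page-fault error code, a small bit mask. The sketch below decodes its low three bits and computes the offset of the faulting instruction inside the binary. The struct and function names are made up for illustration and are not part of iRODS or any kernel API. Error 4 decodes to a user-mode read of a page that is not mapped, and a fault address as low as 0x10 is typical of reading a struct field through a NULL pointer, although that alone does not prove the cause here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper type, not part of iRODS or the kernel API. */
struct fault_info {
    bool protection; /* bit 0: 1 = protection violation, 0 = page not present */
    bool write;      /* bit 1: 1 = write access, 0 = read access */
    bool user;       /* bit 2: 1 = fault happened in user mode */
};

/* Decode the low three bits of the x86 page-fault error code that the
 * kernel prints as "error N" in a segfault log line. */
static struct fault_info decode_fault(unsigned long error_code)
{
    struct fault_info f;
    f.protection = (error_code & 0x1UL) != 0;
    f.write      = (error_code & 0x2UL) != 0;
    f.user       = (error_code & 0x4UL) != 0;
    return f;
}

/* Offset of the faulting instruction inside the mapped binary, e.g.
 * ip 0x520fe9 with mapping base 0x400000 taken from
 * "irodsReServer[400000+1cb000]". */
static unsigned long ip_offset(unsigned long ip, unsigned long base)
{
    return ip - base;
}
```

Calling decode_fault(4) gives user set and the other two flags clear. The instruction offset can be compared across crashes even when stack addresses differ.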

jhagberg commented 12 years ago

Here is what I got from gdb and the core file:

gdb irodsReServer core.24725 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/irods/iRODS/server/bin/irodsReServer...done.
BFD: Warning: /opt/irods/iRODS/server/bin/core.24725 is truncated: expected core file size >= 233754624, found: 10952.
[New Thread 24725]
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Failed to read a valid object file image from memory.
Core was generated by `irodsReServer'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000520fe9 in createCondIndex (r=Cannot access memory at address 0x7fffbd8282a8
) at /opt/irods/iRODS/server/re/src/index.c:82
82                      Node *ruleNode = rd->node;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64
(gdb)  where
#0  0x0000000000520fe9 in createCondIndex (r=Cannot access memory at address 0x7fffbd8282a8
) at /opt/irods/iRODS/server/re/src/index.c:82
Cannot access memory at address 0x7fffbd828398
(gdb) list
77                  Node *condExp = NULL;
78                  Node *params = NULL;
79  
80                  while(currIndexNode != NULL) {
81                      RuleDesc *rd = getRuleDesc(currIndexNode->ruleIndex);
82                      Node *ruleNode = rd->node;
83                      if(!(
84                              rd->ruleType == RK_REL
85                      )) {
86                          finishIndexNode = currIndexNode;
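
The listing shows rd being dereferenced on line 82 right after the lookup on line 81, with no check in between. Since the core file is truncated, it cannot be confirmed that rd was NULL, but the fault address 0x10 in the kernel log would be consistent with that. A minimal sketch of a guarded lookup follows. The types, the table, and both function names are simplified stand-ins invented for this example, not the real iRODS definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the real iRODS types; the field layout is an
 * assumption made only so the sketch compiles on its own. */
typedef struct node { int dummy; } Node;
typedef struct rule_desc { int id; int ruleType; Node *node; } RuleDesc;

#define MAX_RULES 4
static RuleDesc *rule_table[MAX_RULES]; /* NULL entries model missing rules */

/* Hypothetical stand-in for getRuleDesc(): returns NULL when the index
 * is out of range or the slot is empty. */
static RuleDesc *lookup_rule_desc(int index)
{
    if (index < 0 || index >= MAX_RULES) {
        return NULL;
    }
    return rule_table[index];
}

/* Fetch the rule node for an index without dereferencing a NULL
 * descriptor. Returns NULL if there is no descriptor for the index. */
static Node *rule_node_or_null(int index)
{
    RuleDesc *rd = lookup_rule_desc(index);
    if (rd == NULL) {
        return NULL; /* caller can log and skip instead of crashing */
    }
    return rd->node;
}
```

Whether skipping a missing descriptor is the right behaviour for the real index code is a question for the iRODS developers. The sketch only shows the shape of the check.
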

brainstorm commented 12 years ago

Wow, pointer issues on rule nodes... bad mojo :-/

Does it fail on the same line every time? Can you reproduce the bug with a simple proof of concept (e.g. from the command line)?

This is to rule out memory/hardware problems with our testing server... if it's deterministic we can at least report it to irods-chat.

jhagberg commented 12 years ago

Another segfault in the same binary, from 4 April:

gdb irodsReServer core.26724 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/irods/iRODS/server/bin/irodsReServer...done.
BFD: Warning: /opt/irods/iRODS/server/bin/core.26724 is truncated: expected core file size >= 232472576, found: 10952.

warning: exec file is newer than core file.
[New Thread 26724]
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Failed to read a valid object file image from memory.
Core was generated by `irodsReServer'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000005201d5 in convertResToString (res0=Cannot access memory at address 0x7fffb8ac5318
) at /opt/irods/iRODS/server/re/src/conversion.c:553
553                         snprintf(res + strlen(res), 1024 - strlen(res), "%s=%s;", kvp->keyWord[i],kvp->value[i]);
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64
(gdb) where
#0  0x00000000005201d5 in convertResToString (res0=Cannot access memory at address 0x7fffb8ac5318
) at /opt/irods/iRODS/server/re/src/conversion.c:553
Cannot access memory at address 0x7fffb8ac5ca8
(gdb) list
548                 if(strcmp(type, KeyValPair_MS_T)==0) {
549                     keyValPair_t *kvp = (keyValPair_t *) RES_UNINTER_STRUCT(res0);
550                     snprintf(res, 1024, "KeyValue[%d]:", kvp->len);
551                     int i;
552                     for(i=0;i<kvp->len;i++) {
553                         snprintf(res + strlen(res), 1024 - strlen(res), "%s=%s;", kvp->keyWord[i],kvp->value[i]);
554                     }
555 
556                 } else if (strcmp(type, BUF_LEN_MS_T) == 0 ) {
557                     snprintf(res + strlen(res), 1024 - strlen(res),"%d",*(int*)res0->param->inOutStruct);

jhagberg commented 12 years ago

We also have some segfaults in irodsAgent, but that should probably be a separate issue.

brainstorm commented 12 years ago

Could be a hardware issue. Can you guys please run a memtest or similar tonight?

jhagberg commented 12 years ago

Just before the segfault, this was written to the log:

tail -f /opt/irods/iRODS/server/log/reLog.2012.04.06
Apr 17 12:22:05 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:22:35 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:23:05 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:23:35 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:24:05 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:24:36 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:25:06 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:25:36 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:26:06 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:26:36 pid:24725 NOTICE: chkAndResetRule: reconf file /opt/irods/iRODS/server/config/reConfigs/core.re has been changed. re-initializing

dahlo commented 12 years ago

Yes, very weird. A memtest would be nice, to safely rule out hardware issues.

jhagberg commented 12 years ago

The hardware is old on u5...

I will try to copy all the addresses...

irodsAgent:
26 mar 12.27 core.24119 0x7fff12263428
29 mar 07.33 core.24874 0x7fff8930f998
29 mar 07.33 core.24885 0x7fff40dd4f68
29 mar 07.49 core.25169 0x7fff4818a758
29 mar 07.52 core.25187 0x7fff9be948a8

irodsReServer:
4 apr 19.09 core.26724 0x7fffb8ac5ca8
17 apr 12.26 core.24725 0x7fffbd828398

No two addresses are exactly the same.

brainstorm commented 12 years ago

Of course they're not the same: Linux has used ASLR for many years now:

http://en.wikipedia.org/wiki/Address_space_layout_randomization

Please run a memtest when possible, that should rule out the hw issues as dahlo pointed out.
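
To check whether address randomization is actually enabled on a given Linux host, the kernel exposes the setting in /proc/sys/kernel/randomize_va_space. A minimal sketch of reading it from C is below. The function name is made up for this example, and the path is passed in so that the caller decides which file to read.

```c
#include <assert.h>
#include <stdio.h>

/* Read the kernel's ASLR setting from the given path (normally
 * /proc/sys/kernel/randomize_va_space). Returns 0 when randomization is
 * off, 1 or 2 when it is enabled, and -1 if the file cannot be read or
 * does not start with an integer. */
static int read_aslr_mode(const char *path)
{
    FILE *fp = fopen(path, "r");
    int mode = -1;

    if (fp == NULL) {
        return -1;
    }
    if (fscanf(fp, "%d", &mode) != 1) {
        mode = -1;
    }
    fclose(fp);
    return mode;
}
```

With a non-zero value, stack addresses such as the 0x7fff... ones listed above are expected to differ from run to run, so they cannot be used to tell whether two crashes share a cause.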

jhagberg commented 12 years ago

Oops, that's true. Good that you are back from vacation!

jhagberg commented 12 years ago

@samuell Have you had a chance to run memtest on u5?

brainstorm commented 12 years ago

Guys, looks like we're not alone here:

http://groups.google.com/group/irod-chat/browse_thread/thread/f2756284fe29b874#

brainstorm commented 12 years ago

Issue being handled by developers in the main mailing list (iRODS Chat), closing...

jhagberg commented 12 years ago

This still looks to be an issue.

samuell commented 12 years ago

@jhagberg or @pontus May I assign this to one of you?

samuell commented 12 years ago

How can we handle itrim not completing because of 0-byte files? ... Can we do a workaround for this until we get a proper fix?

pontus commented 12 years ago

https://github.com/UPPMAX/irods/issues/18#issuecomment-8519919: Sure.

IIUC, things work fine as long as you're not using delayed rules, so an irule run from crontab should work.

pontus commented 12 years ago

The init script I copied to start at boot had a ulimit -c 81920 to limit core file size. I removed it, so we'll hopefully get complete core files in the future.
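
For reference, the shell's ulimit -c corresponds to the RLIMIT_CORE resource limit, which a process can also inspect and raise itself through getrlimit and setrlimit. The sketch below raises the soft limit to the hard limit. Both function names are invented for this example, and nothing here is taken from the iRODS init script or server code.

```c
#include <assert.h>
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit for the current
 * process; child processes started afterwards inherit it. Returns 0 on
 * success and -1 on failure (errno is set by getrlimit/setrlimit). */
static int raise_core_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        return -1;
    }
    rl.rlim_cur = rl.rlim_max; /* the soft limit may go up to the hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}

/* Report whether the soft core limit is unlimited (1), limited (0),
 * or could not be read (-1). */
static int core_limit_is_unlimited(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        return -1;
    }
    return rl.rlim_cur == RLIM_INFINITY ? 1 : 0;
}
```

A limit that is smaller than the process image leads to truncated cores like the ones gdb warned about earlier in this thread. Whether a core is written at all also depends on other system settings, so removing the limit is a necessary step rather than a guarantee.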

samuell commented 12 years ago

True, or force the -purgec flag, so that the cache is purged immediately and the user gets to decide how to handle the error...

samuell commented 12 years ago

Let's keep the issue in the milestone at least until we have implemented a workaround.

samuell commented 12 years ago

Ok, can we add an irule command to the crontab? @pontus, can you fix that? ... Otherwise, if I get the command to run from @jhagberg, I can add it.

jhagberg commented 12 years ago

Better to restart reServer with cron and report problems and findings to iRODS chat if we hit more segfaults.

samuell commented 12 years ago

(Answer to @brainstorm in #49): Yes, and any debugging of this issue is welcome, I guess :) ... but we probably will work around it for the "in production" milestone.

brainstorm commented 12 years ago

Ok, core files I can "gdb -c" against are welcome then, it seems hard to reproduce outside your env :-S

samuell commented 12 years ago

@brainstorm Yes, that's true

pontus commented 12 years ago

2012/9/13 Samuel Lampa notifications@github.com

Ok, can we add an irule command to the crontab? @pontus do you fix? ... or else if I get the command that should be done from @jhagberg I can add it.

Crontab job added to run trimming rule nightly.

samuell commented 12 years ago

@pontus Great! ... then moving the issue to the next milestone, for hopefully a full fix.

samuell commented 12 years ago

Someone suspected that it might be related to changing core.re on a running server. Workaround would be to always do a restart after a change in core.re or change it only on a closed server.
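
One way to support the restart-after-change workaround is a small check that compares the modification time of core.re with the time the server was last started. The helper below is a hypothetical sketch of that comparison using stat. How the start time is recorded, and how the restart itself is triggered, are left open.

```c
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/* Returns 1 if the file at path was modified after `since`, 0 if it was
 * not, and -1 if the file cannot be examined (for example, if it does
 * not exist). */
static int modified_since(const char *path, time_t since)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        return -1;
    }
    return st.st_mtime > since ? 1 : 0;
}
```

A cron job or init script could call something like this with the path to core.re and the recorded start time, and restart the rule server only when it returns 1.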

samuell commented 12 years ago

The workaround should be good enough, so closing.

brainstorm commented 12 years ago

Can I get at least one core file before implementing that workaround?