Closed jhagberg closed 12 years ago
Here is what I got from gdb and the core file.
gdb irodsReServer core.24725
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/irods/iRODS/server/bin/irodsReServer...done.
BFD: Warning: /opt/irods/iRODS/server/bin/core.24725 is truncated: expected core file size >= 233754624, found: 10952.
[New Thread 24725]
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Failed to read a valid object file image from memory.
Core was generated by `irodsReServer'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000520fe9 in createCondIndex (r=Cannot access memory at address 0x7fffbd8282a8
) at /opt/irods/iRODS/server/re/src/index.c:82
82 Node *ruleNode = rd->node;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64
(gdb) where
#0 0x0000000000520fe9 in createCondIndex (r=Cannot access memory at address 0x7fffbd8282a8
) at /opt/irods/iRODS/server/re/src/index.c:82
Cannot access memory at address 0x7fffbd828398
(gdb) list
77 Node *condExp = NULL;
78 Node *params = NULL;
79
80 while(currIndexNode != NULL) {
81 RuleDesc *rd = getRuleDesc(currIndexNode->ruleIndex);
82 Node *ruleNode = rd->node;
83 if(!(
84 rd->ruleType == RK_REL
85 )) {
86 finishIndexNode = currIndexNode;
Wow, pointer issues on rule nodes... bad mojo :-/
Does it fail on the same line every time? Can you reproduce the bug with a simple proof of concept (i.e. from the command line)?
This is to rule out memory/hardware problems with our testing server... if it's deterministic we can at least report it to irods-chat.
Another segfault in the same binary, from 4 April:
gdb irodsReServer core.26724
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/irods/iRODS/server/bin/irodsReServer...done.
BFD: Warning: /opt/irods/iRODS/server/bin/core.26724 is truncated: expected core file size >= 232472576, found: 10952.
warning: exec file is newer than core file.
[New Thread 26724]
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Failed to read a valid object file image from memory.
Core was generated by `irodsReServer'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000000005201d5 in convertResToString (res0=Cannot access memory at address 0x7fffb8ac5318
) at /opt/irods/iRODS/server/re/src/conversion.c:553
553 snprintf(res + strlen(res), 1024 - strlen(res), "%s=%s;", kvp->keyWord[i],kvp->value[i]);
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64
(gdb) where
#0 0x00000000005201d5 in convertResToString (res0=Cannot access memory at address 0x7fffb8ac5318
) at /opt/irods/iRODS/server/re/src/conversion.c:553
Cannot access memory at address 0x7fffb8ac5ca8
(gdb) list
548 if(strcmp(type, KeyValPair_MS_T)==0) {
549 keyValPair_t *kvp = (keyValPair_t *) RES_UNINTER_STRUCT(res0);
550 snprintf(res, 1024, "KeyValue[%d]:", kvp->len);
551 int i;
552 for(i=0;i<kvp->len;i++) {
553 snprintf(res + strlen(res), 1024 - strlen(res), "%s=%s;", kvp->keyWord[i],kvp->value[i]);
554 }
555
556 } else if (strcmp(type, BUF_LEN_MS_T) == 0 ) {
557 snprintf(res + strlen(res), 1024 - strlen(res),"%d",*(int*)res0->param->inOutStruct);
Then we also have some segfaults in irodsAgent, but that should maybe be a separate issue.
Could be a hardware issue. Can you guys please run a memtest or similar tonight?
Just before the segfault this was written to the log
tail -f /opt/irods/iRODS/server/log/reLog.2012.04.06
Apr 17 12:22:05 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:22:35 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:23:05 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:23:35 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:24:05 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:24:36 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:25:06 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:25:36 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:26:06 pid:24725 NOTICE: reServerMain: checking the queue for jobs
Apr 17 12:26:36 pid:24725 NOTICE: chkAndResetRule: reconf file /opt/irods/iRODS/server/config/reConfigs/core.re has been changed. re-initializing
Yes, very weird. A memtest would be nice, to safely rule out hw issues.
The hardware on u5 is old...
I will try to copy all the addresses...
Binary         Date    Time   Core file   Address
irodsAgent     26 mar  12.27  core.24119  0x7fff12263428
irodsAgent     29 mar  07.33  core.24874  0x7fff8930f998
irodsAgent     29 mar  07.33  core.24885  0x7fff40dd4f68
irodsAgent     29 mar  07.49  core.25169  0x7fff4818a758
irodsAgent     29 mar  07.52  core.25187  0x7fff9be948a8
irodsReServer  4 apr   19.09  core.26724  0x7fffb8ac5ca8
irodsReServer  17 apr  12.26  core.24725  0x7fffbd828398
No address is exactly the same.
Of course they're not the same. Linux has used ASLR for many years:
http://en.wikipedia.org/wiki/Address_space_layout_randomization
Please run a memtest when possible, that should rule out the hw issues as dahlo pointed out.
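Whether ASLR is active on a given host can be checked from the kernel setting. The following is a minimal sketch, assuming a Linux host that exposes /proc/sys/kernel/randomize_va_space; the helper name aslr_mode is made up for illustration.

```shell
# Map the value of /proc/sys/kernel/randomize_va_space to a description.
# aslr_mode is a hypothetical helper name, not part of any tool.
aslr_mode() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "stack, mmap base and vDSO randomized" ;;
        2) echo "stack, mmap base, vDSO and heap randomized" ;;
        *) echo "unknown" ;;
    esac
}

# Report the setting on this host, if the file is readable.
if [ -r /proc/sys/kernel/randomize_va_space ]; then
    aslr_mode "$(cat /proc/sys/kernel/randomize_va_space)"
fi
```

With any non-zero setting, the 0x7fff... stack addresses in the core dumps above are expected to differ from run to run, so they cannot be compared between crashes.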
Oops, that's true. Good that you are back from vacation!
@samuell Have you had a chance to run memtest on u5?
Guys, looks like we're not alone here:
http://groups.google.com/group/irod-chat/browse_thread/thread/f2756284fe29b874#
Issue being handled by developers in the main mailing list (iRODS Chat), closing...
This still looks to be an issue.
@jhagberg or @pontus, can I assign this to one of you?
How can we handle itrim not completing because of 0-byte files? ... can we do a workaround for this until we get a proper fix?
https://github.com/UPPMAX/irods/issues/18#issuecomment-8519919: Sure.
IIUC, things work fine as long as you're not using delayed rules, so an irule from crontab should work fine.
The init script I copied to start iRODS at boot had a ulimit -c 81920 to limit the size of core files. I removed it, so we'll hopefully get better core files in the future.
True, or force the -purgec flag, so that the cache is purged immediately and the user gets to decide what to do about the error...
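To confirm that the limit is really gone in the environment that starts the server, the value reported by ulimit -c can be classified. This is a minimal sketch; describe_core_limit is a hypothetical helper name, and the size unit of the number depends on the shell (bash normally counts 1024-byte increments).

```shell
# Classify a core-size limit as reported by `ulimit -c`.
# describe_core_limit is a hypothetical helper name.
describe_core_limit() {
    case "$1" in
        unlimited)   echo "unlimited: full core files can be written" ;;
        0)           echo "disabled: no core files will be written" ;;
        *[!0-9]*|"") echo "unrecognized value" ;;
        *)           echo "capped at $1 blocks: large cores may be truncated" ;;
    esac
}

# Show the limit that applies to the current shell.
describe_core_limit "$(ulimit -c)"
```

If the unit is 1024 bytes, 81920 allows about 83886080 bytes, which is less than the 233754624 bytes gdb expected for core.24725, so a truncated core is what one would expect with that limit.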
Let's keep the issue in the milestone at least until we have implemented a workaround.
Ok, can we add an irule command to the crontab? @pontus, will you fix it? ... or else, if I get the command that should be run from @jhagberg, I can add it.
Better to restart reServer with cron and report problems and findings to iRODS chat if we hit more segfaults.
(Answer to @brainstorm in #49): Yes, and any debugging of this issue is welcome, I guess :) ... but we probably will work around it for the "in production" milestone.
Ok, core files I can "gdb -c" against are welcome then, it seems hard to reproduce outside your env :-S
@brainstorm Yes, that's true
Crontab job added to run trimming rule nightly.
@pontus Great! ... then moving the issue to the next milestone, for hopefully a full fix.
Someone suspected that it might be related to changing core.re on a running server. A workaround would be to always restart after a change to core.re, or to change it only while the server is stopped.
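One way to make the restart-after-change rule hard to forget is to detect the change from a script. The sketch below only shows the detection part; the function names and the stamp file location are assumptions, and the actual restart command is left out because it depends on the installation.

```shell
# Return 0 (true) if the file's checksum differs from the one in the stamp
# file, or if no stamp has been recorded yet. Names are hypothetical.
config_changed() {
    file=$1; stamp=$2
    [ -f "$stamp" ] || return 0
    [ "$(cksum < "$file")" != "$(cat "$stamp")" ]
}

# Record the current checksum so the next check has a baseline.
record_config() {
    cksum < "$1" > "$2"
}

# Intended use (the stamp path is an assumption):
#   re=/opt/irods/iRODS/server/config/reConfigs/core.re
#   if config_changed "$re" /var/tmp/core.re.cksum; then
#       ... restart the iRODS server here ...
#       record_config "$re" /var/tmp/core.re.cksum
#   fi
```

A checksum is used instead of the modification time so that touching the file without changing its contents does not trigger a restart.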
The workaround should be good enough, so closing.
Can I get at least one core file before implementing that workaround?
On the u5 test iRODS: the rule server segfault looks random, or happens just after a reload of core.re...
Apr 17 12:26:36 u5 kernel: irodsReServer[24725]: segfault at 10 ip 0000000000520fe9 sp 00007fffbd8282a0 error 4 in irodsReServer[400000+1cb000]
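The fields of the kernel line above can be decoded with a little arithmetic. This is a rough sketch with made-up helper names (ip_offset, decode_fault); it assumes the usual x86 meaning of the low three bits of the page-fault error code.

```shell
# Offset of an instruction pointer within its mapping, printed in hex.
ip_offset() {
    printf '%x\n' $(( $1 - $2 ))
}

# Decode the low three bits of an x86 page-fault error code:
# bit 0 = protection violation (else page not present),
# bit 1 = write (else read), bit 2 = user mode (else kernel mode).
decode_fault() {
    code=$1
    if [ $(( $code & 1 )) -ne 0 ]; then p="protection violation"; else p="page not present"; fi
    if [ $(( $code & 2 )) -ne 0 ]; then w="write"; else w="read"; fi
    if [ $(( $code & 4 )) -ne 0 ]; then u="user mode"; else u="kernel mode"; fi
    echo "$p, $w, $u"
}

ip_offset 0x520fe9 0x400000   # ip minus the mapping base from irodsReServer[400000+1cb000]
decode_fault 4                # the "error 4" field
```

The ip matches frame #0 in the gdb session for core.24725 (index.c:82). The faulting address is 0x10, a user-mode read very close to NULL, which would be consistent with rd being NULL when rd->node is read, although the truncated core cannot confirm that.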