Closed jrose84 closed 7 years ago
Did you try clearing the maintenance state before trying to restart it?
svcadm clear lwsmd
svcadm enable lwsmd
Yes, we've tried clearing, disabling/enabling, refresh/restart.
The reason I ask is because your steps above don't show it being cleared and the state is from Sep_01.
So it would be useful to have accurate steps listed for the problem.
Dropping into maintenance indicates that lwsmd may be crashing. Do you have any core files or indications of errors in your logs?
As this is a community project we'll probably need to see this reproduced in the latest version. Have you tried this on the 8.5.4 release?
lwsmd is crashing and not sure how to proceed
bash-3.2# tail -f /var/svc/log/network-lwsmd\:default.log
[ Sep 1 12:27:37 Leaving maintenance because clear requested. ]
[ Sep 1 12:27:37 Enabled. ]
[ Sep 1 12:27:37 Executing start method ("/opt/pbis/sbin/lwsmd --start-as-daemon") ]
Error: ERROR_GEN_FAILURE (31)
[ Sep 1 12:27:44 Method "start" exited with status 1 ]
[ Sep 4 10:21:14 Leaving maintenance because clear requested. ]
[ Sep 4 10:21:14 Enabled. ]
[ Sep 4 10:21:14 Executing start method ("/opt/pbis/sbin/lwsmd --start-as-daemon") ]
Error: ERROR_GEN_FAILURE (31)
[ Sep 4 10:21:15 Method "start" exited with status 1 ]
[ Sep 4 10:25:29 Leaving maintenance because clear requested. ]
[ Sep 4 10:25:29 Enabled. ]
[ Sep 4 10:25:29 Executing start method ("/opt/pbis/sbin/lwsmd --start-as-daemon") ]
Error: ERROR_GEN_FAILURE (31)
[ Sep 4 10:25:30 Method "start" exited with status 1 ]
This is used in a production environment and we cannot upgrade at the moment.
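A minimal sketch for summarising the SMF log shown above: each failed attempt appears as an `Error:` line followed by a `Method "start" exited with status` line, so pairing the two shows when the daemon failed and with which error. The function name `smf_start_failures` is hypothetical (it is not part of SMF or PBIS), and the sketch assumes the log has one entry per line, as SMF writes it.

```shell
# Print one line per failed start attempt: timestamp plus the preceding error.
# Usage: smf_start_failures /var/svc/log/network-lwsmd:default.log
smf_start_failures() {
    awk '
        # Remember the most recent error message printed by the start method.
        /^Error:/ { last = $0 }
        # SMF entries look like: [ Sep 4 10:21:15 Method "start" exited with status 1 ]
        /Method "start" exited with status/ {
            print $2, $3, $4 ": " last
            last = ""
        }
    ' "$1"
}
```

Run against the log above, this should list every failed attempt with the same `ERROR_GEN_FAILURE (31)` message, which suggests the failure is consistent rather than intermittent.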
You could try running /opt/pbis/sbin/lwsmd interactively as root and see if that logs anything useful.
20170904153433:ALWAYS: Logging started
20170904153433:INFO: Likewise Service Manager starting up
20170904153433:INFO:lwsm-ipc: Listener started
20170904153433:VERBOSE: Starting IPC server
20170904153433:INFO:lwsm-ipc: Listening on endpoint /var/lib/pbis/.lwsm
20170904153433:INFO:lwsm-ipc: Listener started
20170904153433:INFO:lwsm-ipc: Listening on endpoint /var/lib/pbis/.lwsc
.....
bash-3.2# /opt/pbis/sbin/lwsmd
20170904103542:ALWAYS: Logging started
20170904103542:INFO: Likewise Service Manager starting up
20170904103542:INFO:lwsm-ipc: Listener started
20170904103542:VERBOSE: Starting IPC server
20170904103542:INFO:lwsm-ipc: Listening on endpoint /var/lib/pbis/.lwsm
20170904103542:INFO:lwsm-ipc: Listener started
20170904103542:INFO:lwsm-ipc: Listening on endpoint /var/lib/pbis/.lwsc
20170904103542:INFO:lwsm-ipc: Listener started
20170904103542:VERBOSE: Bootstrapping
20170904103542:INFO: Starting service: lwreg
20170904103542:VERBOSE: Populating service table
No errors? Does PBIS work while running it in this way?
$ pbis status
$ sudo domainjoin-cli query
etc.
bash-3.2# pbis status
Failed to query status from LSA service. Error code 2 (ERROR_FILE_NOT_FOUND).
bash-3.2# domainjoin-cli query
Error: ERROR_FILE_NOT_FOUND [code 0x00000002]
Sorry this isn't making much sense.
You run it interactively as root and get no errors and it continues running? Was the output you listed above from the /opt/pbis/sbin/lwsmd just a subset?
But still you get errors when you try to access pbis commands?
PBIS doesn't work; no one can log in with AD credentials. I am trying to troubleshoot the issue. I have provided all the information requested. Not sure where to go from here.
I am trying to help.
You show above:
[ Sep 4 10:21:14 Executing start method ("/opt/pbis/sbin/lwsmd --start-as-daemon") ]
Error: ERROR_GEN_FAILURE (31)
So I suggested you try running /opt/pbis/sbin/lwsmd manually to track down the ERROR_GEN_FAILURE.
You do and report there are no errors running it manually.
At this point if no ERROR_GEN_FAILURE errors are coming from /opt/pbis/sbin/lwsmd the application should be running correctly. But you say it is not.
I have asked for clarification on this and you have not provided any.
If you are unhappy with the level of support available for a free community product then I suggest you should consider paying for the enterprise edition.
@RBoulton-BT it isn't a matter of happy or unhappy. It's a matter of providing you with the details as I'm aware of them and looking for guidance on how to resolve this issue. It is my understanding that using PBIS on Solaris 10 non-global zones is a supported configuration. Since the issue has been reopened, I am guessing further troubleshooting can be done?
If you want to answer my previous question then I'm happy to continue troubleshooting.
Could you confirm what you are seeing when you run lwsmd at the command prompt? Does it display any other messages after 20170904103542:VERBOSE: Populating service table?
Does it stop and drop back to the command prompt or does it just hang there?
If you can provide more information this would also be useful. Has this been working and just stopped working? I assume from your comments about not being able to upgrade it must have been working previously?
Basically, the more information you can give us, even if you think it unnecessary, the better we can identify what is going on.
Q: You run it interactively as root and get no errors and it continues running?
A: I get no errors, but it doesn't continue running; for example...
# ps -ef | grep lw
root 28581 20228 0 09:06:56 pts/30 0:00 grep lw
# /opt/pbis/sbin/lwsmd --start-as-daemon
Error: ERROR_GEN_FAILURE (31)
# ps -ef | grep lw
root 14859 20228 0 09:07:26 pts/30 0:00 grep lw
# /opt/pbis/sbin/lwsmd
20170905090730:ALWAYS: Logging started
20170905090730:INFO: Likewise Service Manager starting up
20170905090730:INFO:lwsm-ipc: Listener started
20170905090730:VERBOSE: Starting IPC server
20170905090730:INFO:lwsm-ipc: Listening on endpoint /var/lib/pbis/.lwsm
20170905090730:INFO:lwsm-ipc: Listener started
20170905090730:INFO:lwsm-ipc: Listening on endpoint /var/lib/pbis/.lwsc
20170905090730:INFO:lwsm-ipc: Listener started
20170905090730:VERBOSE: Bootstrapping
20170905090730:INFO: Starting service: lwreg
20170905090730:VERBOSE: Populating service table
# ps -ef | grep lw
root 19804 20228 0 09:07:35 pts/30 0:00 grep lw
Q: Was the output you listed above from the /opt/pbis/sbin/lwsmd just a subset?
A: No, that was the full output.
Q: But still you get errors when you try to access pbis commands?
A: Yes, single sign-on does not work and error messages occur when running pbis commands.
Does that sufficiently answer your questions? If not, let me know what else you need for clarification.
This is going to be an iterative process.
Does lwsmd produce a core file?
You didn't mention in your reply whether this has ever worked on this environment?
Yes, this has worked previously. We implemented PBIS this past April and it worked up until Sept 1. Yes, I see a core file generated after running the command. I ran pmap and pstack against it and don't see anything conclusive. Suggestions?
When you say nothing conclusive, do you mean there are no symbols, just ??????
I'm guessing the answer is going to be that age-old "nothing changed", but if it was working and stopped, usually something changed. Any idea what might have changed on Sep 1st? Patches applied, etc.?
core 'core' of 10411: prstat -p 1 1
ffffffff7dc39cdc strtol (0, ffffffff7ffff898, a, 2400, 0, ffffffff7f108334) + 44
0000000100008108 Atoi (0, 1000090a8, 0, 1, 1, 0) + 18
00000001000057c0 ???????? (10010acc0, 100009, 100000, 100116bd0, 100009000, 100009)
0000000100005ee8 main (10010a, ffffffff7ffffac8, 5, 10010ae20, 100005, 4000) + 3fc
00000001000024bc _start (5, ffffffff7ffffac8, ffffffff7ffffaf8, 0, 103d9c, ffffffff7ffffac8) + 17c
core 'core' of 10411: prstat -p 1 1
0000000100000000 40K r-x-- /usr/bin/sparcv9/prstat
000000010010A000 8K rwx-- /usr/bin/sparcv9/prstat
000000010010C000 3024K rwx-- [ heap ]
FFFFFFFF7D900000 8K r-x-- /platform/sun4v/lib/sparcv9/libc_psr.so.1
FFFFFFFF7DA00000 64K rwx--
FFFFFFFF7DB00000 8K rwx--
FFFFFFFF7DC00000 1216K r-x-- /lib/sparcv9/libc.so.1
FFFFFFFF7DD30000 56K r-x-- /lib/sparcv9/libc.so.1
FFFFFFFF7DE3E000 8K rwx-- /lib/sparcv9/libc.so.1
FFFFFFFF7DE40000 64K rwx-- /lib/sparcv9/libc.so.1
FFFFFFFF7DE50000 8K rwx-- /lib/sparcv9/libc.so.1
FFFFFFFF7E000000 8K rw---
FFFFFFFF7E100000 8K rw---
FFFFFFFF7E200000 8K rwx--
FFFFFFFF7E300000 8K rwx--
FFFFFFFF7E400000 24K rwx--
FFFFFFFF7E500000 8K rwx--
FFFFFFFF7E600000 1216K r-x--
FFFFFFFF7E730000 56K r-x--
FFFFFFFF7E83E000 8K rwx--
FFFFFFFF7E840000 64K rwx--
FFFFFFFF7E850000 8K rwx--
FFFFFFFF7EA00000 8K r-x--
FFFFFFFF7EB02000 8K rwx--
FFFFFFFF7ED00000 8K rw---
FFFFFFFF7EE00000 8K rw---
FFFFFFFF7EF00000 8K rwx--
FFFFFFFF7F000000 8K rwx--
FFFFFFFF7F0E2000 8K r----
FFFFFFFF7F100000 192K r-x-- /lib/sparcv9/ld.so.1
FFFFFFFF7F130000 48K r-x-- /lib/sparcv9/ld.so.1
FFFFFFFF7F23C000 24K rwx-- /lib/sparcv9/ld.so.1
FFFFFFFF7F300000 192K r-x--
FFFFFFFF7F330000 48K r-x--
FFFFFFFF7F43C000 24K rwx--
FFFFFFFF7F51C000 8K r----
FFFFFFFF7F5DE000 8K r----
FFFFFFFF7F600000 48K r-x--
FFFFFFFF7F70C000 8K rwx--
FFFFFFFF7FFF0000 64K rw--- [ stack ]
total 6640K
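Both dumps above start with a header line that names the command which produced the core (here `prstat -p 1 1`). A minimal sketch for pulling that command name out of saved pstack or pmap output, so it can be compared with the daemon under investigation, follows. The function name `core_command` is hypothetical, and the sketch assumes the header has the form shown above with no spaces in the core file path.

```shell
# Print the command name recorded in a pstack/pmap header line such as:
#   core 'core' of 10411: prstat -p 1 1
# Usage: pstack core > pstack.out; core_command pstack.out
core_command() {
    awk '$1 == "core" && $3 == "of" { print $5; exit }' "$1"
}
```

If the name printed is not `lwsmd`, the core file probably came from a different process and a fresh core from the failing daemon would be needed.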
To my knowledge, no patches or OS level changes have been done. However, that doesn't mean there isn't a rogue element involved. The problem is, how to resolve. One other thing I can see is that there appears to be a missing file in /var/lib/pbis
ls -l /var/lib/pbis
total 120
drwx------ 2 root root 5 Sep 1 11:30 db
-rw-r--r-- 1 root root 145 Aug 29 03:48 krb5-affinity.conf
-rw-r--r-- 1 root other 39916 Dec 6 2016 lwconfig.xml
-rw-r--r-- 1 root other 13244 Dec 6 2016 lwreport.xml
drwxr-xr-x 2 root other 3 Apr 13 17:17 rpc
drwxr-x--- 2 root root 2 Sep 1 11:37 syslog-reaper
drwxr-xr-x 2 root root 4 Apr 11 10:13 uninstall
On other servers where PBIS works, I can see there is a krb5cc_lsass.$DOMAIN file, whereas the "problem" server does not have this file.
I assume you originally followed the instructions in the guide Installing the Agent in Solaris Zones.
We're just looking into whether running /opt/pbis/bin/postinstall.sh again is safe and would resolve any issues.
Yes /var/lib/pbis is where the PBIS configuration is stored. If this is missing then it could well explain these issues.
Can you provide the documentation (URL) for installing the agent in Solaris Zones? I want to confirm that for you. As for the missing file in /var/lib/pbis, is there a way to regenerate it?
Thanks. I reviewed the documentation and am fairly certain the answer is no. Keep in mind we've only installed the PBIS agent in non-global zones and have run the PBIS install only from said zone. Not sure if that changes anything (I have to assume not, since PBIS was working up until Sept 1).
The missing file has been restored from backup, but the service is still failing:
bash-3.2# tail -f /var/svc/log/network-lwsmd\:default.log
[ Sep 5 10:35:34 Enabled. ]
[ Sep 5 10:35:35 Executing start method ("/opt/pbis/sbin/lwsmd --start-as-daemon") ]
Error: ERROR_GEN_FAILURE (31)
[ Sep 5 10:35:35 Method "start" exited with status 1 ]
[ Sep 5 10:36:03 Leaving maintenance because clear requested. ]
[ Sep 5 10:36:03 Enabled. ]
[ Sep 5 10:36:03 Executing start method ("/opt/pbis/sbin/lwsmd --start-as-daemon") ]
Error: ERROR_GEN_FAILURE (31)
[ Sep 5 10:36:03 Method "start" exited with status 1 ]
[ Sep 5 10:36:10 Rereading configuration. ]
What type of zone is configured? Whole-Root or sparse?
The behavior looks like a zone that is in a broken, not-joined state. I would recommend clearing the maintenance state, then running:
/opt/pbis/bin/postinstall.sh
Then issuing a new domainjoin.
Whole-root zone. What impact would running the postinstall script have? Since this is a production system, do we need to consider scheduling a maintenance window or is the impact minimal? Will a zone restart be required like the initial install?
Also, since the state cannot be cleared, can I assume you mean disable the service?
Running postinstall should not cause any new issues. If this does not work to get lwsmd started then I would recommend doing a reinstall:
This should get it back to a working state.
On the restart question: the install/domainjoin always recommends a restart because we are changing the NSS providers and some long-running services may not always pick up the change on all platforms. This is a "best practice" and means we have fewer support calls where some long-running services don't see the users and we then have to say reboot.
In your case (and most installations) as this has already been installed and running there should be no requirement for a restart.
Basically we want the service to be stopped but not tagged in Maintenance mode. So I guess to avoid the service trying and failing to start on a clear we may need to disable it.
Executing postinstall.sh yielded errors. Log output: https://pastebin.com/vvR7LtFY
From what I am seeing, it seems to me like we're going to need to do a fresh install, which, to my recollection, required a zone reboot to take effect. Please review the log output and confirm.
Also, on a side note, there are non-production non-global zones where PBIS is broken as well. I am testing the above process there first and noticing the following when I attempt to join to the domain
In syslog...
Sep 5 11:56:19 kwidev30 lsass: [ID 889572 daemon.error] [lsass] Failed to run provider specific request (request code = 12, provider = 'lsa-activedirectory-provider') -> error = 2692, symbol = NERR_SetupNotJoined, client pid = -1
When I execute the join command
bash-3.2# domainjoin-cli join kwitzg.com administrator
Error: LW_ERROR_ERRNO_ELOOP [code 0x00009d02]
My concern is that if the issue cannot be resolved in Dev, it would at minimum not be fixed in prod or, at worst, break production further (we noticed SSH authentication broke entirely on Sept 1 and were forced to restart the zone, which is when we noticed the PBIS agent was broken).
Still not working after latest round of troubleshooting as follows:
Same error as in the previous comment.
I've just searched our bug database and nobody has ever reported that error message. LW_ERROR_ERRNO_ELOOP means we are getting an ELOOP errno which seems to relate to "Too many levels of symbolic links".
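ELOOP can be checked for directly by walking a suspect path one link at a time. The following is a minimal sketch, not a PBIS tool: the function name `trace_symlinks` is hypothetical, and it assumes a `readlink` utility is on the PATH (stock Solaris 10 may not ship one, in which case `ls -l` on each path shows the same targets).

```shell
# Follow a chain of symbolic links hop by hop and report a cycle.
# Prints each hop; returns 0 if the chain ends at a non-link, 1 on a loop.
# Usage: trace_symlinks /etc/krb5.keytab
trace_symlinks() {
    path=$1
    seen=" "
    hops=0
    while [ -L "$path" ]; do
        # A path we have already visited means the links form a cycle.
        case "$seen" in
            *" $path "*) echo "loop detected at $path"; return 1 ;;
        esac
        seen="$seen$path "
        target=$(readlink "$path")
        # Resolve a relative target against the directory holding the link.
        case "$target" in
            /*) ;;
            *) target="$(dirname "$path")/$target" ;;
        esac
        echo "$path -> $target"
        path=$target
        hops=$((hops + 1))
        if [ "$hops" -gt 40 ]; then
            echo "gave up after $hops hops"
            return 1
        fi
    done
    echo "ends at $path"
    return 0
}
```

A healthy chain ends with an `ends at` line naming a regular file; a `loop detected` line is the condition the kernel reports as ELOOP.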
OK - so how do I proceed? As far as I can tell, there are no excessive symbolic links.
What we usually do for this type of problem is to run the /opt/pbis/libexec/pbis-support.pl script with the -dj option to get a debug log of the domainjoin-cli process.
It will prompt you for the various domainjoin-cli parameters and produce a support pack with debug logging enabled.
If you let us have the /tmp/pbis-support
Executed bash-3.2# /opt/pbis/libexec/pbis-support.pl -dj
...
Will advise once script finishes
@RBoulton-BT to confirm I am doing the following:
/opt/pbis/libexec/pbis-support.pl -dj
Enter domain and applicable credentials to join to domain
Let perl script finish and provide compressed tarball to you.
Follow-up question...how shall I submit the tarball?
We have an openproject@beyondtrust.com email address that we can access. If you either attach it to that (or if it's too large a link to it) we should be able to access it.
tarball sent to openproject@beyondtrust.com
So looking at the log for this I believe it may be the default_keytab_name in the /etc/krb5.conf resulting in the ELOOP issue. So could you check what that is set to point to and confirm if the symlinks for that file/dir look ok?
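A minimal sketch for the check requested above: it prints the path that `default_keytab_name` points to so the file can then be inspected with `ls -l`. The function name `krb5_default_keytab` is hypothetical, and the sketch assumes the setting is written on a single line as `default_keytab_name = FILE:/path` (the `FILE:` or `WRFILE:` prefix is optional). It prints nothing if the setting is absent.

```shell
# Print the keytab path configured by default_keytab_name, if any.
# Usage: krb5_default_keytab /etc/krb5.conf
krb5_default_keytab() {
    awk -F= '
        /^[ \t]*default_keytab_name[ \t]*=/ {
            val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim surrounding whitespace
            sub(/^(FILE|WRFILE):/, "", val)    # drop the keytab type prefix
            print val
            exit
        }
    ' "$1"
}
```

For example, `ls -l "$(krb5_default_keytab /etc/krb5.conf)"` would show whether the configured keytab is a regular file or a symlink.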
-bash-3.2$ more /etc/krb5.conf
[libdefaults]
default_tgs_enctypes = AES256-CTS AES128-CTS RC4-HMAC DES-CBC-MD5 DES-CBC-CRC
default_tkt_enctypes = AES256-CTS AES128-CTS RC4-HMAC DES-CBC-MD5 DES-CBC-CRC
preferred_enctypes = AES256-CTS AES128-CTS RC4-HMAC DES-CBC-MD5 DES-CBC-CRC
dns_lookup_kdc = true
pkinit_kdc_hostname =
I see what you're referring to regarding an "infinite" loop...question is why/how did this happen and how to fix it. Suggestions?
It looks like our assumption in the code is that /etc/krb5/krb5.keytab is the real file and we need /etc/krb5.keytab to point to the real file so we create the symlink /etc/krb5.keytab -> /etc/krb5/krb5.keytab.
Unfortunately it looks like in your case you actually had the reverse and we created an infinite loop. I'll add a bug for this so we check correctly for this scenario.
For now I think the easiest solution is to just rm the /etc/krb5/krb5.keytab link and use touch to create an empty file /etc/krb5/krb5.keytab and retry the join.
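The workaround above can be sketched as a small function so it can be tried on a scratch copy first. The name `reset_keytab` is hypothetical, and the `chmod 600` is an added assumption (keytabs are normally readable by root only), not something stated in this thread. The function refuses to touch a path that is not a symlink, so a real keytab is not emptied by mistake.

```shell
# Replace a looping keytab symlink with an empty regular file.
# Usage (as root, after taking a backup): reset_keytab /etc/krb5/krb5.keytab
reset_keytab() {
    keytab=$1
    if [ -L "$keytab" ]; then
        # Remove only the link itself; rm does not follow symlinks.
        rm "$keytab" || return 1
    elif [ -e "$keytab" ]; then
        echo "$keytab is not a symlink; leaving it alone" >&2
        return 1
    fi
    # Create the empty file the join will populate (permissions are an assumption).
    touch "$keytab" && chmod 600 "$keytab"
}
```

After this, /etc/krb5.keytab should resolve to an empty regular file and the join can be retried.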
PBIS is working correctly on the DEV host. I will add this to internal documentation on troubleshooting needed going forward. Thanks for the assistance.
Version: 8.5.2-265
OS/Distro: Solaris 10 Zone
Issue/Impact: single sign-on won't work; related services stuck in maintenance mode
-bash-3.2$ svcs lwsmd
STATE STIME FMRI
maintenance Sep_01 svc:/network/lwsmd:default
Note: replace content with your own. When reporting an issue it's important that we have as much detail as you can provide. The following is a list of commands to check.
bash-3.2# svcs lwsmd
STATE STIME FMRI
maintenance Sep_01 svc:/network/lwsmd:default
bash-3.2# svcadm restart lwsmd
bash-3.2# svcs lwsmd
STATE STIME FMRI
maintenance Sep_01 svc:/network/lwsmd:default
bash-3.2# /opt/pbis/bin/lwsm list
Error: LW_ERROR_ERRNO_ECONNREFUSED (40265)
Error: ERROR_FILE_NOT_FOUND [code 0x00000002]
bash-3.2# pbis status
Failed to query status from LSA service. Error code 2 (ERROR_FILE_NOT_FOUND).
bash-3.2# /opt/pbis/bin/enum-users
Failed to enumerate users. Error code 2 (ERROR_FILE_NOT_FOUND).
Output/Error:
Steps to Reproduce: