BeyondTrust / pbis-open

BeyondTrust AD Bridge Open is an open-source community project sponsored by BeyondTrust Corporation. It is currently archived and will no longer receive updates. If you are interested in an Enterprise version of this project, please see our AD Bridge product.
https://www.beyondtrust.com/privilege-management/active-directory-bridge
Apache License 2.0

PBIS broken and SSO no longer works #55

Closed by jrose84 7 years ago

jrose84 commented 7 years ago

Version: 8.5.2-265
OS/Distro: Solaris 10 Zone
Issue/Impact: single sign-on won't work; related services stuck in maintenance mode

    -bash-3.2$ svcs lwsmd
    STATE          STIME    FMRI
    maintenance    Sep_01   svc:/network/lwsmd:default

Note: replace content with your own

When reporting an issue it's important that we have as much detail as you can provide. The following is a list of commands to check.

  1. systemctl status lwsmd.service

         bash-3.2# svcs lwsmd
         STATE          STIME    FMRI
         maintenance    Sep_01   svc:/network/lwsmd:default
         bash-3.2# svcadm restart lwsmd
         bash-3.2# svcs lwsmd
         STATE          STIME    FMRI
         maintenance    Sep_01   svc:/network/lwsmd:default

  2. /opt/pbis/bin/lwsm list

         bash-3.2# /opt/pbis/bin/lwsm list
         Error: LW_ERROR_ERRNO_ECONNREFUSED (40265)

  3. /opt/pbis/domainjoin-cli query

         bash-3.2# /opt/pbis/domainjoin-cli query
         bash: /opt/pbis/domainjoin-cli: No such file or directory
         You have new mail in /var/mail/root
         bash-3.2# which domainjoin-cli
         /usr/bin/domainjoin-cli
         bash-3.2# domainjoin-cli query

         Error: ERROR_FILE_NOT_FOUND [code 0x00000002]

  4. pbis status

         bash-3.2# pbis status
         Failed to query status from LSA service. Error code 2 (ERROR_FILE_NOT_FOUND).

  5. /opt/pbis/bin/enum-users

         bash-3.2# /opt/pbis/bin/enum-users
         Failed to enumerate users. Error code 2 (ERROR_FILE_NOT_FOUND).

  6. attach logs
    • /opt/pbis/bin/lwsm set-log-target -p lsass - file /tmp/lsass.log
    • /opt/pbis/bin/lwsm set-log-level -p lsass - debug
    • attach log

Output/Error:

Steps to Reproduce:

  1. install command
  2. Domainjoin command
  3. Command that returns issue

RBoulton-BT commented 7 years ago

Did you try clearing the maintenance state before trying to restart it?

    svcadm clear lwsmd
    svcadm enable lwsmd
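The clear-then-enable sequence asked about here can be wrapped in a small script. This is a sketch only: the `svcs -xv` call is an added assumption (it asks SMF to explain why the service is in maintenance), and `DRY_RUN`, `run`, and `reset_lwsmd` are hypothetical names used for illustration, not part of PBIS.

```shell
#!/bin/bash
# Sketch: clear lwsmd out of SMF maintenance and try to start it again.
# DRY_RUN, run and reset_lwsmd are illustrative names, not part of PBIS.

# Print the command instead of executing it when DRY_RUN is non-empty.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reset_lwsmd() {
    run svcs -xv lwsmd       # assumption: ask SMF why the service is in maintenance
    run svcadm clear lwsmd   # clear the maintenance state
    run svcadm enable lwsmd  # make sure the service is enabled
    run svcs lwsmd           # show the resulting state
}
```

Running `DRY_RUN=1 reset_lwsmd` only prints the commands, which makes it easy to paste the exact steps taken into a report. Without `DRY_RUN`, the function needs to run as root on the affected zone.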

jrose84 commented 7 years ago

Yes, we've tried clearing, disabling/enabling, refresh/restart.

RBoulton-BT commented 7 years ago

The reason I ask is because your steps above don't show it being cleared and the state is from Sep_01.

So it would be useful to have accurate steps listed for the problem.

Dropping into maintenance indicates that lwsmd may be crashing. Do you have any core files or indications of errors in your logs?

As this is a community project, we'll probably need to see this reproduced in the latest version. Have you tried this on the 8.5.4 release?

jrose84 commented 7 years ago

lwsmd is crashing and I'm not sure how to proceed.

    bash-3.2# tail -f /var/svc/log/network-lwsmd\:default.log
    [ Sep 1 12:27:37 Leaving maintenance because clear requested. ]
    [ Sep 1 12:27:37 Enabled. ]
    [ Sep 1 12:27:37 Executing start method ("/opt/pbis/sbin/lwsmd --start-as-daemon") ]
    Error: ERROR_GEN_FAILURE (31)
    [ Sep 1 12:27:44 Method "start" exited with status 1 ]
    [ Sep 4 10:21:14 Leaving maintenance because clear requested. ]
    [ Sep 4 10:21:14 Enabled. ]
    [ Sep 4 10:21:14 Executing start method ("/opt/pbis/sbin/lwsmd --start-as-daemon") ]
    Error: ERROR_GEN_FAILURE (31)
    [ Sep 4 10:21:15 Method "start" exited with status 1 ]
    [ Sep 4 10:25:29 Leaving maintenance because clear requested. ]
    [ Sep 4 10:25:29 Enabled. ]
    [ Sep 4 10:25:29 Executing start method ("/opt/pbis/sbin/lwsmd --start-as-daemon") ]
    Error: ERROR_GEN_FAILURE (31)
    [ Sep 4 10:25:30 Method "start" exited with status 1 ]

This is used in a production environment, and we cannot upgrade at the moment.

RBoulton-BT commented 7 years ago

You could try running /opt/pbis/sbin/lwsmd interactively as root and see if that logs anything useful.

    /opt/pbis/sbin/lwsmd
    20170904153433:ALWAYS: Logging started
    20170904153433:INFO: Likewise Service Manager starting up
    20170904153433:INFO:lwsm-ipc: Listener started
    20170904153433:VERBOSE: Starting IPC server
    20170904153433:INFO:lwsm-ipc: Listening on endpoint /var/lib/pbis/.lwsm
    20170904153433:INFO:lwsm-ipc: Listener started
    20170904153433:INFO:lwsm-ipc: Listening on endpoint /var/lib/pbis/.lwsc
    .....

jrose84 commented 7 years ago

    bash-3.2# /opt/pbis/sbin/lwsmd
    20170904103542:ALWAYS: Logging started
    20170904103542:INFO: Likewise Service Manager starting up
    20170904103542:INFO:lwsm-ipc: Listener started
    20170904103542:VERBOSE: Starting IPC server
    20170904103542:INFO:lwsm-ipc: Listening on endpoint /var/lib/pbis/.lwsm
    20170904103542:INFO:lwsm-ipc: Listener started
    20170904103542:INFO:lwsm-ipc: Listening on endpoint /var/lib/pbis/.lwsc
    20170904103542:INFO:lwsm-ipc: Listener started
    20170904103542:VERBOSE: Bootstrapping
    20170904103542:INFO: Starting service: lwreg
    20170904103542:VERBOSE: Populating service table

RBoulton-BT commented 7 years ago

No errors? Does PBIS work while running it in this way?

    $ pbis status
    $ sudo domainjoin-cli query

etc.

jrose84 commented 7 years ago

    bash-3.2# pbis status
    Failed to query status from LSA service. Error code 2 (ERROR_FILE_NOT_FOUND).
    bash-3.2# domainjoin-cli query

    Error: ERROR_FILE_NOT_FOUND [code 0x00000002]

RBoulton-BT commented 7 years ago

Sorry this isn't making much sense.

You run it interactively as root and get no errors and it continues running? Was the output you listed above from the /opt/pbis/sbin/lwsmd just a subset?

But still you get errors when you try to access pbis commands?

jrose84 commented 7 years ago

PBIS doesn't work; no one can log in with AD credentials. I am trying to troubleshoot the issue. I have provided all the information requested. Not sure where to go from here.

RBoulton-BT commented 7 years ago

I am trying to help.

You show above:

    [ Sep 4 10:21:14 Executing start method ("/opt/pbis/sbin/lwsmd --start-as-daemon") ]
    Error: ERROR_GEN_FAILURE (31)

So I suggested you try running /opt/pbis/sbin/lwsmd manually to track down the ERROR_GEN_FAILURE.

You do and report there are no errors running it manually.

At this point if no ERROR_GEN_FAILURE errors are coming from /opt/pbis/sbin/lwsmd the application should be running correctly. But you say it is not.

I have asked for clarification on this and you have not provided any.

RBoulton-BT commented 7 years ago

If you are unhappy with the level of support available for a free community product then I suggest you should consider paying for the enterprise edition.

jrose84 commented 7 years ago

@RBoulton-BT it isn't a matter of happy or unhappy. It's a matter of providing you with the details as I'm aware of them and looking for guidance on how to resolve this issue. It is my understanding that using PBIS on Solaris 10 non-global zones is a supported configuration. Since the issue has been reopened, I am guessing further troubleshooting can be done?

RBoulton-BT commented 7 years ago

If you want to answer my previous question then I'm happy to continue troubleshooting.

Could you confirm what you are seeing when you run lwsmd at the command prompt? Does it display any other messages after 20170904103542:VERBOSE: Populating service table?

Does it stop and drop back to the command prompt or does it just hang there?

If you can provide more information this would also be useful. Has this been working and just stopped working? I assume from your comments about not being able to upgrade it must have been working previously?

Basically, the more information you can give us, even if you think it unnecessary, the better we can identify what is going on.

jrose84 commented 7 years ago

Q: You run it interactively as root and get no errors and it continues running? A: I get no errors, but it doesn't continue running; for example...

    # ps -ef | grep lw
    root 28581 20228 0 09:06:56 pts/30 0:00 grep lw
    # /opt/pbis/sbin/lwsmd --start-as-daemon
    Error: ERROR_GEN_FAILURE (31)
    # ps -ef | grep lw
    root 14859 20228 0 09:07:26 pts/30 0:00 grep lw
    # /opt/pbis/sbin/lwsmd
    20170905090730:ALWAYS: Logging started
    20170905090730:INFO: Likewise Service Manager starting up
    20170905090730:INFO:lwsm-ipc: Listener started
    20170905090730:VERBOSE: Starting IPC server
    20170905090730:INFO:lwsm-ipc: Listening on endpoint /var/lib/pbis/.lwsm
    20170905090730:INFO:lwsm-ipc: Listener started
    20170905090730:INFO:lwsm-ipc: Listening on endpoint /var/lib/pbis/.lwsc
    20170905090730:INFO:lwsm-ipc: Listener started
    20170905090730:VERBOSE: Bootstrapping
    20170905090730:INFO: Starting service: lwreg
    20170905090730:VERBOSE: Populating service table
    # ps -ef | grep lw
    root 19804 20228 0 09:07:35 pts/30 0:00 grep lw

Q: Was the output you listed above from the /opt/pbis/sbin/lwsmd just a subset? A: no, that was the full output

Q: But still you get errors when you try to access pbis commands? A: yes, single sign-on does not work and error messages occur when running pbis commands

jrose84 commented 7 years ago

Does that sufficiently answer your questions? If not, let me know what else you need for clarification.

RBoulton-BT commented 7 years ago

This is going to be an iterative process.

Does lwsmd produce a core file?

You didn't mention in your reply whether this has ever worked on this environment?

jrose84 commented 7 years ago

Yes, this has worked previously. We implemented PBIS this past April and it has worked up until Sept 1. Yes, I see a core file generated after running the command. I ran pmap and pstack against it and don't see anything conclusive. Suggestions?

RBoulton-BT commented 7 years ago

When you say nothing conclusive, do you mean there are no symbols, just ??????

I'm guessing the answer is going to be that age-old "nothing changed", but if it was working and stopped, usually something changed. Any idea what might have changed on Sep 1st? Patches applied etc.

jrose84 commented 7 years ago

```
core 'core' of 10411:   prstat -p  1 1
 ffffffff7dc39cdc strtol (0, ffffffff7ffff898, a, 2400, 0, ffffffff7f108334) + 44
 0000000100008108 Atoi (0, 1000090a8, 0, 1, 1, 0) + 18
 00000001000057c0 ???????? (10010acc0, 100009, 100000, 100116bd0, 100009000, 100009)
 0000000100005ee8 main (10010a, ffffffff7ffffac8, 5, 10010ae20, 100005, 4000) + 3fc
 00000001000024bc _start (5, ffffffff7ffffac8, ffffffff7ffffaf8, 0, 103d9c, ffffffff7ffffac8) + 17c
```

jrose84 commented 7 years ago

```
core 'core' of 10411:   prstat -p  1 1
0000000100000000         40K r-x--  /usr/bin/sparcv9/prstat
000000010010A000          8K rwx--  /usr/bin/sparcv9/prstat
000000010010C000       3024K rwx--    [ heap ]
FFFFFFFF7D900000          8K r-x--  /platform/sun4v/lib/sparcv9/libc_psr.so.1
FFFFFFFF7DA00000         64K rwx--
FFFFFFFF7DB00000          8K rwx--
FFFFFFFF7DC00000       1216K r-x--  /lib/sparcv9/libc.so.1
FFFFFFFF7DD30000         56K r-x--  /lib/sparcv9/libc.so.1
FFFFFFFF7DE3E000          8K rwx--  /lib/sparcv9/libc.so.1
FFFFFFFF7DE40000         64K rwx--  /lib/sparcv9/libc.so.1
FFFFFFFF7DE50000          8K rwx--  /lib/sparcv9/libc.so.1
FFFFFFFF7E000000          8K rw---
FFFFFFFF7E100000          8K rw---
FFFFFFFF7E200000          8K rwx--
FFFFFFFF7E300000          8K rwx--
FFFFFFFF7E400000         24K rwx--
FFFFFFFF7E500000          8K rwx--
FFFFFFFF7E600000       1216K r-x--
FFFFFFFF7E730000         56K r-x--
FFFFFFFF7E83E000          8K rwx--
FFFFFFFF7E840000         64K rwx--
FFFFFFFF7E850000          8K rwx--
FFFFFFFF7EA00000          8K r-x--
FFFFFFFF7EB02000          8K rwx--
FFFFFFFF7ED00000          8K rw---
FFFFFFFF7EE00000          8K rw---
FFFFFFFF7EF00000          8K rwx--
FFFFFFFF7F000000          8K rwx--
FFFFFFFF7F0E2000          8K r----
FFFFFFFF7F100000        192K r-x--  /lib/sparcv9/ld.so.1
FFFFFFFF7F130000         48K r-x--  /lib/sparcv9/ld.so.1
FFFFFFFF7F23C000         24K rwx--  /lib/sparcv9/ld.so.1
FFFFFFFF7F300000        192K r-x--
FFFFFFFF7F330000         48K r-x--
FFFFFFFF7F43C000         24K rwx--
FFFFFFFF7F51C000          8K r----
FFFFFFFF7F5DE000          8K r----
FFFFFFFF7F600000         48K r-x--
FFFFFFFF7F70C000          8K rwx--
FFFFFFFF7FFF0000         64K rw---    [ stack ]
         total         6640K
```

jrose84 commented 7 years ago

To my knowledge, no patches or OS level changes have been done. However, that doesn't mean there isn't a rogue element involved. The problem is, how to resolve. One other thing I can see is that there appears to be a missing file in /var/lib/pbis

    ls -l /var/lib/pbis
    total 120
    drwx------ 2 root root 5 Sep 1 11:30 db
    -rw-r--r-- 1 root root 145 Aug 29 03:48 krb5-affinity.conf
    -rw-r--r-- 1 root other 39916 Dec 6 2016 lwconfig.xml
    -rw-r--r-- 1 root other 13244 Dec 6 2016 lwreport.xml
    drwxr-xr-x 2 root other 3 Apr 13 17:17 rpc
    drwxr-x--- 2 root root 2 Sep 1 11:37 syslog-reaper
    drwxr-xr-x 2 root root 4 Apr 11 10:13 uninstall

On other servers where PBIS works, I can see there is a krb5cc_lsass.$DOMAIN file, whereas the "problem" server does not have this file.

RBoulton-BT commented 7 years ago

I assume you originally followed the instructions in the guide Installing the Agent in Solaris Zones.

We're just looking into whether running /opt/pbis/bin/postinstall.sh again is safe and would resolve any issues.

RBoulton-BT commented 7 years ago

Yes /var/lib/pbis is where the PBIS configuration is stored. If this is missing then it could well explain these issues.

jrose84 commented 7 years ago

Can you provide the documentation (URL) for installing the agent in Solaris Zones? I want to confirm that for you. As for the missing file in /var/lib/pbis, is there a way to regenerate it?

RBoulton-BT commented 7 years ago

https://github.com/BeyondTrust/pbis-open/wiki/docs/PBIS_Installation_Guide.pdf see page 37

jrose84 commented 7 years ago

Thanks. I reviewed the documentation and am fairly certain the answer is no. Keep in mind we've only installed the PBIS agent in non-global zones and have run the PBIS install only from said zone. Not sure if that changes anything (I have to assume not, since PBIS was working up until Sept 1).

jrose84 commented 7 years ago

The missing file has been restored from backup, but the service is still failing.

    bash-3.2# tail -f /var/svc/log/network-lwsmd\:default.log
    [ Sep 5 10:35:34 Enabled. ]
    [ Sep 5 10:35:35 Executing start method ("/opt/pbis/sbin/lwsmd --start-as-daemon") ]
    Error: ERROR_GEN_FAILURE (31)
    [ Sep 5 10:35:35 Method "start" exited with status 1 ]
    [ Sep 5 10:36:03 Leaving maintenance because clear requested. ]
    [ Sep 5 10:36:03 Enabled. ]
    [ Sep 5 10:36:03 Executing start method ("/opt/pbis/sbin/lwsmd --start-as-daemon") ]
    Error: ERROR_GEN_FAILURE (31)
    [ Sep 5 10:36:03 Method "start" exited with status 1 ]
    [ Sep 5 10:36:10 Rereading configuration. ]

rbest-bt commented 7 years ago

What type of zone is configured? Whole-Root or sparse?

The behavior looks like that of a zone left in a broken, not-joined state. I would recommend clearing the maintenance state, then running:

    /opt/pbis/bin/postinstall.sh

Then issue a new domainjoin.

jrose84 commented 7 years ago

Whole-root zone. What impact would running the postinstall script have? Since this is a production system, do we need to consider scheduling a maintenance window or is the impact minimal? Will a zone restart be required like the initial install?

Also, since the state cannot be cleared, can I assume you mean disable the service?

rbest-bt commented 7 years ago

Running postinstall should not cause any new issues. If this does not work to get lwsmd started then I would recommend doing a reinstall:

This should get it back to a working state.

RBoulton-BT commented 7 years ago

On the restart question: the install/domainjoin always recommends a restart because we are changing the NSS providers, and some long-running services may not always pick up the change on all platforms. This is a "best practice" and means we have fewer support calls where some long-running services don't see the users and we then have to say reboot.

In your case (and most installations) as this has already been installed and running there should be no requirement for a restart.

Basically we want the service to be stopped but not tagged in Maintenance mode. So I guess to avoid the service trying and failing to start on a clear we may need to disable it.

jrose84 commented 7 years ago

Executing postinstall.sh yielded errors. Log output: https://pastebin.com/vvR7LtFY

From what I am seeing, it seems to me like we're going to need to do a fresh install, which, to my recollection, required a zone reboot to go into effect. Please review the log output and confirm.

jrose84 commented 7 years ago

Also, on a side note, there are non-production non-global zones where PBIS is broken as well. I am testing the above process there first and noticing the following when I attempt to join the domain.

In syslog...

    Sep 5 11:56:19 kwidev30 lsass: [ID 889572 daemon.error] [lsass] Failed to run provider specific request (request code = 12, provider = 'lsa-activedirectory-provider') -> error = 2692, symbol = NERR_SetupNotJoined, client pid = -1

When I execute the join command:

    bash-3.2# domainjoin-cli join kwitzg.com administrator
    Error: LW_ERROR_ERRNO_ELOOP [code 0x00009d02]

My concern is that if the issue cannot be resolved in Dev, it would at minimum not be fixed in prod or, at worst, break production further (we noticed SSH authentication broke entirely on Sept 1 and were forced to restart the zone, which is when we noticed the PBIS agent was broken).

jrose84 commented 7 years ago

Still not working after latest round of troubleshooting as follows:

  1. uninstall & purge PBIS
  2. reboot zone
  3. install pbis

Same error as in the previous comment.

RBoulton-BT commented 7 years ago

I've just searched our bug database and nobody has ever reported that error message. LW_ERROR_ERRNO_ELOOP means we are getting an ELOOP errno which seems to relate to "Too many levels of symbolic links".

jrose84 commented 7 years ago

OK, so how do I proceed? As far as I can tell, there are no excessive symbolic links.

RBoulton-BT commented 7 years ago

What we usually do for this type of problem is to run the /opt/pbis/libexec/pbis-support.pl script with the -dj option to get a debug log of the domainjoin-cli process.

It will prompt you for the various domainjoin-cli parameters and produce a support pack with debug logging enabled.

If you let us have the /tmp/pbis-support.tar.gz we can see if it sheds any light on the join issue.

jrose84 commented 7 years ago

Executed bash-3.2# /opt/pbis/libexec/pbis-support.pl -dj...

jrose84 commented 7 years ago

Will advise once script finishes

jrose84 commented 7 years ago

@RBoulton-BT to confirm, I am doing the following:

  1. /opt/pbis/libexec/pbis-support.pl -dj
  2. Enter domain and applicable credentials to join to domain
  3. Let perl script finish and provide compressed tarball to you.

Follow-up question...how shall I submit the tarball?

RBoulton-BT commented 7 years ago

We have an openproject@beyondtrust.com email address that we can access. If you attach it to an email to that address (or, if it's too large, send a link to it), we should be able to access it.

jrose84 commented 7 years ago

tarball sent to openproject@beyondtrust.com

RBoulton-BT commented 7 years ago

So looking at the log for this I believe it may be the default_keytab_name in the /etc/krb5.conf resulting in the ELOOP issue. So could you check what that is set to point to and confirm if the symlinks for that file/dir look ok?
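One way to check a path such as the `default_keytab_name` target for a symlink cycle is to follow the links by hand with a hop limit. The following is a minimal sketch, assuming a `readlink` utility is on the PATH (it may not be on a stock Solaris 10 install, where running `ls -l` on each link gives the same information). The function name `check_link_chain` and the limit of 20 hops are hypothetical choices for illustration.

```shell
#!/bin/bash
# Sketch: follow a symlink chain manually and report how it ends.
# check_link_chain and the 20-hop limit are illustrative, not part of PBIS.
check_link_chain() {
    path=$1
    hops=0
    while [ -L "$path" ]; do
        hops=$((hops + 1))
        # Past the hop limit the chain is treated as a cycle.
        if [ "$hops" -gt 20 ]; then
            echo "loop: $1"
            return 1
        fi
        target=$(readlink "$path")
        case $target in
            /*) path=$target ;;                        # absolute link target
            *)  path=$(dirname "$path")/$target ;;     # relative to the link's directory
        esac
    done
    if [ -e "$path" ]; then
        echo "ok: $1 -> $path"
    else
        echo "dangling: $1 -> $path"
        return 1
    fi
}
```

For example, `check_link_chain /etc/krb5.keytab` prints a line starting with `loop:` when two links point at each other, and a line starting with `ok:` once the chain ends at a real file.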

jrose84 commented 7 years ago

    -bash-3.2$ more /etc/krb5.conf
    [libdefaults]
        default_tgs_enctypes = AES256-CTS AES128-CTS RC4-HMAC DES-CBC-MD5 DES-CBC-CRC
        default_tkt_enctypes = AES256-CTS AES128-CTS RC4-HMAC DES-CBC-MD5 DES-CBC-CRC
        preferred_enctypes = AES256-CTS AES128-CTS RC4-HMAC DES-CBC-MD5 DES-CBC-CRC
        dns_lookup_kdc = true
        pkinit_kdc_hostname =
        pkinit_anchors = DIR:/var/lib/pbis/trusted_certs
        pkinit_cert_match = &&msScLogin
        pkinit_eku_checking = kpServerAuth
        pkinit_win2k_require_binding = false
        pkinit_identities = PKCS11:/opt/pbis/lib/libpkcs11.so
        default_keytab_name = /etc/krb5/krb5.keytab
    [domain_realm]
        .kwitzg.com = KWITZG.COM
    [realms]
        KWITZG.COM = {
            auth_to_local = RULE:1:$0\$1s/^KWITZG.COM/KWITZG/
            auth_to_local = DEFAULT
        }
    [capaths]
    [appdefaults]
        pam = {
            mappings = KWITZG\(.) $1@KWITZG.COM
            forwardable = true
            validate = true
        }
        httpd = {
            mappings = KWITZG\(.) $1@KWITZG.COM
            reverse_mappings = (.*)@KWITZG.COM KWITZG\$1
        }
    -bash-3.2$ ls -l /etc/krb5/krb5.keytab
    lrwxrwxrwx 1 root root 16 Jan 25 2017 /etc/krb5/krb5.keytab -> /etc/krb5.keytab
    -bash-3.2$ ls -l /etc/krb5.keytab
    lrwxrwxrwx 1 root root 21 Sep 5 11:27 /etc/krb5.keytab -> /etc/krb5/krb5.keytab

jrose84 commented 7 years ago

I see what you're referring to regarding an "infinite" loop...question is why/how did this happen and how to fix it. Suggestions?

RBoulton-BT commented 7 years ago

It looks like our assumption in the code is that /etc/krb5/krb5.keytab is the real file and we need /etc/krb5.keytab to point to the real file so we create the symlink /etc/krb5.keytab -> /etc/krb5/krb5.keytab.

Unfortunately it looks like in your case you actually had the reverse and we created an infinite loop. I'll add a bug for this so we check correctly for this scenario.

For now I think the easiest solution is to just rm the /etc/krb5/krb5.keytab link and use touch to create an empty file /etc/krb5/krb5.keytab and retry the join.
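The workaround described above can be sketched as a small function. It is only a sketch under assumptions: `repair_keytab_link` is a hypothetical name, and the path is passed as an argument so the logic can be tried on a scratch directory before touching `/etc/krb5/krb5.keytab`.

```shell
#!/bin/bash
# Sketch: replace a symlink that does not resolve with an empty regular file.
# repair_keytab_link is an illustrative name, not part of PBIS.
repair_keytab_link() {
    keytab=$1
    # -L is true for any symlink; -e follows links, so it is false
    # for a cycle (ELOOP) as well as for a dangling link.
    if [ -L "$keytab" ] && [ ! -e "$keytab" ]; then
        rm "$keytab" && touch "$keytab" && echo "replaced"
    else
        echo "unchanged"
    fi
}
```

On the affected host this would be run as root, for example `repair_keytab_link /etc/krb5/krb5.keytab`, followed by another `domainjoin-cli join` attempt. The function leaves a regular file or a working symlink untouched, so running it twice is harmless.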

jrose84 commented 7 years ago

PBIS is working correctly on the DEV host. I will add this to internal documentation on troubleshooting needed going forward. Thanks for the assistance.