dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15k stars 4.67k forks source link

Regression 2.1.13: AuthenticationException in Centos 7 using WCF #31110

Open Cronan opened 4 years ago

Cronan commented 4 years ago

Describe the bug I use the WCF client from .NET Core to access Windows WCF services from Linux. Everything works correctly using .NET Core 2.1.3

Upgrading to 2.1.13 results in the exception below when calling the WCF client. I also see the same problem with .NET Core 2.2 latest or 3.0 latest.

Expected behavior I use the WCF client from .NET Core to access Windows WCF services from Linux. Everything works correctly using .NET Core 2.1.3

I expected this to continue working in newer versions of .NET Core.

Additional context I use kinit before making this call to ensure that I'm authenticated correctly. Am I missing some dependencies, or could this be something else?

Linux Version

cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

.NET Core Version

dotnet --info
.NET Core SDK (reflecting any global.json):
 Version:   2.1.802
 Commit:    177d0b2525

Runtime Environment:
 OS Name:     centos
 OS Version:  7
 OS Platform: Linux
 RID:         centos.7-x64
 Base Path:   /usr/share/dotnet/sdk/2.1.802/

Host (useful for support):
  Version: 2.1.13
  Commit:  1a165a1588

.NET Core SDKs installed:
  2.1.802 [/usr/share/dotnet/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.13 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.13 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.13 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

Full Stack trace

System.ComponentModel.Win32Exception: GSSAPI operation failed with error - Unspecified GSS failure.  Minor code may provide more information (Server not found in Kerberos database).

The above exception was the direct cause of the following exception:

System.Security.Authentication.AuthenticationException: Authentication failed, see inner exception.
   at System.Net.Security.NegoState.StartSendAuthResetSignal(LazyAsyncResult lazyResult, Byte[] message, Exception exception)
   at System.Net.Security.NegoState.StartSendBlob(Byte[] message, LazyAsyncResult lazyResult)
   at System.Net.Security.NegoState.ProcessAuthentication(LazyAsyncResult lazyResult)
   at System.Net.Security.NegotiateStream.BeginAuthenticateAsClient(NetworkCredential credential, ChannelBinding binding, String targetName, ProtectionLevel requiredProtectionLevel, TokenImpersonationLevel allowedImpersonationLevel, AsyncCallback asyncCallback, Object asyncState)
   at System.Net.Security.NegotiateStream.BeginAuthenticateAsClient(NetworkCredential credential, String targetName, ProtectionLevel requiredProtectionLevel, TokenImpersonationLevel allowedImpersonationLevel, AsyncCallback asyncCallback, Object asyncState)
   at System.Net.Security.NegotiateStream.<>c__DisplayClass27_0.<AuthenticateAsClientAsync>b__0(AsyncCallback callback, Object state)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncImpl(Func`3 beginMethod, Func`2 endFunction, Action`1 endAction, Object state, TaskCreationOptions creationOptions)
   at System.Net.Security.NegotiateStream.AuthenticateAsClientAsync(NetworkCredential credential, String targetName, ProtectionLevel requiredProtectionLevel, TokenImpersonationLevel allowedImpersonationLevel)
   at System.ServiceModel.Channels.WindowsStreamSecurityUpgradeProvider.WindowsStreamSecurityUpgradeInitiator.OnInitiateUpgradeAsync(Stream stream, OutWrapper`1 remoteSecurity)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
System.ServiceModel.Security.SecurityNegotiationException: Authentication failed, see inner exception.
   at System.ServiceModel.Channels.WindowsStreamSecurityUpgradeProvider.WindowsStreamSecurityUpgradeInitiator.OnInitiateUpgradeAsync(Stream stream, OutWrapper`1 remoteSecurity)
   at System.ServiceModel.Channels.StreamSecurityUpgradeInitiatorBase.InitiateUpgradeAsync(Stream stream)
   at System.ServiceModel.Channels.ConnectionUpgradeHelper.InitiateUpgradeAsync(StreamUpgradeInitiator upgradeInitiator, OutWrapper`1 connectionWrapper, ClientFramingDecoder decoder, IDefaultCommunicationTimeouts defaultTimeouts, TimeSpan timeout)
   at System.ServiceModel.Channels.ClientFramingDuplexSessionChannel.SendPreambleAsync(IConnection connection, ArraySegment`1 preamble, TimeSpan timeout)
   at System.ServiceModel.Channels.ConnectionPoolHelper.EstablishConnectionAsync(TimeSpan timeout)
   at System.ServiceModel.Channels.ClientFramingDuplexSessionChannel.OnOpenAsync(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.OnOpenAsyncInternal(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.System.ServiceModel.IAsyncCommunicationObject.OpenAsync(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.OpenAsyncInternal(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
   at System.Runtime.TaskHelpers.CallActionAsync[TArg](Action`1 action, TArg argument)
   at System.ServiceModel.Channels.CommunicationObject.OpenOtherAsync(ICommunicationObject other, TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.OnOpenAsync(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.OnOpenAsyncInternal(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.System.ServiceModel.IAsyncCommunicationObject.OpenAsync(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.OpenAsyncInternal(TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.CallOpenOnce.System.ServiceModel.Channels.ServiceChannel.ICallOnce.Call(ServiceChannel channel, TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.CallOnceManager.CallOnce(TimeSpan timeout, CallOnceManager cascade)
   at System.ServiceModel.Channels.ServiceChannel.EnsureOpened(TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(MethodCall methodCall, ProxyOperationRuntime operation)
   at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(MethodInfo targetMethod, Object[] args)
--- End of stack trace from previous location where exception was thrown ---
   at System.Reflection.DispatchProxyGenerator.Invoke(Object[] args)
   at generatedProxy_1.GetById(Int32 )
   at MyApp.Client.Connections.BaseConnection`1.ExecuteOnSingleUseChannel[T](Func`2 activity)
Cronan commented 4 years ago

Crossposted from here

mconnew commented 4 years ago

Moved to corefx repo as this isn't a bug/change introduced in WCF code. @davidsh, are you aware of any changes in NegotiateStream between 2.1.3 and 2.1.13 which would have caused this? @Cronan, if this works with 2.1.3 on the same host, then that should rule out any problems with kinit. I also wouldn't expect an error saying it couldn't find the server in the database if there was a problem with kinit as that implies it's communicating with the Kerberos server successfully.

Cronan commented 4 years ago

Thanks for the reply, and thanks for moving it to the correct place. Yes, I’ve tested this on several hosts, both docker containers and Linux headnodes, and the behaviour is consistent. Let me know right away if there’s anything I can provide to help diagnose the issue.

On 8 Oct 2019, at 21:52, Matt Connew notifications@github.com wrote:

 Moved to corefx repo as this isn't a bug/change introduced in WCF code. @davidsh, are you aware of any changes in NegotiateStream between 2.1.3 and 2.1.13 which would have caused this? @Cronan, if this works with 2.1.3 on the same host, then that should rule out any problems with kinit. I also wouldn't expect an error saying it couldn't find the server in the database if there was a problem with kinit as that implies it's communicating with the Kerberos server successfully.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

mconnew commented 4 years ago

@Cronan, it might be useful to find the exact version which caused the break. I wasn't going to suggest that as it can be a lot of work and it's possible that the appropriate devs might know which changes may have broken this, but if you have it working in docker, it might not be a lot of work. Especially if you do a binary search through the versions. Something quicker that would likely be useful is to capture a Kerberos trace for when it's working and when it's not working. You can do this by setting the environment variable KRB5_TRACE before you run your code to capture Kerberos tracing. You can do one of the following:

Output log to stdout

export KRB5_TRACE = /dev/stdout

Output log to file

export KRB5_TRACE = ~/krb5_trace.log
davidsh commented 4 years ago

@davidsh, are you aware of any changes in NegotiateStream between 2.1.3 and 2.1.13 which would have caused this?

Thanks for bringing this issue to my attention. Yes, there were changes between 2.1.3 and 2.1.13. We ported some fixes from .NET Core 3.0 into the latest servicing releases for 2.1 and 2.2. These were high priority Linux Kerberos related fixes.

See: dotnet/corefx#40109

I'll need to dig into your particular scenario to see why it appears to have been regressed by these changes.

davidsh commented 4 years ago

Can you please provide a repro for this? Or describe in great detail the scenario, networking environment and especially how Kerberos is being used. What OS is the server? What OS is the client?

For the Linux client, please include the krb5.conf file on the machine and a KRB5_TRACE.LOG as described above. But, please don't post any confidential information to this GitHub issue. If necessary, you can mask off any information in the traces that is confidential.

Cronan commented 4 years ago

Thank you, both great suggestions, I’ll try first thing tomorrow.

I’m finding it hard to find up to date Kerberos rpms, and the details of working with the tarballs are not very complete, do you have some tips for trying other versions that might be quicker?

On 8 Oct 2019, at 22:53, Matt Connew notifications@github.com wrote:

 @Cronan, it might be useful to find the exact version which caused the break. I wasn't going to suggest that as it can be a lot of work and it's possible that the appropriate devs might know which changes may have broken this, but if you have it working in docker, it might not be a lot of work. Especially if you do a binary search through the versions. Something quicker that would likely be useful is to capture a Kerberos trace for when it's working and when it's not working. You can do this by setting the environment variable KRB5_TRACE before you run your code to capture Kerberos tracing. You can do one of the following:

Output log to stdout

export KRB5_TRACE = /dev/stdout Output log to file

export KRB5_TRACE = ~/krb5_trace.log — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

mconnew commented 4 years ago

@Cronan, I wouldn't try changing your Kerberos stack at this point, it will just add noise. Things were working with .NET Core 2.1.3 so unless we find a reason to believe there's an issue with the version of Kerberos you are running with, just leave everything which isn't .NET Core in the same state as when it's working with .NET Core 2.1.3.

mconnew commented 4 years ago

@davidsh, He's provided details of the client in the initial bug description. He also explained the server is a Windows WCF service. Does the specific Windows version matter?

@Cronan, for your WCF client, are you explicitly specifying a service identity? You would have a reference to the one of the classes SpnEndpointIdentity, UpnEndpointIdentity or DnsEndpointIdentity. If you don't have a reference to any of these in your code, we default to using a targetName of host/hostname where hostname is extracted from the uri of the service you are connecting to. So depending on your the address you provided to WCF it might not be a FQDN.

davidsh commented 4 years ago

@davidsh, He's provided details of the client in the initial bug description. He also explained the server is a Windows WCF service. Does the specific Windows version matter?

No. But we need a repro. At the very least, snippets of WCF code for the server and client and how the service is configured.

Also, can you please get KRB5_TRACES for both the working scenario (2.1.3) and broken scenario (2.1.13)?

As a workaround can you try setting this environment variable on the Linux client? It will enable the 'legacy' HTTP stack (uses libcurl on Linux). I'm curious if it will help with this problem at least as a workaround for now:

export DOTNET_SYSTEM_NET_HTTP_USESOCKETSHTTPHANDLER=0
mconnew commented 4 years ago

@davidsh, this is not using HTTP, this is NegotiateStream.

davidsh commented 4 years ago

@davidsh, this is not using HTTP, this is NegotiateStream.

Oh. Good point. Yes, setting that environment variable won't change anything.

Cronan commented 4 years ago

@Cronan, I wouldn't try changing your Kerberos stack at this point, it will just add noise. Things were working with .NET Core 2.1.3 so unless we find a reason to believe there's an issue with the version of Kerberos you are running with, just leave everything which isn't .NET Core in the same state as when it's working with .NET Core 2.1.3.

Sorry, I actually meant Centos rpms, I was typing that on a phone ...

Cronan commented 4 years ago
cat /etc/krb5.conf
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = MY.DOMAIN.COM
 dns_lookup_realm = false
 dns_lookup_kdc = true
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true
 default_tkt_enctypes = rc4-hmac
 default_tgs_enctypes = rc4-hmac

[realms]
 MY.DOMAIN.COM = {
  kdc = uk-ldap.MY.DOMAIN.COM:88
  admin_server = MY.DOMAIN.COM:749
  default_domain = MY.DOMAIN.COM
 }

[domain_realm]
 .myotherdomain.com = MY.DOMAIN.COM
 MY.DOMAIN.COM = MY.DOMAIN.COM
 .MY.DOMAIN.COM = MY.DOMAIN.COM
 myotherdomain.com = MY.DOMAIN.COM
Cronan commented 4 years ago

KRB5_TRACE.LOG for 2.1.3 (working scenario):

cat krb5_trace.log
[9407] 1570605553.815700: ccselect module realm chose cache FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc with client principal icronyn@MY.DOMAIN.COM for server principal tcp/mywindowsservice-prod.MY.DOMAIN.COM:8184@MY.DOMAIN.COM
[9407] 1570605553.815701: Getting credentials icronyn@MY.DOMAIN.COM -> tcp/mywindowsservice-prod.MY.DOMAIN.COM:8184@MY.DOMAIN.COM using ccache FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc
[9407] 1570605553.815702: Retrieving icronyn@MY.DOMAIN.COM -> tcp/mywindowsservice-prod.MY.DOMAIN.COM:8184@MY.DOMAIN.COM from FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc with result: 0/Success
[9407] 1570605553.815704: Creating authenticator for icronyn@MY.DOMAIN.COM -> tcp/mywindowsservice-prod.MY.DOMAIN.COM:8184@MY.DOMAIN.COM, seqnum 823849255, subkey rc4-hmac/23EC, session key rc4-hmac/B264
[9407] 1570605553.815709: Read AP-REP, time 1570605553.815705, subkey rc4-hmac/23EC, seqnum 1116169193
Cronan commented 4 years ago

KRB5_TRACE.LOG for 2.1.13 (failing scenario):

cat krb5_trace.log 
[16826] 1570607597.439200: ccselect module realm chose cache FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc with client principal icronyn@MY.DOMAIN.COM for server principal tcp/mywindowsserviceothername.MY.DOMAIN.COM:8184@MY.DOMAIN.COM
[16826] 1570607597.439201: Getting credentials icronyn@MY.DOMAIN.COM -> tcp/mywindowsserviceothername.MY.DOMAIN.COM:8184@MY.DOMAIN.COM using ccache FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc
[16826] 1570607597.439202: Retrieving icronyn@MY.DOMAIN.COM -> tcp/mywindowsserviceothername.MY.DOMAIN.COM:8184@MY.DOMAIN.COM from FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc with result: -1765328243/Matching credential not found (filename: /var/opt/thinlinc/sessions/icronyn/10/krb5cc)
[16826] 1570607597.439203: Retrieving icronyn@MY.DOMAIN.COM -> krbtgt/MY.DOMAIN.COM@MY.DOMAIN.COM from FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc with result: 0/Success
[16826] 1570607597.439204: Starting with TGT for client realm: icronyn@MY.DOMAIN.COM -> krbtgt/MY.DOMAIN.COM@MY.DOMAIN.COM
[16826] 1570607597.439205: Requesting tickets for tcp/mywindowsserviceothername.MY.DOMAIN.COM:8184@MY.DOMAIN.COM, referrals on
[16826] 1570607597.439206: Generated subkey for TGS request: rc4-hmac/1EBF
[16826] 1570607597.439207: etypes requested in TGS request: rc4-hmac
[16826] 1570607597.439209: Encoding request body and padata into FAST request
[16826] 1570607597.439210: Sending request (5328 bytes) to MY.DOMAIN.COM
[16826] 1570607597.439211: Resolving hostname uk-ldap.MY.DOMAIN.COM
[16826] 1570607597.439212: Initiating TCP connection to stream 10.194.10.221:88
[16826] 1570607597.439213: Sending TCP request to stream 10.194.10.221:88
[16826] 1570607597.439214: Received answer (406 bytes) from stream 10.194.10.221:88
[16826] 1570607597.439215: Terminating TCP connection to stream 10.194.10.221:88
[16826] 1570607597.439216: Sending DNS URI query for _kerberos.MY.DOMAIN.COM.
[16826] 1570607597.439217: No URI records found
[16826] 1570607597.439218: Sending DNS SRV query for _kerberos-master._udp.MY.DOMAIN.COM.
[16826] 1570607597.439219: Sending DNS SRV query for _kerberos-master._tcp.MY.DOMAIN.COM.
[16826] 1570607597.439220: No SRV records found
[16826] 1570607597.439221: Response was not from master KDC
[16826] 1570607597.439222: Decoding FAST response
[16826] 1570607597.439223: TGS request result: -1765328377/Server not found in Kerberos database
[16826] 1570607597.439224: Requesting tickets for tcp/mywindowsserviceothername.MY.DOMAIN.COM:8184@MY.DOMAIN.COM, referrals off
[16826] 1570607597.439225: Generated subkey for TGS request: rc4-hmac/D1CF
[16826] 1570607597.439226: etypes requested in TGS request: rc4-hmac
[16826] 1570607597.439228: Encoding request body and padata into FAST request
[16826] 1570607597.439229: Sending request (5328 bytes) to MY.DOMAIN.COM
[16826] 1570607597.439230: Resolving hostname uk-ldap.MY.DOMAIN.COM
[16826] 1570607597.439231: Initiating TCP connection to stream 10.192.10.221:88
[16826] 1570607597.439232: Sending TCP request to stream 10.192.10.221:88
[16826] 1570607597.439233: Received answer (406 bytes) from stream 10.192.10.221:88
[16826] 1570607597.439234: Terminating TCP connection to stream 10.192.10.221:88
[16826] 1570607597.439235: Sending DNS URI query for _kerberos.MY.DOMAIN.COM.
[16826] 1570607597.439236: No URI records found
[16826] 1570607597.439237: Sending DNS SRV query for _kerberos-master._udp.MY.DOMAIN.COM.
[16826] 1570607597.439238: Sending DNS SRV query for _kerberos-master._tcp.MY.DOMAIN.COM.
[16826] 1570607597.439239: No SRV records found
[16826] 1570607597.439240: Response was not from master KDC
[16826] 1570607597.439241: Decoding FAST response
[16826] 1570607597.439242: TGS request result: -1765328377/Server not found in Kerberos database
[17052] 1570607597.686051: Getting initial credentials for icronyn@MY.DOMAIN.COM
[17052] 1570607597.686053: Sending unauthenticated request
[17052] 1570607597.686054: Sending request (199 bytes) to MY.DOMAIN.COM
[17052] 1570607597.686055: Resolving hostname uk-ldap.MY.DOMAIN.COM
[17052] 1570607597.686056: Sending initial UDP request to dgram 10.192.10.222:88
[17052] 1570607597.686057: Received answer (207 bytes) from dgram 10.192.10.222:88
[17052] 1570607597.686058: Sending DNS URI query for _kerberos.MY.DOMAIN.COM.
[17052] 1570607597.686059: No URI records found
[17052] 1570607597.686060: Sending DNS SRV query for _kerberos-master._udp.MY.DOMAIN.COM.
[17052] 1570607597.686061: Sending DNS SRV query for _kerberos-master._tcp.MY.DOMAIN.COM.
[17052] 1570607597.686062: No SRV records found
[17052] 1570607597.686063: Response was not from master KDC
[17052] 1570607597.686064: Received error from KDC: -1765328359/Additional pre-authentication required
[17052] 1570607597.686067: Preauthenticating using KDC method data
[17052] 1570607597.686068: Processing preauth types: 16, 15, 11, 19, 2
[17052] 1570607597.686069: Selected etype info: etype rc4-hmac, salt "", params ""
[17052] 1570607597.686070: PKINIT client has no configured identity; giving up
[17052] 1570607597.686071: PKINIT client has no configured identity; giving up
[17052] 1570607597.686072: Preauth module pkinit (16) (real) returned: 22/Invalid argument
[17052] 1570607597.686073: PKINIT client ignoring draft 9 offer from RFC 4556 KDC
[17052] 1570607597.686074: Preauth module pkinit (15) (real) returned: -1765328360/Preauthentication failed
[17052] 1570607597.686075: AS key obtained for encrypted timestamp: rc4-hmac/E3F4
[17052] 1570607597.686077: Encrypted timestamp (for 1570607597.820889): plain 301AA011180F32303139313030393037353331375AA10502030C8699, encrypted 92D1E0DCC1611B46145C9DA2A5AB09A6D2BBF2F7F7195A523D12DF00A7F030DFA688A310727421687A2D3DADC17969D0D7A665BF
[17052] 1570607597.686078: Preauth module encrypted_timestamp (2) (real) returned: 0/Success
[17052] 1570607597.686079: Produced preauth for next request: 2
[17052] 1570607597.686080: Sending request (275 bytes) to MY.DOMAIN.COM
[17052] 1570607597.686081: Resolving hostname uk-ldap.MY.DOMAIN.COM
[17052] 1570607597.686082: Sending initial UDP request to dgram 10.194.0.87:88
[17052] 1570607597.686083: Received answer (122 bytes) from dgram 10.194.0.87:88
[17052] 1570607597.686084: Sending DNS URI query for _kerberos.MY.DOMAIN.COM.
[17052] 1570607597.686085: No URI records found
[17052] 1570607597.686086: Sending DNS SRV query for _kerberos-master._udp.MY.DOMAIN.COM.
[17052] 1570607597.686087: Sending DNS SRV query for _kerberos-master._tcp.MY.DOMAIN.COM.
[17052] 1570607597.686088: No SRV records found
[17052] 1570607597.686089: Response was not from master KDC
[17052] 1570607597.686090: Received error from KDC: -1765328332/Response too big for UDP, retry with TCP
[17052] 1570607597.686091: Request or response is too big for UDP; retrying with TCP
[17052] 1570607597.686092: Sending request (275 bytes) to MY.DOMAIN.COM (tcp only)
[17052] 1570607597.686093: Resolving hostname uk-ldap.MY.DOMAIN.COM
[17052] 1570607597.686094: Initiating TCP connection to stream 10.192.10.222:88
[17052] 1570607597.686095: Sending TCP request to stream 10.192.10.222:88
[17052] 1570607597.686096: Received answer (5100 bytes) from stream 10.192.10.222:88
[17052] 1570607597.686097: Terminating TCP connection to stream 10.192.10.222:88
[17052] 1570607597.686098: Sending DNS URI query for _kerberos.MY.DOMAIN.COM.
[17052] 1570607597.686099: No URI records found
[17052] 1570607597.686100: Sending DNS SRV query for _kerberos-master._tcp.MY.DOMAIN.COM.
[17052] 1570607597.686101: No SRV records found
[17052] 1570607597.686102: Response was not from master KDC
[17052] 1570607597.686103: Salt derived from principal: MY.DOMAIN.COMicronyn
[17052] 1570607597.686104: AS key determined by preauth: rc4-hmac/E3F4
[17052] 1570607597.686105: Decrypted AS reply; session key is: rc4-hmac/5E7D
[17052] 1570607597.686106: FAST negotiation: unavailable
[17052] 1570607597.686107: Initializing FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc with default princ icronyn@MY.DOMAIN.COM
[17052] 1570607597.686108: Storing icronyn@MY.DOMAIN.COM -> krbtgt/MY.DOMAIN.COM@MY.DOMAIN.COM in FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc
[17052] 1570607597.686109: Storing config in FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc for krbtgt/MY.DOMAIN.COM@MY.DOMAIN.COM: pa_type: 2
[17052] 1570607597.686110: Storing icronyn@MY.DOMAIN.COM -> krb5_ccache_conf_data/pa_type/krbtgt\/MY.DOMAIN.COM\@MY.DOMAIN.COM@X-CACHECONF: in FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc
[16826] 1570607597.439250: ccselect module realm chose cache FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc with client principal icronyn@MY.DOMAIN.COM for server principal tcp/mywindowsserviceothername.MY.DOMAIN.COM:8184@MY.DOMAIN.COM
[16826] 1570607597.439251: Getting credentials icronyn@MY.DOMAIN.COM -> tcp/mywindowsserviceothername.MY.DOMAIN.COM:8184@MY.DOMAIN.COM using ccache FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc
[16826] 1570607597.439252: Retrieving icronyn@MY.DOMAIN.COM -> tcp/mywindowsserviceothername.MY.DOMAIN.COM:8184@MY.DOMAIN.COM from FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc with result: -1765328243/Matching credential not found (filename: /var/opt/thinlinc/sessions/icronyn/10/krb5cc)
[16826] 1570607597.439253: Retrieving icronyn@MY.DOMAIN.COM -> krbtgt/MY.DOMAIN.COM@MY.DOMAIN.COM from FILE:/var/opt/thinlinc/sessions/icronyn/10/krb5cc with result: 0/Success
[16826] 1570607597.439254: Starting with TGT for client realm: icronyn@MY.DOMAIN.COM -> krbtgt/MY.DOMAIN.COM@MY.DOMAIN.COM
[16826] 1570607597.439255: Requesting tickets for tcp/mywindowsserviceothername.MY.DOMAIN.COM:8184@MY.DOMAIN.COM, referrals on
[16826] 1570607597.439256: Generated subkey for TGS request: rc4-hmac/3343
[16826] 1570607597.439257: etypes requested in TGS request: rc4-hmac
[16826] 1570607597.439259: Encoding request body and padata into FAST request
[16826] 1570607597.439260: Sending request (5328 bytes) to MY.DOMAIN.COM
[16826] 1570607597.439261: Resolving hostname uk-ldap.MY.DOMAIN.COM
[16826] 1570607597.439262: Initiating TCP connection to stream 10.192.10.222:88
[16826] 1570607597.439263: Sending TCP request to stream 10.192.10.222:88
[16826] 1570607597.439264: Received answer (406 bytes) from stream 10.192.10.222:88
[16826] 1570607597.439265: Terminating TCP connection to stream 10.192.10.222:88
[16826] 1570607597.439266: Sending DNS URI query for _kerberos.MY.DOMAIN.COM.
[16826] 1570607597.439267: No URI records found
[16826] 1570607597.439268: Sending DNS SRV query for _kerberos-master._udp.MY.DOMAIN.COM.
[16826] 1570607597.439269: Sending DNS SRV query for _kerberos-master._tcp.MY.DOMAIN.COM.
[16826] 1570607597.439270: No SRV records found
[16826] 1570607597.439271: Response was not from master KDC
[16826] 1570607597.439272: Decoding FAST response
[16826] 1570607597.439273: TGS request result: -1765328377/Server not found in Kerberos database
[16826] 1570607597.439274: Requesting tickets for tcp/mywindowsserviceothername.MY.DOMAIN.COM:8184@MY.DOMAIN.COM, referrals off
[16826] 1570607597.439275: Generated subkey for TGS request: rc4-hmac/FCED
[16826] 1570607597.439276: etypes requested in TGS request: rc4-hmac
[16826] 1570607597.439278: Encoding request body and padata into FAST request
[16826] 1570607597.439279: Sending request (5328 bytes) to MY.DOMAIN.COM
[16826] 1570607597.439280: Resolving hostname uk-ldap.MY.DOMAIN.COM
[16826] 1570607597.439281: Initiating TCP connection to stream 10.194.0.87:88
[16826] 1570607597.439282: Sending TCP request to stream 10.194.0.87:88
[16826] 1570607597.439283: Received answer (406 bytes) from stream 10.194.0.87:88
[16826] 1570607597.439284: Terminating TCP connection to stream 10.194.0.87:88
[16826] 1570607597.439285: Sending DNS URI query for _kerberos.MY.DOMAIN.COM.
[16826] 1570607597.439286: No URI records found
[16826] 1570607597.439287: Sending DNS SRV query for _kerberos-master._udp.MY.DOMAIN.COM.
[16826] 1570607597.439288: Sending DNS SRV query for _kerberos-master._tcp.MY.DOMAIN.COM.
[16826] 1570607597.439289: No SRV records found
[16826] 1570607597.439290: Response was not from master KDC
[16826] 1570607597.439291: Decoding FAST response
[16826] 1570607597.439292: TGS request result: -1765328377/Server not found in Kerberos database
Cronan commented 4 years ago

So, the first thing that jumps out to me is that in the working version the server principal used is mywindowsservice-prod, which is the alias used in the SPN and URI by the client to connect.

In the failing version, the server principal used is the server name, not the alias used as above.

SPN = "tcp/mywindowsservice-prod.my.domain.com:8184"
URI = "net.tcp://mywindowsservice-prod.my.domain.com:8184/MyDataService"
nslookup mywindowsservice-prod
Server:  machine123.myotherdomain.com
Address:  10.192.207.245

Non-authoritative answer:
Name:    mywindowsserviceothername.my.domain.com
Address:  10.192.226.115
Aliases:  mywindowsservice-prod.my.domain.com
Cronan commented 4 years ago

@davidsh, He's provided details of the client in the initial bug description. He also explained the server is a Windows WCF service. Does the specific Windows version matter?

@Cronan, for your WCF client, are you explicitly specifying a service identity? You would have a reference to the one of the classes SpnEndpointIdentity, UpnEndpointIdentity or DnsEndpointIdentity. If you don't have a reference to any of these in your code, we default to using a targetName of host/hostname where hostname is extracted from the uri of the service you are connecting to. So depending on your the address you provided to WCF it might not be a FQDN.

Sample code below:

var identity = new SpnEndpointIdentity("tcp/mywindowsservice-prod.my.domain.com:8184");
TokenImpersonationLevel tokenImpersonationLevel = TokenImpersonationLevel.Identification;
var endpointAddress = new EndpointAddress(new Uri(net.tcp://mywindowsservice-prod.my.domain.com:8184/MyDataService/SomeData), identity);
Cronan commented 4 years ago

@davidsh, He's provided details of the client in the initial bug description. He also explained the server is a Windows WCF service. Does the specific Windows version matter?

No. But we need a repro. At the very least, snippets of WCF code for the server and client and how the service is configured.

Also, can you please get KRB5_TRACES for both the working scenario (2.1.3) and broken scenario (2.1.13)?

As a workaround can you try setting this environment variable on the Linux client? It will enable the 'legacy' HTTP stack (uses libcurl on Linux). I'm curious if it will help with this problem at least as a workaround for now:

export DOTNET_SYSTEM_NET_HTTP_USESOCKETSHTTPHANDLER=0

Service config:

<system.serviceModel>
<services>
      <service behaviorConfiguration="MyBehaviour" name="MyDataServiceCode.Facades.SomeData">
        <endpoint address="net.tcp://localhost:8184/MyDataService/SomeData"
                  binding="netTcpBinding" bindingConfiguration="UnreliableTCP"
                  name="MyDataServiceSomeData" contract="MyDataServiceCode.Interfaces.ISomeData" />

      </service>
</services>
<behaviours>
<serviceBehaviors>
<behavior name="MyBehaviour">
          <serviceDebug includeExceptionDetailInFaults="true" />
          <serviceThrottling maxConcurrentCalls="500" maxConcurrentInstances="500" maxConcurrentSessions="500" />
          <dataContractSerializer maxItemsInObjectGraph="2147483647" />
          <serviceAuthorization impersonateCallerForAllOperations="true"></serviceAuthorization>
        </behavior>
</serviceBehaviors>
</behaviours>
    <bindings>
      <netTcpBinding>
        <binding name="UnreliableTCP" receiveTimeout="23:59:59" transferMode="Buffered" maxBufferPoolSize="0" maxBufferSize="2147483647"
                         maxConnections="100" maxReceivedMessageSize="2147483647">
          <reliableSession inactivityTimeout="23:59:59" enabled="false" />
        </binding>
      </netTcpBinding>
   </bindings>
</system.serviceModel>
Cronan commented 4 years ago

@Cronan, it might be useful to find the exact version which caused the break. I wasn't going to suggest that as it can be a lot of work and it's possible that the appropriate devs might know which changes may have broken this, but if you have it working in docker, it might not be a lot of work. Especially if you do a binary search through the versions. Something quicker that would likely be useful is to capture a Kerberos trace for when it's working and when it's not working. You can do this by setting the environment variable KRB5_TRACE before you run your code to capture Kerberos tracing. You can do one of the following:

Output log to stdout

export KRB5_TRACE = /dev/stdout

Output log to file

export KRB5_TRACE = ~/krb5_trace.log

I did the binary search and 2.1.12 is the most recent version that works, and 2.1.13 is the version that starts failing.

davidsh commented 4 years ago

So, the first thing that jumps out to me is that in the working version the server principal used is mywindowsservice-prod, which is the alias used in the SPN and URI by the client to connect. In the failing version, the server principal used is the server name, not the alias used as above.

As an experiment, please try adding this to your krb5.conf file in the [libdefaults] section

dns_canonicalize_hostname = false

Normally, the best-practice is to define SPNs that use the FQDN of the server. And that means using the A record DNS name. But it seems that your environment has defined the SPN against the CNAME (alias) DNS record. Setting the above in your krb5.conf file should prevent forward lookups of the hostname.

I understand that this worked before upgrading to 2.1.13. But I suspect other changes we made in that servicing release has now impacted using CNAMEs as SPNs. Please try this out and also send us updated KRB5_TRACE logs.

mconnew commented 4 years ago

@Cronan, you are including the TCP port number your service is running on as part of the SPN name. Is that intentional? Is the port number part of the SPN as set up in your DC?

@davidsh, only a process running as the SYSTEM user or one of it's equivalents such as Network Service are able to use the hostname or FQDN SPN. If you are running with any other user, you normally need to create a different SPN to use.

@davidsh, I could understand the CNAME/FQDN thing when using HttpClient, but this is NegotiateStream. WCF is explicitly proving the targetName from the provided SpnEndpointIdentity. Are you saying that NegotitateStream will use something different than what was provided to the targetName argument?

davidsh commented 4 years ago

Are you saying that NegotitateStream will use something different than what was provided to the targetName argument?

NegotiateStream, HttpClient, and SqlClient all use a common native library, System.Net.Security.Native on Linux. That library does the various GSS-API calls to support Negotiate/Kerberos/NTLM protocol. GSS-API uses the Linux krb5.conf file for various settings.

The transformation of the passed in 'targetName' from NegotiateStream is still present at the API call to the GSS-API. But in the implementation of the GSS-API call, SPNEGO protocol uses the settings in the krb5.conf file and apparently is transforming the 'targetName' and normalizing it (i.e CNAME -> A record) before doing a Kerberos principal lookup. That transformation is not something controllable at the .NET layer. That is why I suggested to try the dns_canonicalize_hostname = false in the krb5.conf file as an experiment to see it is helps. It will help narrow down what is happening in this network environment.

mconnew commented 4 years ago

I found the documentation for this behavior. This seems an odd design decision as the app can do that dance if needed but as you said, nothing that can be done about it. Based on those docs, it looks like this was caused by a change to use GSS_C_NT_HOSTBASED_SERVICE.

davidsh commented 4 years ago

Based on those docs, it looks like this was caused by a change to use GSS_C_NT_HOSTBASED_SERVICE.

We made the change in .NET Core 3.0 and servicing for .NET Core 2.1/2.2 because a lot of scenarios were broken until we start using GSS_C_NT_HOSTBASED_SERVICE, which appears to be the preferred format for the SPNEGO protocol plugin for GSS-API.

But this issue here raises some interesting test areas that we probably want to address in a better way long-term in .NET Core. However, short-term, adjusting krb5.conf might be the best way to unblock things for now.

Cronan commented 4 years ago

@Cronan, you are including the TCP port number your service is running on as part of the SPN name. Is that intentional? Is the port number part of the SPN as set up in your DC?

Yes, the port number is is part of the SPN. We tend to use the same name but different ports for TCP vs REST, for example.

Cronan commented 4 years ago

Based on those docs, it looks like this was caused by a change to use GSS_C_NT_HOSTBASED_SERVICE.

We made the change in .NET Core 3.0 and servicing for .NET Core 2.1/2.2 because a lot of scenarios were broken until we start using GSS_C_NT_HOSTBASED_SERVICE, which appears to be the preferred format for the SPNEGO protocol plugin for GSS-API.

But this issue here raises some interesting test areas that we probably want to address in a better way long-term in .NET Core. However, short-term, adjusting krb5.conf might be the best way to unblock things for now.

I'm definitely going to try this today, but it's obviously not a fix for an entire environment - for example, it would prevent short names (like myservice-prod) from being canonicalised to myservice-prod.mydomain.com, if I understand the option correctly:

dns_canonicalize_hostname Indicate whether name lookups will be used to canonicalize hostnames for use in service principal names. Setting this flag to false can improve security by reducing reliance on DNS, but means that short hostnames will not be canonicalized to fully-qualified hostnames. The default value is true.

It also leaves my systems in a weird position, where most Linux calls that need authentication (e.g. curl <hostname> --anyauth -u : hitting a WCF REST API, also using kerberos) seem totally happy with cnames or a records, but .NET Core is super-strict on requiring a records across the board.

Or am I misunderstanding the situation?

davidsh commented 4 years ago

It also leaves my systems in a weird position, where most Linux calls that need authentication (e.g. curl --anyauth -u : hitting a WCF REST API, also using kerberos) seem totally happy with cnames or a records, but .NET Core is super-strict on requiring a records across the board. Or am I misunderstanding the situation?

Are all your SPNs defined only against FQDN CNAMES? Or do you have a mixture of services defined with SPNs using either A or CNAME?

davidsh commented 4 years ago

Interesting article talking about changes to MIT Kerberos libraries that will allow for trying different behavior (CNAME first, then A if CNAME fails) when trying to find proper SPN.

This new version of MIT Kerberos (Release 1.18) will have a new option for dns_canonicalize_hostname option in the krb5.conf file:

dns_canonicalize_hostname = fallback

See: http://k5wiki.kerberos.org/wiki/Projects/Server_Hostname_Canonicalization

Cronan commented 4 years ago

It also leaves my systems in a weird position, where most Linux calls that need authentication (e.g. curl --anyauth -u : hitting a WCF REST API, also using kerberos) seem totally happy with cnames or a records, but .NET Core is super-strict on requiring a records across the board. Or am I misunderstanding the situation?

Are all your SPNs defined only against FQDN CNAMES? Or do you have a mixture of services defined with SPNs using either A or CNAME?

There are a mixture of services defined with SPNs using either A or CNAME. Either seem to work OK in Windows, some are tricky from Linux.

Cronan commented 4 years ago

Interesting article talking about changes to MIT Kerberos libraries that will allow for trying different behavior (CNAME first, then A if CNAME fails) when trying to find proper SPN.

This new version of MIT Kerberos (Release 1.18) will have a new option for dns_canonicalize_hostname option in the krb5.conf file:

dns_canonicalize_hostname = fallback

See: http://k5wiki.kerberos.org/wiki/Projects/Server_Hostname_Canonicalization

It looks to me like Centos 7 is currently on 1.15

sudo yum list installed | grep kr
krb5-devel.x86_64          1.15.1-37.el7_6  @lgb-7.4-centos-base-updates-2019-07
krb5-libs.x86_64           1.15.1-37.el7_6  @lgb-7.4-centos-base-updates-2019-07
krb5-workstation.x86_64    1.15.1-37.el7_6  @lgb-7.4-centos-base-updates-2019-07

Yes, definitely 1.15.1:

klist -V
Kerberos 5 version 1.15.1

But this looks interesting - do you think this will work with GSS_C_NT_HOSTBASED_SERVICE?

davidsh commented 4 years ago

do you think this will work with GSS_C_NT_HOSTBASED_SERVICE?

I haven't tried it out yet. But those krb5.conf options do affect GSS_C_NT_HOSTBASED_SERVICE name types. So, I do expect it to work with the new dns_canonicalize_hostname = fallback setting.

mconnew commented 4 years ago

@davidsh, it would seem that the switch to using GSS_C_NT_HOSTBASED_SERVICE is a breaking change. Would it be possible to retroactively add an AppSetting to control this on a per-app basis? Requiring a system wide krb5.conf configuration which affects all Kerberos usage (and could break a different scenario when setting things so .NET Core works) seems a little too coarse to control behavior.

davidsh commented 4 years ago

Requiring a system wide krb5.conf configuration which affects all Kerberos usage (and could break a different scenario when setting things so .NET Core works) seems a little too coarse to control behavior.

Apps can use a custom krb5.conf file by setting an environment variable:

http://web.mit.edu/kerberos/krb5-1.13/doc/admin/conf_files/krb5_conf.html

krb5.conf
The krb5.conf file contains Kerberos configuration information, including the locations of KDCs and
admin servers for the Kerberos realms of interest, defaults for the current realm and for Kerberos
applications, and mappings of hostnames onto Kerberos realms. Normally, you should install your
krb5.conf file in the directory /etc. You can override the default location by setting the environment
variable KRB5_CONFIG. Multiple colon-separated filenames may be specified in KRB5_CONFIG; all
files which are present will be read.

@Cronan

There are a mixture of services defined with SPNs using either A or CNAME. Either seem to work OK in Windows, some are tricky from Linux.

Are all your channels/services using the same type of WCF bindings? Are any using HTTP related WCF bindings? I Would like to understand if any of them use .NET Core classes besides System.Net.Security.NegotiateStream.

Cronan commented 4 years ago

All the services I’m currently using are WCF TCP - I’ll try the custom Conf file and report back.

I’ll also investigate and see whether there are HTTP services in use that I can test.

On 19 Oct 2019, at 18:59, David Shulman notifications@github.com wrote:

 Requiring a system wide krb5.conf configuration which affects all Kerberos usage (and could break a different scenario when setting things so .NET Core works) seems a little too coarse to control behavior.

Apps can use a custom krb5.conf file by setting an environment variable:

http://web.mit.edu/kerberos/krb5-1.13/doc/admin/conf_files/krb5_conf.html

krb5.conf The krb5.conf file contains Kerberos configuration information, including the locations of KDCs and admin servers for the Kerberos realms of interest, defaults for the current realm and for Kerberos applications, and mappings of hostnames onto Kerberos realms. Normally, you should install your krb5.conf file in the directory /etc. You can override the default location by setting the environment variable KRB5_CONFIG. Multiple colon-separated filenames may be specified in KRB5_CONFIG; all files which are present will be read. @Cronan

There are a mixture of services defined with SPNs using either A or CNAME. Either seem to work OK in Windows, some are tricky from Linux.

Are all your channels/services using the same type of WCF bindings? Are any using HTTP related WCF bindings? I Would like to understand if any of them use .NET Core class besides System.Net.Security.NegotiateStream.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

davidsh commented 4 years ago

All the services I’m currently using are WCF TCP - I’ll try the custom Conf file and report back.

Have you been able to test using the custom .conf file?

Also, do you have NTLM installed on the Linux machines (gss-ntlmssp)? And what credential are you using in the .NET layer? Is it CredentialCache.DefaultCredentials?

The reason for the questions is that when Negotiate (SPNEGO) protocol is used but has errors with Kerberos, NTLM fallback would be possible assuming you are using explicit credentials and have installed the NTLM package on the Linux machine.

Cronan commented 4 years ago

All the services I’m currently using are WCF TCP - I’ll try the custom Conf file and report back.

Have you been able to test using the custom .conf file?

Yes, this worked! 🎉 Thank you @davidsh and @mconnew

Also, do you have NTLM installed on the Linux machines (gss-ntlmssp)? Yes, I believe so ...

sudo yum list installed | grep ntlm
libntlm.x86_64             1.3-6.el7        @lgb-7.4-centos-base

And what credential are you using in the .NET layer? Is it CredentialCache.DefaultCredentials?

I don't see anything like this call in my WCF client code, where would it be?

The reason for the questions is that when Negotiate (SPNEGO) protocol is used but has errors with Kerberos, NTLM fallback would be possible assuming you are using explicit credentials and have installed the NTLM package on the Linux machine.

Cronan commented 4 years ago

@karelz What more info do you need from me?

karelz commented 4 years ago

@Cronan it was part of our triage, so I assume @davidsh didn't feel like he has all the info to make it actionable yet. There is lots of back and forth on this issue. What is IMO needed is clear repro environment setup that will allow us to reproduce it locally and dig deeper. Do you think you can help us identify the key environment setup knobs to reproduce it?

davidsh commented 4 years ago

@Cronan it was part of our triage, so I assume @davidsh didn't feel like he has all the info to make it actionable yet.

There is lots of back and forth on this issue. What is IMO needed is clear repro environment setup that will allow us to reproduce it locally and dig deeper. Do you think you can help us identify the key environment setup knobs to reproduce it?

Actually, I do understand the root cause here. This issue is caused by a regression in behavior due to a PR fixing related Linux Kerberos issues as described above. See: https://github.com/dotnet/runtime/issues/31110#issuecomment-540230816

I know how to reproduce this issue.

The fix for this issue is complex because it will require some changes to the PAL layer to fix both this issue and preserve the other fixes previously made. I have been thinking about the proper fix. We can also test the fix in the 'runtime-libraries enterprise-linux' DevOps pipline that I previously created.

karelz commented 4 years ago

@davidsh do you think we can address this realistically in 5.0?

davidsh commented 4 years ago

It's a non-trivial fix. The fix has to be done in the Linux PAL layer. It has to be able to try two different forms of Kerberos principal names. I.e. try one form and if it doesn't match to an SPN in Kerberos, then try the alternate format.

Based on current priorities and staffing I don't see this as achievable for 5.0.

Cronan commented 4 years ago

That’s disappointing but understandable. If someone can point me to the relevant place in the code to start looking I’m prepared to try to take this on.