dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Incomplete chain of mutual TLS client cert downloaded twice per http call when run as non-root linux user #81456

Open EklipZgit opened 1 year ago

EklipZgit commented 1 year ago

Description

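The quick overview:

* We internally use SSL certs generated by the Sectigo Organizational CA (not part of the default Linux CA cert bundle, though the parent in the chain is).
* We have dotnet 6.0 web APIs running on Linux containers. This web API is the CLIENT for the purpose of this ticket; when I refer to 'server' I am talking about the target gateway that this API is making a mutual TLS request to. Docker image: FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS base
* We run these APIs on Azure AppService and inject the mutual TLS cert into the container with the private cert feature in Azure. Not particularly important, but of note: the cert includes the cert chain, at least referentially.
* These Sectigo certs have x509v3 Authority Information Access (AIA) data in them.
* We bind the mutual TLS cert to HTTP clients with a standard httpClientBuilder.UseClientCertificate(lstCertThumbprint);
* Because part of the intermediate cert chain is not in the CA store, .NET downloads the chain via the AIA URL in order to send the full chain to the server in the mutual TLS handshake.
* When running as root, this download happens twice on the first HTTP request and never again until the container is rebuilt. The dotnet CA cert store is populated with the downloaded intermediate cert.
* When running as non-root, this download happens twice on every request and the dotnet CA cert store is never populated with the downloaded intermediate cert.
* When the mutual TLS cert is removed from the equation (both client and server), the symptoms go away entirely. This eliminates the server's SSL certificate as the source of the behavior; since it uses the same chain, it was important to verify that the mutual TLS cert was causing this before I filed this issue.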

Reproduction Steps

Hopefully this is reasonably easy to reproduce for someone who is familiar with the X509 parts of the runtime. I don't have time to work out how to reproduce this behavior with a fake CA cert with AIA data generated via OpenSSL.

ASP.NET Core 6 web API in a Linux Docker container. This is the entire Dockerfile, with the exception of a Swagger generator util and unit test run steps, omitted for brevity.

FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS base
WORKDIR /app

#Expose port for container app to listen on port 8080, because non-root user does not have permission to listen on privileged ports (< 1024)
EXPOSE 8080
ENV ASPNETCORE_URLS=http://+:8080

#Add a new group "dotnet" with group id 10001 and new user "dotnet" with user id 10000
# comment this out when testing the "root" scenario
RUN groupadd -g 10001 dotnet \
   && useradd -m -u 10000 -g 10001 dotnet

ENV DOTNET_RUNNING_IN_CONTAINER=true

FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
WORKDIR /src
COPY ["./LoanService.API/LoanService.API.csproj", "./LoanService.API/"]
COPY ["./LoanService.API.DataAccess/LoanService.API.DataAccess.csproj", "./LoanService.API.DataAccess/"]
COPY ["./nuget.config", "."]

# Populates the nuget.config environment variable
ARG Azure_DevOps_Artifact_PAT

# Build
RUN dotnet restore "./LoanService.API/LoanService.API.csproj"
COPY . .
WORKDIR /src
RUN dotnet build "./LoanService.API/LoanService.API.csproj" -c Release -o /app

# Publish
FROM build AS publish
RUN dotnet publish "./LoanService.API/LoanService.API.csproj" -c Release -o /app

FROM base AS final
WORKDIR /app
COPY --from=publish /app .

# Switch to non-root privileged user
# comment this line out when testing the "root" version side by side
USER dotnet:dotnet

ENTRYPOINT [ "/bin/bash", "-c", "dotnet LoanService.API.dll"]

In the web API itself, here are the relevant bits of the csproj for reference:

<Project Sdk="Microsoft.NET.Sdk.Web">

  <PropertyGroup>
    <TargetFramework>net6.0</TargetFramework>
    <GenerateDocumentationFile>true</GenerateDocumentationFile>
    <NoWarn>$(NoWarn);1591</NoWarn>
    <PackageId>LoanService.API</PackageId>
    <Product>LoanService API Service</Product>
    <Description>Loan Servicing REST API service project</Description>
    <AspNetCoreHostingModel>InProcess</AspNetCoreHostingModel>
    <DockerDefaultTargetOS>Linux</DockerDefaultTargetOS>
    <DockerfileContext>.</DockerfileContext>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="AspNetCore.HealthChecks.SqlServer" Version="6.0.2" />
    <PackageReference Include="Atom.Common.Api" Version="2.5.492-alpha" />
    <PackageReference Include="Atom.Common.SoapApiClient" Version="2.5.492-alpha" />
    <PackageReference Include="AutoMapper.Extensions.Microsoft.DependencyInjection" Version="8.1.1" />
    <PackageReference Include="Microsoft.ApplicationInsights.Profiler.AspNetCore" Version="2.4.0" />
    <PackageReference Include="Microsoft.VisualStudio.Azure.Containers.Tools.Targets" Version="1.15.1" />
    <PackageReference Include="Swashbuckle.AspNetCore.Annotations" Version="6.4.0" />
  </ItemGroup>

</Project>

Program.cs and logging setup. This shouldn't really matter, but I'm including it in case it is somehow relevant:


    public class Program
    {
        public static int Main(string[] args)
        {
            AtomHost.ConfigureLogger(args);

            return AtomHost.Run(CreateHostBuilder, args);
        }

        public static IHostBuilder CreateHostBuilder(string[] args) =>
            Host.CreateDefaultBuilder(args)
                .UseAtomLogging()
                .ConfigureWebHostDefaults(webBuilder =>
                {
                    webBuilder.UseStartup<Startup>();
                });
    }

    ...

    public static IHostBuilder UseAtomLogging(this IHostBuilder builder, Action<LoggingMiddleware.LoggingMiddlewareSettings>? configure = null)
    {
        builder.ConfigureServices((builderContext, services) =>
        {
            var configuration = builderContext.Configuration;
            var settingsOpt = services.Configure<LoggingMiddleware.LoggingMiddlewareSettings>(configuration.GetSection("AtomApiLogging"));
            if (configure != null)
                settingsOpt.Configure(configure);

            services.AddScoped<TraceInfoHolder>();

            var aiOptions = new ApplicationInsightsServiceOptions
            {
                EnableAdaptiveSampling = false,
                ConnectionString = configuration.GetValue<string>("ApplicationInsights:ConnectionString")
            };

            if (string.IsNullOrEmpty(aiOptions.ConnectionString))
            {
                aiOptions.InstrumentationKey = configuration.GetValue<string>("ApplicationInsights:InstrumentationKey");
            }

            services.AddHttpContextAccessor();
            services.AddApplicationInsightsTelemetry(aiOptions);

            if (Type.GetType("Azure.Messaging.ServiceBus.ServiceBusClient, Azure.Messaging.ServiceBus") != null)
            {
                services.AddApplicationInsightsTelemetryProcessor<ServiceBusTelemetryFilter>();
            }

            services.AddSensitiveDataRedactor();
            services.AddLoggableRequestHeaders();
        });

        return builder.UseSerilog();
    }

    .... (AtomHost)
    public static int Run(Func<string[], IHostBuilder> createHost, string[] args)
    {
        try
        {
            Log.Information("Starting Host");
            createHost(args).Build().Run();
            return 0;
        }
        catch (Exception ex)
        {
            Log.Fatal(ex, "Host terminated unexpectedly");
            return 1;
        }
        finally
        {
            Log.Information("Host shut down");
            Log.CloseAndFlush();
        }
    }

In its Startup.cs (ConfigureServices), three different ways of setting up HttpClients with client certificates all reproduce the issue:

        ....
        var lstCertThumbprint = Configuration.GetValue<string>("LST:CertThumbPrint");
        var lstCert = string.IsNullOrWhiteSpace(lstCertThumbprint) ? null : CertificateUtil.GetCertificateByThumbprint(lstCertThumbprint);

        var httpClientBuilder = services.AddHttpClient<ILoanServRestService, LoanServRestService>();

        if (!string.IsNullOrWhiteSpace(lstCertThumbprint))
        {
            // This client reproduces the issue
            httpClientBuilder = httpClientBuilder.UseClientCertificate(lstCertThumbprint);
        }
        ....

        services.AddHttpClient(nameof(LoanServHealthCheck))
            .ConfigurePrimaryHttpMessageHandler(() =>
            {
                var httpClientHandler = new HttpClientHandler();
                // This client also reproduces the issue
                httpClientHandler.ClientCertificates.Add(lstCert);
                return httpClientHandler;
            });

        ....

        // This client also reproduces the issue. Since the two out-of-the-box clients above also reproduce it,
        // I won't bother including the implementation of this SOAP client, as it's rather convoluted.
        // I'm leaving this here to point out that it also happens for an HTTP SOAP binding with the
        // mutual TLS cert.
        services.AddLoanServSoapClientFactory<GetAccountsBySSN.PI00WEBSPort, GetAccountsBySSN.PI00WEBSPortClient>(
            "GetAccountBySSN_PI00WEBS",
            "wwsp1000",
            lstCert
        );

We inject the cert into the Azure AppService via the out-of-the-box Azure AppService private cert upload: https://learn.microsoft.com/en-us/azure/app-service/configure-ssl-certificate?tabs=apex%2Cportal

Note that to fully reproduce our scenario, the top-level CA in the cert's chain should be in the Linux container's CA store, but the mid-level intermediate in the chain should NOT be, and the cert must include AIA information in its x509v3 extensions from which the DER of the intermediate cert can be downloaded.
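
One way to confirm that a given certificate carries the AIA data described above is to dump it with openssl and pull out the CA Issuers URI. This is only a sketch: `aia_ca_issuer_urls` is a hypothetical helper name, and it assumes the usual OpenSSL text layout (`CA Issuers - URI:...`).

```shell
# Hypothetical helper: reads `openssl x509 -noout -text` output on stdin and
# prints every "CA Issuers" URI found in the Authority Information Access block.
aia_ca_issuer_urls() {
    sed -n 's/^[[:space:]]*CA Issuers - URI://p'
}
```

Usage would look like `openssl x509 -in client-cert.pem -noout -text | aia_ca_issuer_urls`, where `client-cert.pem` is a placeholder for the public part of the mutual TLS cert. For the certificate in this issue, the Sectigo URL mentioned below is what one would expect to see.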

Then just use some of those HTTP clients to make some mutual TLS calls. If you monitor traffic from the container via AppInsights, you'll see the dependency calls going out to the AIA target URL. I assume you could also capture the traffic with Fiddler or a similar tool.

Expected behavior

When running as non-root, dotnet should download the missing cert chain via AIA and save it in the local dotnet /ca/ store just like it does when running as root, and should not continue to redownload the chain for each subsequent request.

Actual behavior

After changing the user our dotnet API runs under from root to non-root, we noticed a massive uptick (from basically zero to >3 million per day) in dependency calls to http://crt.sectigo.com/SectigoRSAOrganizationValidationSecureServerCA.crt when making mutual TLS calls to our internal gateway (which has a server SSL cert with the same chain as the mutual TLS cert we send). Every single outbound HTTPS call to our gateway would make two GET requests to this Sectigo URL. We had absolutely no references to that URL ourselves, so we began digging and found that it comes from the AIA section of the Sectigo certificate, and from there learned how AIA is used to download incomplete chains.
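
To put the numbers above in perspective, here is a back-of-the-envelope calculation. It takes the reported lower bound of 3 million AIA fetches per day and the observed two fetches per gateway call; `aia_load` is a hypothetical helper name used only for this illustration.

```shell
# Convert a daily AIA fetch count into the implied number of outbound
# gateway calls and an average fetch rate per second (integer division).
aia_load() {
    fetches_per_day=$1
    fetches_per_call=$2
    echo "gateway calls/day: $((fetches_per_day / fetches_per_call))"
    echo "AIA fetches/sec (avg, rounded down): $((fetches_per_day / 86400))"
}
```

`aia_load 3000000 2` prints 1500000 gateway calls per day and an average of 34 AIA fetches per second, all of which would disappear if the intermediate were cached after the first download.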

We converted the service back to running as root and found that it made the download request twice on the first HTTP call and then never again for the life of the container (still, why twice? Two chain paths?). After digging through dotnet runtime issues and various other places, we found that on Linux dotnet is supposed to store the just-in-time AIA chains under .dotnet/corefx/cryptography/x509stores/ in the user's home directory. I added some debug endpoints to dump the internals of the container, as we don't have SSH enabled for security reasons: (screenshot of the directory listing; note /root/ when running as root, otherwise the user's home when running as non-root). There you can see that we have two revocations cached in the /crls/ folder and 4 pfx's cached in the /ca/ folder.
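
For anyone who can get a shell (or a debug endpoint that shells out) inside the container, the same inspection can be sketched as below. `count_store_entries` is a hypothetical helper; the directories are passed in as arguments because the exact location of the crls folder is an assumption here and should be adjusted to whatever the container actually shows.

```shell
# Print how many regular files sit directly inside each directory given as
# an argument. Directories that do not exist are reported as "missing".
count_store_entries() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            n=$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' ')
            echo "$dir: $n file(s)"
        else
            echo "$dir: missing"
        fi
    done
}
```

A call such as `count_store_entries "$HOME/.dotnet/corefx/cryptography/x509stores/ca" "$HOME/.dotnet/corefx/cryptography/crls"` should, going by the behavior described in this issue, show a non-empty ca directory as root and an empty one as non-root.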

When running as non-root, dotnet still creates this directory structure, including the /ca/ folder (under the non-root user's home dir), and populates the two /crls/ entries (so clearly it has write access to this directory structure). However, the /ca/ folder remains empty. ETW logs show this entry: ExceptionMessage="The owner of '/home/dotnet/.dotnet/corefx/cryptography/x509stores/ca' is not the current user"
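
The exception message suggests an ownership comparison between the store directory and the current user. The sketch below checks that same condition from outside the runtime. It is not the runtime's actual logic, and `dir_owner_matches_current_user` is a hypothetical helper name.

```shell
# Compare the numeric owner of a directory with the effective UID of the
# current process. Prints the result; returns 0 on match, 1 on mismatch,
# 2 if the directory does not exist.
dir_owner_matches_current_user() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 2
    fi
    # `ls -dn` lists the directory itself with numeric IDs; field 3 is the owner UID.
    owner_uid=$(ls -dn "$dir" | awk '{print $3}')
    current_uid=$(id -u)
    if [ "$owner_uid" = "$current_uid" ]; then
        echo "match (uid $current_uid)"
    else
        echo "mismatch (owner uid $owner_uid, current uid $current_uid)"
        return 1
    fi
}
```

Running it against /home/dotnet/.dotnet/corefx/cryptography/x509stores/ca as the dotnet user would show whether the directory is really owned by a different UID than the one the process runs as, which is what the message above implies.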

Since both the gateway's SSL cert and the client mutual TLS cert use the same offending chain, we eliminated the mutual TLS cert from the equation by ceasing to send it and removing the requirement for it from the gateway. The Sectigo calls vanished entirely, so we can say for certain that it is the mutual TLS client cert that triggers this behavior (and not the target server's SSL cert).

Presumably something is causing dotnet to be unable (or to refuse) to cache the missing parts of the cert chain that it wants to send to the gateway as part of the mutual TLS handshake, so it redownloads them (twice..?) from the AIA information on every request when running as non-root.

Once I understood the problem domain well enough, I was able to work around the issue by explicitly installing the .crt version of http://crt.sectigo.com/SectigoRSAOrganizationValidationSecureServerCA.crt (the file at that URL is actually a DER, not a CRT. That is not relevant to this bug, as .NET doesn't care about the extension when parsing, but I mention it in case someone looks at it and tries to use or test it as a CRT. The equivalent CRT can be obtained from here: https://support.sectigo.com/articles/Knowledge/Sectigo-Intermediate-Certificates ). With the intermediate CA cert explicitly imported into the Linux CA store, we now get 0 calls to Sectigo (vs. 2 calls on the first HTTP call when running as root, and 2 calls per HTTP request when running as non-root). I'm assuming the reason many people haven't run into this before is that either A) they're running as root, or B) the mutual TLS certs they're using have the full chain on the box (or have the full chain self-contained in the cert they use? Not sure if that's a normal thing).

So while I have worked around the issue, I couldn't find any documentation of others running into the same thing. The closest relative seems to be https://github.com/dotnet/runtime/issues/29653, which I learned a lot of useful info from. At a glance, I'm guessing the problem is around here https://github.com/dotnet/runtime/blob/983c8f239a98812498d874ed36b7001bd764fdfe/src/libraries/System.Security.Cryptography/src/System/Security/Cryptography/X509Certificates/OpenSslDirectoryBasedStoreProvider.cs or here https://github.com/dotnet/runtime/blob/af263f7a2b0a309b5ac79ad92f4f7217da906b78/src/libraries/System.Security.Cryptography/src/System/Security/Cryptography/X509Certificates/OpenSslX509ChainProcessor.cs#L268 but I'm far out of my comfort zone at this point, so this is pure speculation.

Regression?

Don't know.

Known Workarounds

Install the intermediate certs into the container with Docker:

COPY ["./LoanService.API/Resources/SectigoRSAOrganizationValidationSecureServerCA.crt", "/usr/local/share/ca-certificates/SectigoRSAOrganizationValidationSecureServerCA.crt"]

RUN chmod 644 /usr/local/share/ca-certificates/SectigoRSAOrganizationValidationSecureServerCA.crt \
   && update-ca-certificates

so that dotnet doesn't need to try to download them from AIA.
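
Because the file served from the AIA URL is reported above to be DER, while update-ca-certificates works with PEM files carrying a .crt extension, it can help to check the encoding before copying the file into the image. The sketch below does a simple header check. `cert_encoding` is a hypothetical helper, and "not-pem" only means the PEM armor line was not found.

```shell
# Report whether a certificate file looks PEM-encoded by checking for the
# PEM armor line. Anything without that line is reported as "not-pem".
cert_encoding() {
    if grep -q -e '-----BEGIN CERTIFICATE-----' "$1"; then
        echo pem
    else
        echo not-pem
    fi
}
```

If the file turns out not to be PEM, a conversion along the lines of `openssl x509 -inform DER -in downloaded.crt -outform PEM -out SectigoRSAOrganizationValidationSecureServerCA.crt` (with `downloaded.crt` as a placeholder name) should produce a file suitable for the COPY step above.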

Configuration

.NET 6

Debian 11, per the 6.0 tag here: https://hub.docker.com/_/microsoft-dotnet-aspnet#:~:text=6.0.13%2Dbullseye%2Dslim%2Damd64%2C%206.0%2Dbullseye%2Dslim%2Damd64%2C%206.0.13%2Dbullseye%2Dslim%2C%206.0%2Dbullseye%2Dslim%2C%206.0.13%2C%206.0

This is unrelated to the port 8080 change: running as root on port 8080 does not reproduce the issue.

Other information

No response

dotnet-issue-labeler[bot] commented 1 year ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/ncl, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details
Author: EklipZgit
Assignees: -
Labels: `area-System.Net.Security`, `untriaged`
Milestone: -

wfurt commented 1 year ago

The behavior you're describing comes from X509Chain. AFAIK it would cache the certificates as long as the user's home is writable. If you have a simple app, you can try it with strace, @EklipZgit. You can probably reproduce it even without doing any SSL. @bartonjs is the expert on this.
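
A possible shape for the strace experiment suggested above: record file-related syscalls for the process and keep only the lines that touch the per-user store path seen in this issue. `filter_store_syscalls` is a hypothetical helper, and strace may need to be installed in the image and may require extra container privileges.

```shell
# Keep only the strace lines that reference the per-user .NET cryptography
# directory, so failures around the certificate store stand out.
filter_store_syscalls() {
    grep -E '\.dotnet/corefx/cryptography/'
}
```

For example, `strace -f -e trace=file -o /tmp/dotnet.strace dotnet LoanService.API.dll` followed by `filter_store_syscalls < /tmp/dotnet.strace`, where the output path is a placeholder. Comparing the filtered output between the root and non-root runs should show where the two diverge.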

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-security, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details
### Description The quick overview: * We internally use SSL certs generated by Sectigo Organizational CA (not part of default linux ca cert bundle, though the parent in the chain is). * We have dotnet 6.0 web apis running on Linux containers. This web api is the CLIENT for the purpose of this ticket, when I refer to 'server' I am talking about the target gateway that this api is making a mutual TLS request to. Docker image `FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS base` * We run these APIs on Azure AppService and inject the mutual TLS cert into the container with the private cert feature in Azure. Not particularly important but of note is that the cert includes the cert chain, at least referentially. * These Sectigo certs have x509v3 Authority Information Access data in them. * We bind the mutual TLS cert to http clients with a standard `httpClientBuilder.UseClientCertificate(lstCertThumbprint);` * Because part of the intermediate cert chain is not in the CA store, .NET downloads the chain via the AIA url in order to send the full chain to the server in the mutual TLS handshake. * When running as root user, this download happens twice on the first http request and never again until the container is rebuilt. The dotnet ca cert store is populated with the downloaded intermediate cert. * When running as non-root, this download happens twice on every request and the dotnet ca cert store is never populated with the downloaded intermediate cert. * When the mutual TLS cert is removed from the equation (both client and server), the symptoms go away entirely (eliminating the servers SSL certificate as the source of the behavior, since it also used the same chain it was important to verify that the Mutual TLS cert was causing this behavior before I filed this issue). ### Reproduction Steps Hopefully this is reasonably easy for someone to reproduce who is familiar with the X509 parts of the runtime. 
I don't have time to sort out how to try to reproduce this behavior with a fake CA cert with AIA data generated via openSSL so hopefully this is easy to reproduce.

asp net 6 web api. linux docker container. This is the entire docker file with the exception of a swagger generator util and unit test run steps for brevity.

```
FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS base
WORKDIR /app

#Expose port for container app to listen on port 8080, because non-root user does not have permission to listen on privileged ports (< 1024)
EXPOSE 8080
ENV ASPNETCORE_URLS=http://+:8080

#Add a new group "dotnet" with group id 10001 and new user "dotnet" with user id 10000
# comment this out when testing the "root" scenario
RUN groupadd -g 10001 dotnet \
   && useradd -m -u 10000 -g 10001 dotnet

ENV DOTNET_RUNNING_IN_CONTAINER=true

FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
WORKDIR /src
COPY ["./LoanService.API/LoanService.API.csproj", "./LoanService.API/"]
COPY ["./LoanService.API.DataAccess/LoanService.API.DataAccess.csproj", "./LoanService.API.DataAccess/"]
COPY ["./nuget.config", "."]

# Populates the nuget.config environment variable
ARG Azure_DevOps_Artifact_PAT

# Build
RUN dotnet restore "./LoanService.API/LoanService.API.csproj"
COPY . .
WORKDIR /src
RUN dotnet build "./LoanService.API/LoanService.API.csproj" -c Release -o /app

# Publish
FROM build AS publish
RUN dotnet publish "./LoanService.API/LoanService.API.csproj" -c Release -o /app

FROM base AS final
WORKDIR /app
COPY --from=publish /app .

# Switch to non-root privileged user
# comment this line out when testing the "root" version side by side
USER dotnet:dotnet

ENTRYPOINT [ "/bin/bash", "-c", "dotnet LoanService.API.dll"]
```

In the web api itself, here's the relevant bits of the csproj for reference:

```
net6.0 true $(NoWarn);1591 LoanService.API LoanService API Service Loan Servicing REST API service project InProcess Linux .
```

Program.cs logging, shouldn't really matter but including just in case somehow relevant:

```
public class Program
{
    public static int Main(string[] args)
    {
        AtomHost.ConfigureLogger(args);
        return AtomHost.Run(CreateHostBuilder, args);
    }

    public static IHostBuilder CreateHostBuilder(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .UseAtomLogging()
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder.UseStartup();
            });
}

...

public static IHostBuilder UseAtomLogging(this IHostBuilder builder, Action? configure = null)
{
    builder.ConfigureServices((builderContext, services) =>
    {
        var configuration = builderContext.Configuration;
        var settingsOpt = services.Configure(configuration.GetSection("AtomApiLogging"));
        if (configure != null)
            settingsOpt.Configure(configure);

        services.AddScoped();

        var aiOptions = new ApplicationInsightsServiceOptions
        {
            EnableAdaptiveSampling = false,
            ConnectionString = configuration.GetValue("ApplicationInsights:ConnectionString")
        };
        if (string.IsNullOrEmpty(aiOptions.ConnectionString))
        {
            aiOptions.InstrumentationKey = configuration.GetValue("ApplicationInsights:InstrumentationKey");
        }

        services.AddHttpContextAccessor();
        services.AddApplicationInsightsTelemetry(aiOptions);
        if (Type.GetType("Azure.Messaging.ServiceBus.ServiceBusClient, Azure.Messaging.ServiceBus") != null)
        {
            services.AddApplicationInsightsTelemetryProcessor();
        }
        services.AddSensitiveDataRedactor();
        services.AddLoggableRequestHeaders();
    });

    return builder.UseSerilog();
}

.... (AtomHost)

public static int Run(Func createHost, string[] args)
{
    try
    {
        Log.Information("Starting Host");
        createHost(args).Build().Run();
        return 0;
    }
    catch (Exception ex)
    {
        Log.Fatal(ex, "Host terminated unexpectedly");
        return 1;
    }
    finally
    {
        Log.Information("Host shut down");
        Log.CloseAndFlush();
    }
}
```

And in its Startup.cs, 3 different methods of setting up HttpClients with client certificates all repro the issue:

ConfigureServices

```
....
var lstCertThumbprint = Configuration.GetValue("LST:CertThumbPrint");
var lstCert = string.IsNullOrWhiteSpace(lstCertThumbprint)
    ? null
    : CertificateUtil.GetCertificateByThumbprint(lstCertThumbprint);

var httpClientBuilder = services.AddHttpClient();
if (!string.IsNullOrWhiteSpace(lstCertThumbprint))
{
    // This client reproduces the issue
    httpClientBuilder = httpClientBuilder.UseClientCertificate(lstCertThumbprint);
}
....
services.AddHttpClient(nameof(LoanServHealthCheck))
    .ConfigurePrimaryHttpMessageHandler(() =>
    {
        var httpClientHandler = new HttpClientHandler();
        // This client also reproduces the issue
        httpClientHandler.ClientCertificates.Add(lstCert);
        return httpClientHandler;
    });
....
// This client also reproduces the issue. Since the two out of the box above also reproduce it,
// I wont bother including the implementation of this soap client as its rather convoluted,
// but leaving this here to point out that it also happens for an Http soap binding with the
// mutual TLS cert as well.
services.AddLoanServSoapClientFactory(
    "GetAccountBySSN_PI00WEBS",
    "wwsp1000",
    lstCert
);
```

We inject the cert into the Azure AppService via the out of the box Azure AppService private cert upload https://learn.microsoft.com/en-us/azure/app-service/configure-ssl-certificate?tabs=apex%2Cportal

Note that the top level CA in the chain of the cert should be in the linux containers ca store, however the mid-level intermediate in the chain should NOT be, and the cert must include AIA information in its x509v3 chain with which to download the DER of the intermediate cert in order to fully reproduce our scenario.

Then just use some of those http clients to make some mutual TLS calls. If you monitor traffic from the container via AppInsights, you'll see the dependency calls going out to the AIA target url. I assume you could also capture traffic with Fiddler or something instead.
### Expected behavior

When running as non-root, dotnet should download the missing cert chain via AIA and save it in the local dotnet /ca/ store just like it does when running as root, and should not continue to redownload the chain for each subsequent request.

### Actual behavior

After changing the user our dotnet api was running under from root to non-root, we noticed a massive uptick (from basically zero to >3million per day) in dependency calls to http://crt.sectigo.com/SectigoRSAOrganizationValidationSecureServerCA.crt when making a mutual TLS call to our internal gateway (which has a server SSL cert with the same chain as the mutual TLS cert we send). Every single outbound https call to our gateway would make two GET requests to this sectigo URL. We had absolutely no references to that url ourselves so we began digging and found that that URL comes from the Sectigo certificates AIA section, and from there learned about how AIA is used to download incomplete chains.

We converted the service back to running as root user, and found that it made the download request twice on first http call and then never makes it again for the life of the container (still why twice? two chain paths?).

After digging through dotnet runtime issues and various other places we found that on linux the place where dotnet is supposed to be storing the just-in-time AIA chains is here and I added some debug endpoints to dump the internals of the container because our security team wont let us turn on SSH for some reason:

![image](https://user-images.githubusercontent.com/7151703/215932380-d49b5e61-115a-4ed9-9687-3222260e1884.png)

(note /root/ when running as root, otherwise the users home when running as non-root). Here you can see that we have two revocations cached in the /crls/ folder and 4 pfx's cached in the /ca/ folder.

When running as non-root, dotnet still creates this directory structure including the /ca/ folder (under the non-root users home dir), and populates the two /crls/ entries (so clearly it has write access to this directory structure). However, the /ca/ folder remains empty.

Etw logs seem to show this entry: `ExceptionMessage="The owner of '/home/dotnet/.dotnet/corefx/cryptography/x509stores/ca' is not the current user"`

Since both the gateways SSL cert and the client mutual TLS cert use the same offending chain, we eliminated the mutual TLS cert from the equation by ceasing to send it and removing the requirement for it from the gateway, and the sectigo calls vanished entirely so we can say for certain that it is the mutual TLS client cert that triggers this behavior (and not the target servers SSL cert).

Presumably something is going on that's causing dotnet to be unable / refuse to cache these missing parts of the cert chain that dotnet wants to send to the gateway as part of the mutual TLS handshake and is redownloading them (twice..?) from the AIA information in every request when running as non-root.

Once I understood the problem domain enough I was able to work around the issue by explicitly installing the .crt version of http://crt.sectigo.com/SectigoRSAOrganizationValidationSecureServerCA.crt (which is actually a DER, not crt, not relevant to this bug as .NET doesn't care about the extension when parsing but mentioning in case someone looks at it and ends up trying to use / test it as a CRT when it is in fact a DER. The equivalent CRT can actually be obtained from here https://support.sectigo.com/articles/Knowledge/Sectigo-Intermediate-Certificates ).

With the intermediate ca cert explicitly imported into the linux ca store, we now get 0 calls to sectigo (vs 2 calls on first http call when running as root, and vs 2 calls per one http request when running as non-root).

So now notably I'm assuming the reason many people haven't run into this before is likely because A) they're running as root, or B) the mutual TLS certs they're using have the full chain on the box (or have the full chain self contained in the cert they use? Not sure if that's a normal thing).

So while I have worked around the issue, I couldn't find any documentation of others running into the same thing, closest relative seems to be this, https://github.com/dotnet/runtime/issues/29653 which I learned a lot of useful info from.

At a glance I'm guessing the problem is around here https://github.com/dotnet/runtime/blob/983c8f239a98812498d874ed36b7001bd764fdfe/src/libraries/System.Security.Cryptography/src/System/Security/Cryptography/X509Certificates/OpenSslDirectoryBasedStoreProvider.cs or here https://github.com/dotnet/runtime/blob/af263f7a2b0a309b5ac79ad92f4f7217da906b78/src/libraries/System.Security.Cryptography/src/System/Security/Cryptography/X509Certificates/OpenSslX509ChainProcessor.cs#L268 but I'm far out of my comfort zone at this point so it's just pure speculation.

### Regression?

Dont know.

### Known Workarounds

Install the intermediate certs into the container with Docker

```
COPY ["./LoanService.API/Resources/SectigoRSAOrganizationValidationSecureServerCA.crt", "/usr/local/share/ca-certificates/SectigoRSAOrganizationValidationSecureServerCA.crt"]
RUN chmod 644 /usr/local/share/ca-certificates/SectigoRSAOrganizationValidationSecureServerCA.crt \
    && update-ca-certificates
```

so that dotnet doesn't need to try to download them from AIA.

### Configuration

.NET 6
Debian 11, per the 6.0 tag here: https://hub.docker.com/_/microsoft-dotnet-aspnet#:~:text=6.0.13%2Dbullseye%2Dslim%2Damd64%2C%206.0%2Dbullseye%2Dslim%2Damd64%2C%206.0.13%2Dbullseye%2Dslim%2C%206.0%2Dbullseye%2Dslim%2C%206.0.13%2C%206.0

unrelated to port 8080 change, running as root port 8080 does not reproduce the issue.

### Other information

_No response_
Author: EklipZgit
Assignees: -
Labels: `area-System.Security`, `untriaged`
Milestone: -
EklipZgit commented 1 year ago

Etw logs seem to show this entry: ExceptionMessage="The owner of '/home/dotnet/.dotnet/corefx/cryptography/x509stores/ca' is not the current user"

Which is odd because dotnet clearly created this directory and wrote stuff to the directory next to it. There's nothing special in our containers, and the entirety of omitted parts of my dockerfile are in the build/publish image steps, not the base/final images, so nothing dotnet gets run in that image besides entry point.

Took me a bit, but I got it running with strace. These were the changes I made (screenshot): stop installing the workaround intermediate CA cert, install strace as root, and go back to running as root (strace won't run under the dotnet user) while telling strace to run the command under the dotnet user instead. I verified that the problem still reproduced during this time, so the strace run-as-user worked correctly (screenshot).

Strace log file: strace.log

For what it's worth, I don't see anything in there that mentions part of that directory structure, but I don't really know what you're looking for in here anyway.

wfurt commented 1 year ago

Can you check what UID the dotnet user has in the container, and check directory ownership all the way up to home? Hopefully @bartonjs will chime in with some more insight. I don't know if or why the X509Chain would care, but I've seen something like this with OpenSSH.

EklipZgit commented 1 year ago

How would I check directory ownership all the way to home? Is there something specific you'd like run? Note that I don't have SSH access to the container in Azure and I've not bothered to set up a local repro since the cert is injected via Azure (although I now have the raw binary of what Azure injects, so I could in theory probably set up a full local repro, just haven't bothered).

If there's some dotnet way to grab the info you need, I already have a debug endpoint to recursively walk the file system and write stuff back out (but note, that's running as the dotnet user). That's probably easier for me than any shell command plus setting up the container repro locally, if that works. If a shell command as root is the only option though, I can set that up. I just need to know what you're interested in (as a non-Linux user I have no idea what you're looking for).

wfurt commented 1 year ago

I was thinking about good old `ls -al` but that won't work without SSH ;( Let's wait for @bartonjs to avoid unnecessary work.
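For anyone who does get a shell (or can run a command from a debug endpoint), checking ownership "all the way to home" could look roughly like the sketch below. It assumes GNU `stat` (as shipped in the Debian-based image); the function name `list_ancestors` and the `STORE_DIR` variable are made up for illustration, and the default path is the one from the ETW message.

```shell
#!/bin/sh
# Print owner, group, numeric UID, and mode for a directory and each ancestor.
# list_ancestors and STORE_DIR are hypothetical names used only in this sketch.
list_ancestors() {
    dir=$1
    while :; do
        # %U:%G = owner/group names, %u = numeric UID, %a = octal mode, %n = path
        stat -c '%U:%G uid=%u mode=%a %n' "$dir"
        # Stop at the filesystem root (or at "." if a relative path was given)
        case "$dir" in
            /|.) break ;;
        esac
        dir=$(dirname "$dir")
    done
}

STORE_DIR=${STORE_DIR:-"$HOME/.dotnet/corefx/cryptography/x509stores/ca"}
if [ -d "$STORE_DIR" ]; then
    list_ancestors "$STORE_DIR"
fi
```

Any line whose `uid=` differs from the output of `id -u` is a directory the current user does not own.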

bartonjs commented 1 year ago

https://github.com/dotnet/runtime/blob/983c8f239a98812498d874ed36b7001bd764fdfe/src/libraries/System.Security.Cryptography/src/System/Security/Cryptography/X509Certificates/OpenSslDirectoryBasedStoreProvider.cs#L350-L381

If you can run the Docker image locally, you should be able to just do something like `docker exec -it <containerid> /bin/bash` against the already running container to have a shell to snoop around inside. `ls -al /home/dotnet/.dotnet/corefx/cryptography/x509stores/` will presumably show the directory owner to be something other than your "dotnet" user. (If that shows as "dotnet", then `geteuid()` said you were being someone else)
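The comparison described here (directory owner versus effective user) can be approximated from a shell. This is only a sketch of the same idea, not the runtime's actual check: it assumes GNU `stat`, and `owner_matches_euid` is a hypothetical helper name.

```shell
#!/bin/sh
# Compare the effective UID of this shell with the numeric owner of a directory.
# owner_matches_euid is a hypothetical helper for illustration only.
owner_matches_euid() {
    dir=$1
    owner=$(stat -c '%u' "$dir") || return 2   # numeric UID of the directory owner
    euid=$(id -u)                              # effective UID of the caller
    if [ "$owner" = "$euid" ]; then
        echo "match: $dir is owned by uid $euid"
    else
        echo "mismatch: $dir is owned by uid $owner, process runs as uid $euid"
        return 1
    fi
}

# Example call inside the container:
#   owner_matches_euid /home/dotnet/.dotnet/corefx/cryptography/x509stores/ca
```

A "mismatch" result for the `ca` directory would be consistent with the "owner ... is not the current user" message in the ETW log.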

As for why CRLs would be cached but certs not... certs in the cert store might have associated private keys, so the cert store treats everything as confidential. CRLs don't really have user secrets in them, so we don't use as stringent of checks before allowing them to be written to disk.

EklipZgit commented 1 year ago

So we're trying to reproduce in a local container, and it actually works fine so the assumption is that something that Azure AppService / Kudu modifies the container with is what is causing the issue. The way I set up strace above still doesn't include any of the stuff we're looking for, even locally, so I must just be missing some strace params or something (any suggestions, if either of you still care?)

I got an endpoint in the appservice running ls commands (as dotnet user).

```
total 0
drwxrwxrwx 2 nobody nogroup 0 Nov 22 22:20 .
drwxrwxrwx 2 nobody nogroup 0 Nov 22 22:20 ..
drwxrwxrwx 2 nobody nogroup 0 Nov 22 22:20 ca
```

Is what we have for ls -al /home/dotnet/.dotnet/corefx/cryptography/x509stores/ so indeed, the owner is not what is expected.

Azure AppServices must do something to map the filesystem out for Kudu to read outside the container but I'm not sure what the implications of that are on the access. Given nobody nogroup above I imagine this might be a more likely culprit https://learn.microsoft.com/en-us/archive/blogs/waws/things-you-should-know-web-apps-and-linux#you-cannot-change-permissions-on-the-home-directory-when-persisting-storageapplies-to-web-app-for-containers which I'm not 100% sure applies to our appservice deployment -- still need to poke around and see how this is set up -- but the section right above that says the default is true. So by default in Azure AppServices hosting Linux containers, you have no control over the permissions in the /home directory. Maybe /home isn't a safe place for this cache to be stored if intending to play nice with Azure AppServices? I made sure we were following the standard guidelines for running aspnet webapi on linux as non-root and saw no mention of this as a pitfall.

From a behavior perspective, perhaps private key-less ca certs (which anything from AIA would be?) shouldn't be subject to the same write-checks that other forms of cert writing are? Different store, or something? I guess, regardless of whatever weird thing Azure AppServices is doing to this container, the behavior we're getting still seems a bit questionable. Even a terminating error would probably have been preferable vs discovering this when Sectigo goes down and every request we make suddenly skyrockets to 30 seconds of timeouts trying to redownload the cert until they come back up (2 cert downloads per call for whatever reason, 15 second timeout on each is the behavior we saw).

I'll keep poking around and try to sort out specifically what Azure AppServices is doing that changes these permissions.

wfurt commented 1 year ago

Is there anything else in that directory? That plus the timestamp may hint at what is creating it. Running as nobody is not that unusual, but HOME must presumably already be set for dotnet. You can try to remove the directory as the last step of the image creation. That would give a hint as to whether this happens while building the image or on first run.

I also often set

```
DOTNET_CLI_TELEMETRY_OPTOUT=1
DOTNET_SKIP_FIRST_TIME_EXPERIENCE=1
```

to avoid interference with base function.

BTW when fiddling with strace, make sure you pass the `-f` flag. Based on the discussion this may not find any more info, but it is a great tool for cases like this.
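A sketch of what such an invocation might look like, combining `-f` with the run-as-user approach mentioned earlier in the thread (`-u`). Restricting output to path-based syscalls with `-e trace=file` is an extra assumption here to keep the log small, and the helper `build_strace_cmd`, the log path, and the argument order are illustrative only. The helper just prints the command so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Print an strace command line that:
#   -f             follows child processes and threads
#   -e trace=file  records only syscalls that take a file name argument
#   -u USER        runs the traced command as USER (strace itself starts as root)
#   -o LOG         writes the trace to LOG
# build_strace_cmd is a hypothetical helper; user, log path, and DLL are placeholders.
build_strace_cmd() {
    user=$1
    log=$2
    shift 2
    printf 'strace -f -e trace=file -u %s -o %s' "$user" "$log"
    printf ' %s' "$@"
    printf '\n'
}

build_strace_cmd dotnet /tmp/strace.log dotnet LoanService.API.dll
# prints: strace -f -e trace=file -u dotnet -o /tmp/strace.log dotnet LoanService.API.dll
```

Searching the resulting log for `x509stores` should then show whichever path-based calls touched the store directory, if any were captured.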

EklipZgit commented 1 year ago

I just edited the above comment (had a bad SO link originally), but check out https://learn.microsoft.com/en-us/archive/blogs/waws/things-you-should-know-web-apps-and-linux#you-cannot-change-permissions-on-the-home-directory-when-persisting-storageapplies-to-web-app-for-containers this section and the section right above it.

I tried playing with this directory during image creation and it certainly didn't exist during the docker build, but given the above info it looks like by default it just persists between container deployments and always has 777 permissions, even if you try to change them. The runtime and Azure AppServices appear to have conflicting assumptions about what can be done with the /home directory.
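One way to confirm the "always 777, even if you try to change it" behavior from inside the container is to attempt a `chmod` and read the mode back. The sketch below assumes GNU `stat`; `chmod_sticks` is a hypothetical helper name, and because it really does change the mode when the filesystem allows it, it is best pointed at a scratch directory.

```shell
#!/bin/sh
# Try to tighten a directory's mode to 700 and report whether it took effect.
# chmod_sticks is a hypothetical helper; it modifies the directory it is given.
chmod_sticks() {
    dir=$1
    before=$(stat -c '%a' "$dir") || return 2   # octal mode before the attempt
    chmod 700 "$dir" 2>/dev/null                # may be ignored or rejected by the mount
    after=$(stat -c '%a' "$dir")                # octal mode after the attempt
    echo "before=$before after=$after"
    [ "$after" = "700" ]                        # exit status 0 only if the change stuck
}

# Example call on a scratch directory under the persisted /home mount:
#   mkdir -p /home/dotnet/permtest && chmod_sticks /home/dotnet/permtest
```

On an ordinary local filesystem the mode changes to 700; if `after` still reads 777 under `/home`, the mount is overriding permissions as the linked article describes.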

richlander commented 1 year ago

FYI: Those ENVs should have no effect since they are SDK oriented and nothing in the SDK should be affecting the final ASP.NET images.

I'll ask someone in app service for help. Thanks for the report! This is quite interesting.