Azure / service-fabric-mesh-preview

Service Fabric Mesh is the Service Fabric's serverless offering to enable developers to deploy containerized applications without managing infrastructure. Service Fabric Mesh , aka project “SeaBreeze” is currently available in private preview. This repository will be used for tracking bugs/feature requests as GitHub issues and for maintaining the latest documentation.
MIT License
83 stars 10 forks source link

Frequent timeouts/failures when deploying locally #355

Closed kentcb closed 5 years ago

kentcb commented 5 years ago

I'm getting frequent local deployment problems, solved by resetting my cluster (which takes quite some time). Here is an example build log (detailed).

I'm using the latest SDK (3.3.644). This is running in a Windows 10 VM on Azure using latest VS2017 bits.

Also, I did check the SF explorer after and it didn't appear to tell me anything useful regarding the problem.

kentcb commented 5 years ago

Literally every time I try and deploy right now. This is murdering my productivity.

kentcb commented 5 years ago

Digging around event viewer, I see a slew of errors from the deployment, including all these.

Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          3/13/2019 5:17:52 PM
Event ID:      58368
Task Category: FabricDeployer
Level:         Error
Keywords:      Default
User:          Development\kent
Computer:      Development
Description:
Failed to execute command dockerd -H localhost:2375 -H npipe:// --pidfile=C:\SfDevCluster\Data\_sf_docker_pid\636880942720384145_sfdocker.pid from working dir C:\Program Files\Docker wait for exit False exception System.ComponentModel.Win32Exception (0x80004005): The system cannot find the file specified
   at System.Diagnostics.Process.StartWithCreateProcess(ProcessStartInfo startInfo)
   at System.Fabric.FabricDeployer.Utility.ExecuteCommand(String command, String arguments, String workingDir, Boolean waitForExit, Nullable`1 timeout)
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{cbd93bc2-71e5-4566-b3a7-595d8eeca6e8}" />
    <EventID>58368</EventID>
    <Version>2</Version>
    <Level>2</Level>
    <Task>228</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2019-03-13T07:17:52.090514100Z" />
    <EventRecordID>3739</EventRecordID>
    <Correlation />
    <Execution ProcessID="3120" ThreadID="3868" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>Development</Computer>
    <Security UserID="S-1-5-21-813763744-1233322525-2174010731-500" />
  </System>
  <EventData>
    <Data Name="id">
    </Data>
    <Data Name="type">FabricDeployer</Data>
    <Data Name="message">Failed to execute command dockerd -H localhost:2375 -H npipe:// --pidfile=C:\SfDevCluster\Data\_sf_docker_pid\636880942720384145_sfdocker.pid from working dir C:\Program Files\Docker wait for exit False exception System.ComponentModel.Win32Exception (0x80004005): The system cannot find the file specified
   at System.Diagnostics.Process.StartWithCreateProcess(ProcessStartInfo startInfo)
   at System.Fabric.FabricDeployer.Utility.ExecuteCommand(String command, String arguments, String workingDir, Boolean waitForExit, Nullable`1 timeout)</Data>
  </EventData>
</Event>
Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          3/13/2019 5:17:52 PM
Event ID:      58368
Task Category: FabricDeployer
Level:         Error
Keywords:      Default
User:          Development\kent
Computer:      Development
Description:
Failed to clean up container network.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{cbd93bc2-71e5-4566-b3a7-595d8eeca6e8}" />
    <EventID>58368</EventID>
    <Version>2</Version>
    <Level>2</Level>
    <Task>228</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2019-03-13T07:17:52.910937900Z" />
    <EventRecordID>3741</EventRecordID>
    <Correlation />
    <Execution ProcessID="3120" ThreadID="3868" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>Development</Computer>
    <Security UserID="S-1-5-21-813763744-1233322525-2174010731-500" />
  </System>
  <EventData>
    <Data Name="id">
    </Data>
    <Data Name="type">FabricDeployer</Data>
    <Data Name="message">Failed to clean up container network.</Data>
  </EventData>
</Event>
Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          3/13/2019 5:18:47 PM
Event ID:      2560
Task Category: General
Level:         Error
Keywords:      Default
User:          Development\kent
Computer:      Development
Description:
GetFileAttributesEx failed with the following error 5
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{cbd93bc2-71e5-4566-b3a7-595d8eeca6e8}" />
    <EventID>2560</EventID>
    <Version>2</Version>
    <Level>2</Level>
    <Task>10</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2019-03-13T07:18:47.548031400Z" />
    <EventRecordID>3762</EventRecordID>
    <Correlation />
    <Execution ProcessID="12756" ThreadID="9648" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>Development</Computer>
    <Security UserID="S-1-5-21-813763744-1233322525-2174010731-500" />
  </System>
  <EventData>
    <Data Name="id">File.GetLastWriteTime</Data>
    <Data Name="type">File</Data>
    <Data Name="text">GetFileAttributesEx failed with the following error 5</Data>
  </EventData>
</Event>
anantshankar17 commented 5 years ago

From this build log, all I can make out is that the cluster was running but the mesh application deployment failed:

3> Deploying application to local Service Fabric cluster... 3> C:\Windows\sysnative\WindowsPowerShell\v1.0\powershell.exe -NoProfile -ExecutionPolicy Bypass -Command "import-module 'C:\Program Files\Microsoft SDKs\Service Fabric\Tools\Mesh\Scripts\PSModule\Microsoft.ServiceFabric.Powershell.Http.psd1'; Connect-SFCluster; New-SFMeshResourceDeployment -ResourceDescriptionList 'C:\Users\kent\Repository\<>\src\<>\MESH\<>\App Resources\app.yaml', 'C:\Users\kent\Repository\<>\src\<>\MESH\MyService\Service Resources\service.yaml', 'C:\Users\kent\Repository\<>\src\<>\MESH\API\Service Resources\service.yaml', 'C:\Users\kent\Repository\<>\src\<>\MESH\<>\obj\SFApp\App.debug.yaml' -ParameterFileName 'C:\Users\kent\Repository\<>\src\<>\MESH\<>\Environments\Local\parameters.yaml'" 3> Connected Successfully! 3> Deploying the application to the Service Fabric local cluster failed. See Service Fabric Explorer for additional details. 3> Done executing task "LocalDeployApplication" -- FAILED.

If would help if you can reset your cluster again, and try app deployment and share the cluster traces from here: C:\SFDevCluster\Log\Traces, I can look and tell you why the app deployment is failing.

Our messages collided. From the event logs it seems to be an issue with the docker failing to start the process. I would suggest you to reboot your machine or reset docker to factory settings.

kentcb commented 5 years ago

@anantshankar17 Thanks. The traces are large. Can you provide an email address with which I can share them via OneDrive?

anantshankar17 commented 5 years ago

You can share the traces to this id: ashank at microsoft dot com. Also did you try to reboot machine or reset the docker to factory settings ? From the event logs, it seems the docker is failing to start the container, in which case SF traces will also indicate the same thing only.

kentcb commented 5 years ago

OK, it's still happening even after Docker reset/SF reset/reboot, so I've shared some traces with you.

anantshankar17 commented 5 years ago

@kentcb Also, can you please check if there is sufficient free disk space on the disk the cluster is deployed ? Can you please share the resource descriptions that you are trying to deploy ? Which VM type are you using ? How many apps have been deployed on this cluster ?

Since, I see all kinds of insufficient resources available traces i suspect this is some incorrect config in the resource description:

2019-3-13 22:48:37.859 PLB.InsufficientResourcesDetected 9900 8640 Insufficient resources when adding service fabric:/inf/API for metric servicefabric:/_CpuCores: Requested 500000, Available 0, Reserved0 BufferPercentage 0, Capacity 1000000

2019-3-13 22:48:37.876 CM.CreateSingleInstanceDeploymentTraceComponent 9900 1932 [(00000000-0000-0000-0000-000000002000:131969890879859490)+eee2219d-a093-4836-817b-eff781ddbefa:0] single instance deployment create failed due to FABRIC_E_INSUFFICIENT_CLUSTER_CAPACITY

2019-3-13 22:51:01.446 FabricDCA.AppConfig 6096 11704 Manifest C:\SfDevCluster\Data_App_Node_0\SingleInstance_0_App2\App.1.0.xml was not found. This could be because the application has been deleted, but we haven't yet processed the event that notifies us about the deletion.

Also, you can use this cmdlet to check the resource availability on the cluster: (Get-ServiceFabricClusterLoadInformation).LoadMetricInformationList | select Name, ClusterCapacity, ClusterLoad, IsClusterCapacityViolation https://docs.microsoft.com/en-us/powershell/module/servicefabric/Get-ServiceFabricClusterLoadInformation?view=azureservicefabricps#outputs

kentcb commented 5 years ago

@anantshankar17 There's over 54GB free disk space, so guessing that's not the issue.

Those resource requests do look odd! At this point it's just a spike project, so only two dummy services. Here are their definitions:

API

service.yaml

## Service definition ##
application:
  schemaVersion: 1.0.0-preview2
  name: inf
  properties:
    services:
      - name: API
        properties:
          description: API description.
          osType: Windows
          codePackages:
            - name: API
              image: api:dev
              endpoints:
                - name: gRPCEndpoint
                  port: 9026
              environmentVariables:
                 - name: ENVIRONMENT
                   value: "[parameters('ENVIRONMENT')]"
                 - name: RESOURCE_GROUP
                   value: "[parameters('RESOURCE_GROUP')]"
                 - name: SEQ_SERVER_URL
                   value: "[parameters('SEQ_SERVER_URL')]"
                 - name: SEQ_API_KEY
                   value: "[parameters('SEQ_API_KEY')]"
                 - name: GRPC_PORT
                   value: "[parameters('GRPC_PORT')]"
              resources:
                requests:
                  cpu: 0.5
                  memoryInGB: 1
          replicaCount: 1
          networkRefs:
            - name: infNetwork
              endpointRefs:
                - name: gRPCEndpoint

Dockerfile

FROM microsoft/dotnet:2.1-runtime-nanoserver-1803 AS base
WORKDIR /app
EXPOSE 9026

FROM microsoft/dotnet:2.1-sdk AS build
WORKDIR /src
COPY API/API.csproj API/
RUN dotnet restore API/API.csproj
COPY . .
WORKDIR /src/API
RUN dotnet build API.csproj -c Release -o /app

FROM build AS publish
RUN dotnet publish API.csproj -c Release -o /app

FROM base AS final
WORKDIR /app
COPY --from=publish /app .
ENTRYPOINT ["dotnet", "API.dll"]

InvitationCodes

service.yaml

## Service definition ##
application:
  schemaVersion: 1.0.0-preview2
  name: inf
  properties:
    services:
      - name: InvitationCodes
        properties:
          description: InvitationCodes description.
          osType: Windows
          codePackages:
            - name: InvitationCodes
              image: invitationcodes:dev
              environmentVariables:
                 - name: ENVIRONMENT
                   value: "[parameters('ENVIRONMENT')]"
                 - name: RESOURCE_GROUP
                   value: "[parameters('RESOURCE_GROUP')]"
                 - name: SEQ_SERVER_URL
                   value: "[parameters('SEQ_SERVER_URL')]"
                 - name: SEQ_API_KEY
                   value: "[parameters('SEQ_API_KEY')]"
                 - name: GRPC_PORT
                   value: "[parameters('GRPC_PORT')]"
              resources:
                requests:
                  cpu: 0.5
                  memoryInGB: 1
          replicaCount: 1
          networkRefs:
            - name: infNetwork

Dockerfile

FROM microsoft/dotnet:2.1-runtime-nanoserver-1803 AS base
WORKDIR /app

FROM microsoft/dotnet:2.1-sdk AS build
WORKDIR /src
COPY InvitationCodes/InvitationCodes.csproj InvitationCodes/
RUN dotnet restore InvitationCodes/InvitationCodes.csproj
COPY . .
WORKDIR /src/InvitationCodes
RUN dotnet build InvitationCodes.csproj -c Release -o /app

FROM build AS publish
RUN dotnet publish InvitationCodes.csproj -c Release -o /app

FROM base AS final
WORKDIR /app
COPY --from=publish /app .
ENTRYPOINT ["dotnet", "InvitationCodes.dll"]

App Resources

app.yaml

application:
  schemaVersion: 1.0.0-preview2
  name: inf
  properties:
    description: inf description.

gateway.yaml

gateway:
  schemaVersion: 1.0.0-preview2
  name: infGateway
  properties:
    description: INF Gateway Resource
    sourceNetwork:
      name: Open
    destinationNetwork:
      name: infNetwork
    tcp:
      - name: gRPCTCP
        port: 9026
        destination:
          applicationName: inf
          serviceName: API
          endpointName: gRPCEndpoint

network.yaml

## Network definition ##
network:
  schemaVersion: 1.0.0-preview2
  name: infNetwork
  properties:
    description: infNetwork description.
    networkAddressPrefix: 10.192.0.0/16
    kind: Local
anantshankar17 commented 5 years ago

@kentcb which VM type are you using, how many cores does that have ? Gateway resource would itself use 1 core, and your service will also consume 1.

After app deployment, did you run this cmdlet and check the resource utilization ? (Get-ServiceFabricClusterLoadInformation).LoadMetricInformationList | select Name, ClusterCapacity, ClusterLoad, IsClusterCapacityViolation https://docs.microsoft.com/en-us/powershell/module/servicefabric/Get-ServiceFabricClusterLoadInformation?view=azureservicefabricps#outputs

anantshankar17 commented 5 years ago

Please reopen if you see this issue even after resolving the resource constraints.