Azure / aksArc

# Welcome to the Azure Kubernetes Service on Azure Stack HCI repo

This is where the AKS-HCI team will track features and issues with AKS-HCI. We will monitor this repo in order to engage with our community and discuss questions, customer scenarios, or feature requests. Check out our projects tab to see the roadmap for AKS-HCI!

[BUG] Using DownloadSDK still fails to download files #373

Open OEI-Cgray opened 1 month ago

OEI-Cgray commented 1 month ago

Describe the bug DownloadSDK passes its tests during Set-AksHciConfig, but still fails to download the entire file. When it fails, it also closes the existing PowerShell window completely, with no mention of a log file. To even see the error message, I had to run PowerShell within PowerShell so that only the inner process would be killed, leaving the error visible.

To Reproduce Steps to reproduce the behavior:

  1. Set up a new AKS-HCI cluster.

Expected behavior Files to actually download.

Screenshots If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

Collect log files

Set-AksHciConfig -imageDir "c:\clusterstorage\containerstorage\AKS-Images" -cloudConfigLocation "c:\clusterstorage\containerstorage\AKS-Config" -workingDir "c:\clusterstorage\containerstorage\AKS-WorkingDir" -vnet $vnet -cloudservicecidr "10.42.0.129/23"
True
DownloadSdk Tests Execution Started 16.05.2024 09:03:44
===============================================================================

Test (1 of 1): "Validate DownloadSDK Host Firewall URL Requirements". Category: DownloadSDK Host
Testing firewall url requirements
Connection to https://msk8s.api.cdp.microsoft.com ... Succeeded
Connection to http://msk8s.b.tlu.dl.delivery.mp.microsoft.com ... Succeeded

Host is able to reach list of URLs requirements

Test Succeeded
Details: Host is able to reach list of URLs requirements
Recommendation:
Test execution time: 244.9995 milliseconds

DownloadSdk Tests Execution Ended 16.05.2024 09:03:44

=====================================================
       All DownloadSdk Validation tests are successful
=====================================================

Check the test report(downloadsdk_validation_report.html) in current directory

Kva Tests Execution Started 16.05.2024 09:03:45
===============================================================================

Test (1 of 1): "Validate KVA". Category: KVA

panic: totalWritten != stat.Size()

goroutine 145 [running]:
github.com/vbauerster/getparty.Session.concatenateParts({{0x1c0002c8690, 0xef}, {0x1c0002c8690, 0xef}, {0x1c00001e489, 0x6e}, {0x0, 0x0}, {0x1c0001efec5, 0x5}, ...}, ...)
        /home/vsts/go/pkg/mod/github.com/vbauerster/getparty@v1.19.5-0.20231024204429-dbaf1ad02b99/session.go:146 +0x15a5
github.com/vbauerster/getparty.(*Cmd).Run(0x1c000407080, {0x1c00011e100, 0x5, 0x8}, {0x0, 0x0}, {0x0, 0x0})
        /home/vsts/go/pkg/mod/github.com/vbauerster/getparty@v1.19.5-0.20231024204429-dbaf1ad02b99/getparty.go:318 +0x1d85
msazure.visualstudio.com/msazure/msk8s/downloadsdk.git/sdk/pkg/http.DownloadFile({0x7ffc2e2ac348?, 0x1c0001874a0}, {0x1c0002c8690, 0xef}, {0x1c00027e070, 0x6e}, 0xa, {0x0, 0x0}, 0x1c0001b6160)
        /home/vsts/work/1/s/sdk/pkg/http/http.go:85 +0x72f
msazure.visualstudio.com/msazure/msk8s/downloadsdk.git/sdk/pkg/sfsclient.(*SFSClient).GetVerifiedFiles(0x1c0002bd008, {0x7ffc2e2ac348, 0x1c0001874a0}, {{{0x1c000111760, 0x1c}, {0x7ffc2e1d76f0, 0x7}, {0x1c000498144, 0xc}}}, {0x1c0001e3680, ...}, ...)
        /home/vsts/work/1/s/sdk/pkg/sfsclient/sfsclient.go:187 +0x85b
msazure.visualstudio.com/msazure/msk8s/downloadsdk.git/sdk/pkg/download/provider/sfs.(*sfsProvider).getReleaseInternal(0x1c0002bd020, {0x7ffc2e2ac348, 0x1c0001874a0}, {0x1c000111760?, 0x0?}, {0x1c000498144?, 0x0?}, {0x1c0001e3680, 0x46}, 0xa, ...)
        /home/vsts/work/1/s/sdk/pkg/download/provider/sfs/release.go:49 +0x454
msazure.visualstudio.com/msazure/msk8s/downloadsdk.git/sdk/pkg/download/provider/sfs.(*sfsProvider).GetRelease(0x0?, {0x7ffc2e2ac348?, 0x1c0001874a0?}, {0x1c000111760?, 0x0?}, {0x1c000498144?, 0x0?}, {0x0?, 0x0?}, {0x0, ...}, ...)
        /home/vsts/work/1/s/sdk/pkg/download/provider/sfs/release.go:16 +0x96
msazure.visualstudio.com/msazure/msk8s/downloadsdk.git/sdk/pkg/download.(*downloadClient).getRelease(0x1c0003107b0, {0x7ffc2e2ac348, 0x1c0001874a0}, {{0x1c000111760, 0x1c}, {0x1c000498144, 0xc}, {0x0, 0x0}, {0x0, ...}, ...})
        /home/vsts/work/1/s/sdk/pkg/download/release.go:72 +0x2ad
msazure.visualstudio.com/msazure/msk8s/downloadsdk.git/sdk/pkg/download.(*downloadClient).GetRelease(0x0?, {0x7ffc2e2ac348?, 0x1c0001874a0?}, {{0x1c000111760, 0x1c}, {0x1c000498144, 0xc}, {0x0, 0x0}, {0x0, ...}, ...})
        /home/vsts/work/1/s/sdk/pkg/download/release.go:55 +0x13d
main.GetRelease.func2()
        /home/vsts/work/1/s/sdk/main.go:175 +0x97
created by main.GetRelease
        /home/vsts/work/1/s/sdk/main.go:174 +0x5ec

Elektronenvolt commented 1 month ago

@OEI-Cgray I'm familiar with quite a few error messages at the "Validate KVA" stage, but I've never seen github.com and visualstudio.com URLs in AKS Arc setups. What exactly are you trying to do?

OEI-Cgray commented 1 month ago

@Elektronenvolt, I'm trying to install it. I couldn't even set the config, let alone run Install-AksHci.

This is the same issue I've seen before: downloads in my environment, which has no restrictions on outgoing traffic, fail on some file. It's not always the same file. If I enable logging and verbose output, I can find the URL that fails to download and then download it with Invoke-WebRequest from the same machine, even from the same PowerShell session. But the scripts, which download with the DownloadSDK module, fail.

Because the download size did not match the expected download size, DownloadSDK panicked and killed the process.

The file paths are from the DownloadSDK module, which I assume at the moment is a DLL. They would be the local source folders of whatever machine built the release DLL, or something along those lines. They show up because of the panic crash.
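
The panic message `totalWritten != stat.Size()` points at a mismatch between the bytes written and the size of the file on disk. A minimal sketch of a comparable check for the manual Invoke-WebRequest test follows. `Test-DownloadSize` is a hypothetical helper, not part of AksHci or DownloadSDK, and the expected size is assumed to come from somewhere you trust, for example the Content-Length response header if the server sends one.

```powershell
# Hypothetical helper: compares a file's size on disk with the size you expected.
function Test-DownloadSize {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [long]   $ExpectedBytes
    )
    # Length is the size in bytes as reported by the file system.
    $actual = (Get-Item -LiteralPath $Path).Length
    [pscustomobject]@{
        Path          = $Path
        ExpectedBytes = $ExpectedBytes
        ActualBytes   = $actual
        Match         = ($actual -eq $ExpectedBytes)
    }
}
```

If a file fetched manually matches its expected size on the same storage while the scripted download does not, that would at least narrow the problem to the download path rather than the disk.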

Elektronenvolt commented 1 month ago

@OEI-Cgray - I have never seen that issue in any of the setups I did. It looks like an issue with the underlying storage. Did your Invoke-WebRequest download go to the same storage you've set for AKS Arc? Also, endpoint protection has gotten better these days, and a download from PowerShell may be treated as malicious activity. What are the specs of the storage? SSDs only?

OEI-Cgray commented 1 month ago

@Elektronenvolt

The storage is all enterprise-class SSDs, and it worked fine when we were able to install this over a year ago, before these same download issues made us abandon it. And yes, I downloaded to the same storage.

Defender is our endpoint protection, and it does not stop the Invoke-WebRequest test. That's not to say the activity wasn't flagged when downloading via DownloadSDK, but when checking, Defender currently says it has never detected a threat on the three machines that make up the cluster. The machines have no threat detection history according to the various Azure portals either.

I'm happy to generate logs or whatever else is needed to troubleshoot this. Meanwhile we'll also be setting things up to basically do this manually, by installing k8s on some VMs and then installing the Arc integrations afterwards, which is less ideal than getting this issue resolved.

Elektronenvolt commented 1 month ago

@OEI-Cgray Interesting issue - I'm always curious to know the root cause. A lot of the issues I've seen with AKS Arc over the last few years were caused by our own infrastructure or by 'limitations' like firewalls, permissions, ... If it turns out to be infra related, it would be nice to know the root cause. In which Azure region do you run the setup? I'm in West Europe.

Well - try running with the debug and verbose preferences on; maybe you'll see something interesting in the debug and verbose output:

    $VerbosePreference = "continue"
    $DebugPreference = "continue"

If that doesn't point you to something interesting, looks like it's time for a support case.
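
Since the panic closes the window, it may help to run the failing command in a child PowerShell process and send all of its output to a file, much like the "PowerShell within PowerShell" workaround from the original report. This is only a sketch: `Invoke-InChildShell` is a hypothetical helper, and whether the Go panic text ends up in the log depends on how DownloadSDK writes it.

```powershell
# Hypothetical helper: runs a command in a child PowerShell process so a crash
# cannot take down the current window, and logs all of the child's output.
function Invoke-InChildShell {
    param(
        [Parameter(Mandatory)] [string] $Command,
        [Parameter(Mandatory)] [string] $LogPath
    )
    # Reuse whichever PowerShell executable hosts the current session.
    $shell = (Get-Process -Id $PID).Path
    # Turn on verbose and debug output inside the child only.
    $script = '$VerbosePreference = ''Continue''; $DebugPreference = ''Continue''; ' + $Command
    # -EncodedCommand expects Base64 of the UTF-16LE script text; this avoids quoting issues.
    $encoded = [Convert]::ToBase64String([Text.Encoding]::Unicode.GetBytes($script))
    # Redirect every output stream of the child invocation to the log file.
    & $shell -NoProfile -EncodedCommand $encoded *> $LogPath
    # Emit the child's exit code so the caller can tell whether it crashed.
    $LASTEXITCODE
}
```

The `-Command` string would be the full Set-AksHciConfig call; a non-zero return value together with the log file then stands in for the window that disappeared.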

Elektronenvolt commented 1 month ago

@OEI-Cgray - while searching for something else, I saw that AKS Arc is currently supported in these three regions only: Azure-Regions. Might you be running the setup in a non-supported region?

OEI-Cgray commented 4 weeks ago

@Elektronenvolt The resource group is in West US, but the download error occurs on Set-AksHciConfig. As far as I'm aware, you specify the resource group on the next command, Set-AksHciRegistration.

I'll have some time finally towards the end of this week for more testing.

Elektronenvolt commented 3 weeks ago

@OEI-Cgray According to the docs, West US is not supported. Yes, true - I would also expect a region-based error only at the step where you set the resource group, at Set-AksHciRegistration.

Keep testing - the June release is out, and in your case I'd try the offline download and offsite setup.

If that works, consider Internet connectivity issues like https://github.com/Azure/aksArc/issues/355

OEI-Cgray commented 1 week ago

Hello again, I finally got some time to try the offline download. Attached is the end of the results. It passed all 9 tests without any issues.

Test-OfflineDownloadFiles : Cannot bind argument to parameter 'DifferenceObject' because it is null.
At C:\Program Files\WindowsPowerShell\Modules\AksHci\1.2.4\AksHci.psm1:7656 char:17
+ ...             Test-OfflineDownloadFiles -destination $destination -vers ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidData: (:) [Test-OfflineDownloadFiles], ParameterBindingValidationException
    + FullyQualifiedErrorId : ParameterArgumentValidationErrorNullNotAllowed,Test-OfflineDownloadFiles

aks-log.txt

OEI-Cgray commented 1 week ago

Digging into this more, the function Get-ProductReleasesUptoVersion doesn't seem to exist.

It's referenced a few lines before the line 7656 error.

    $updateReleases = Get-ProductReleasesUptoVersion -Version $startVersion -moduleName $moduleName
    $updateReleases | ForEach-Object  {
        $version = $_.Version
        if ([System.Version]$version -ge [System.Version]$startVersion)
        {
            $destination = $global:config[$modulename]["stagingShare"] + "\" + $version
            if (Test-Path $destination) {
                Test-OfflineDownloadFiles -destination $destination -version $version
            } else {
                New-Item -Path $destination -ItemType Directory
                Get-ReleaseContent -version $version -activity $activity -destination $destination -moduleName $moduleName -mode $mode
            }
        }
    }

This does not seem to exist.

Get-ProductReleasesUptoVersion
Get-ProductReleasesUptoVersion : The term 'Get-ProductReleasesUptoVersion' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path
was included, verify that the path is correct and try again.
At line:1 char:1
+ Get-ProductReleasesUptoVersion
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (Get-ProductReleasesUptoVersion:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

Elektronenvolt commented 1 week ago

Line 7656 is Test-OfflineDownloadFiles -destination $destination -version $version, and it complains that DifferenceObject is null. If you go to the function at line 8450, it wants to get the names of the downloaded files: $missingFiles = Compare-Object -ReferenceObject $releaseFiles.Name -DifferenceObject $downloadedFiles.Name

In your log files I see a user profile path, like: Saving to: "C:\\Users\\myuser-~1\\AppData\\Local\\Temp\\dsdk-330135\\manifest.cab". Maybe you simply have permission issues on file access.

Did you specify directories like in the docs? https://learn.microsoft.com/en-us/azure/aks/hybrid/offline-download-22h2#step-4-configure-the-deployment-onsite I don't remember these files ending up in user profile temp folders. Did you use new, non-existing folder names like in the docs?
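
The Compare-Object call quoted above throws exactly this binding error whenever `$downloadedFiles.Name` evaluates to null, for example when nothing has been downloaded to the staging folder yet. A minimal sketch of a guarded comparison follows. `Get-MissingFileName` is a hypothetical helper written for illustration; it is not the AksHci module's implementation.

```powershell
# Hypothetical helper: lists expected file names that are absent from the downloaded set.
function Get-MissingFileName {
    param(
        [string[]] $Expected,
        [string[]] $Downloaded
    )
    # Nothing expected means nothing can be missing.
    if ($null -eq $Expected -or $Expected.Count -eq 0) { return }
    # Nothing downloaded means every expected file is missing; skip Compare-Object,
    # which refuses a null -DifferenceObject.
    if ($null -eq $Downloaded -or $Downloaded.Count -eq 0) { return $Expected }
    # '<=' marks items that appear only in the reference (expected) list.
    Compare-Object -ReferenceObject $Expected -DifferenceObject $Downloaded |
        Where-Object SideIndicator -eq '<=' |
        ForEach-Object InputObject
}
```

With an empty staging folder this returns the full list of expected names instead of failing, which makes "no files arrived" distinguishable from "the check itself broke".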

OEI-Cgray commented 1 week ago

Connect-AzAccount -TenantId '6e0faf7f-ffe0-4345-bf85-9f1011650754' -Subscription '80f4e056-b7b8-4f94-b84a-469cdfd59b48'

Import-Module AksHci

$vnet = New-AksHciNetworkSetting -name k8sdhcpvnet -vswitchName "10.42.x.x" -vipPoolStart "10.42.24.40" -vipPoolEnd "10.42.25.240" -vlanid 24

Set-AksHciConfig -offlineDownload $true -mode full -stagingShare 'C:\clusterstorage\containerstorage\AKS-Staging' -imageDir 'c:\clusterstorage\containerstorage\AKS-Images' -cloudConfigLocation 'c:\clusterstorage\containerstorage\AKS-Config' -workingDir 'c:\clusterstorage\containerstorage\AKS-Working' -vnet $vnet -cloudservicecidr '10.42.0.129/23'

The folders were all empty when starting.

I don't see this function, which is called before Test-OfflineDownloadFiles, in the PSM: Get-ProductReleasesUptoVersion

I can get the initial version, but then it fails due to the non-existing (as far as I can tell) function:

PS C:\Temp> Get-AksHciVersion
VERBOSE: [06/25/2024 13:53:43] [AksHci] Initializing environment
VERBOSE: [06/25/2024 13:53:43] [AksHci] Importing Configuration
VERBOSE: [06/25/2024 13:53:43] [Moc] Importing Configuration
VERBOSE: [06/25/2024 13:53:43] [Moc] Importing Configuration Completed
VERBOSE: [06/25/2024 13:53:44] [Moc] Validating configuration
VERBOSE: [06/25/2024 13:53:44] [Moc] Get MOC Configuration
VERBOSE: [06/25/2024 13:53:44] [Moc]   Installation state is: NotInstalled
VERBOSE: [06/25/2024 13:53:44] [Kva] Importing Configuration
VERBOSE: [06/25/2024 13:53:44] [Kva] Importing Configuration Completed
VERBOSE: [06/25/2024 13:53:44] [Kva] Getting configuration for Kva
VERBOSE: [06/25/2024 13:53:44] [Kva]   Installation state is: NotInstalled
VERBOSE: [06/25/2024 13:53:44] [AksHci] Importing Configuration Completed
VERBOSE: [06/25/2024 13:53:44] [AksHci] Saving Configuration for Module AksHci to configuration file
VERBOSE: [06/25/2024 13:53:44] [Moc] Discovering configuration
VERBOSE: [06/25/2024 13:53:44] [Moc] Importing Configuration
VERBOSE: [06/25/2024 13:53:45] [Moc] Importing Configuration Completed
VERBOSE: [06/25/2024 13:53:45] [Moc] Validating configuration
VERBOSE: [06/25/2024 13:53:45] [Moc] Applying configuration
VERBOSE: [06/25/2024 13:53:45] [Moc] Saving Configuration for Module Moc to configuration file
VERBOSE: [06/25/2024 13:53:45] [Moc] Saving Configuration for Module Moc to configuration file
VERBOSE: [06/25/2024 13:53:45] [AksHci] Uninitializing environment
VERBOSE: [06/25/2024 13:53:45] [AksHci] Saving Configuration for Module AksHci to configuration file
1.0.23.10605
PS C:\Temp>     $startVersion = (Get-AksHciVersion)
VERBOSE: [06/25/2024 13:55:18] [AksHci] Initializing environment
VERBOSE: [06/25/2024 13:55:18] [AksHci] Importing Configuration
VERBOSE: [06/25/2024 13:55:18] [Moc] Importing Configuration
VERBOSE: [06/25/2024 13:55:18] [Moc] Importing Configuration Completed
VERBOSE: [06/25/2024 13:55:18] [Moc] Validating configuration
VERBOSE: [06/25/2024 13:55:18] [Moc] Get MOC Configuration
VERBOSE: [06/25/2024 13:55:18] [Moc]   Installation state is: NotInstalled
VERBOSE: [06/25/2024 13:55:19] [Kva] Importing Configuration
VERBOSE: [06/25/2024 13:55:19] [Kva] Importing Configuration Completed
VERBOSE: [06/25/2024 13:55:19] [Kva] Getting configuration for Kva
VERBOSE: [06/25/2024 13:55:19] [Kva]   Installation state is: NotInstalled
VERBOSE: [06/25/2024 13:55:19] [AksHci] Importing Configuration Completed
VERBOSE: [06/25/2024 13:55:19] [AksHci] Saving Configuration for Module AksHci to configuration file
VERBOSE: [06/25/2024 13:55:19] [Moc] Discovering configuration
VERBOSE: [06/25/2024 13:55:19] [Moc] Importing Configuration
VERBOSE: [06/25/2024 13:55:19] [Moc] Importing Configuration Completed
VERBOSE: [06/25/2024 13:55:19] [Moc] Validating configuration
VERBOSE: [06/25/2024 13:55:19] [Moc] Applying configuration
VERBOSE: [06/25/2024 13:55:20] [Moc] Saving Configuration for Module Moc to configuration file
VERBOSE: [06/25/2024 13:55:20] [Moc] Saving Configuration for Module Moc to configuration file
VERBOSE: [06/25/2024 13:55:20] [AksHci] Uninitializing environment
VERBOSE: [06/25/2024 13:55:20] [AksHci] Saving Configuration for Module AksHci to configuration file
PS C:\Temp>
PS C:\Temp>     $updateReleases = Get-ProductReleasesUptoVersion -Version $startVersion -moduleName $moduleName
Get-ProductReleasesUptoVersion : The term 'Get-ProductReleasesUptoVersion' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path
was included, verify that the path is correct and try again.
At line:1 char:23
+     $updateReleases = Get-ProductReleasesUptoVersion -Version $startV ...
+                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (Get-ProductReleasesUptoVersion:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException
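
A "not recognized" error at the prompt only shows that the function is not visible in the current session; functions that a module keeps private, or that live in a module not yet imported, produce the same message. One way to check whether a definition exists on disk is to search the installed module files. This is a sketch under that assumption: `Find-FunctionDefinition` is a hypothetical helper, and the root folder in the usage note is taken from the error path earlier in the thread.

```powershell
# Hypothetical helper: searches script and module files under a folder for a function definition.
function Find-FunctionDefinition {
    param(
        [Parameter(Mandatory)] [string] $Name,
        [Parameter(Mandatory)] [string] $Root
    )
    # Match "function <Name>" with the name taken literally, not as a regex.
    $pattern = 'function\s+' + [regex]::Escape($Name) + '\b'
    Get-ChildItem -Path $Root -Recurse -File -Include '*.psm1', '*.ps1' |
        Select-String -Pattern $pattern |
        Select-Object Path, LineNumber, Line
}
```

For example, `Find-FunctionDefinition -Name 'Get-ProductReleasesUptoVersion' -Root 'C:\Program Files\WindowsPowerShell\Modules'` would list any file and line where the function is defined, or return nothing if it is absent from that folder.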

Elektronenvolt commented 2 days ago

At New-AksHciNetworkSetting I usually add settings for -ipAddressPrefix, -gateway, and -dnsServers, as documented here: https://learn.microsoft.com/en-us/azure/aks/hybrid/kubernetes-walkthrough-powershell#step-2-create-a-virtual-network Otherwise I don't have any connectivity to other networks or the Internet. Maybe that's the missing piece to get it working.

OEI-Cgray commented 2 days ago

@Elektronenvolt that's for a static IP setup. The existing PowerShell process that's failing the download is running with the OS's current DNS servers. These servers do have unrestricted outgoing internet access, and I've checked the firewall to ensure it's not blocking anything due to any of its internal threat protection mechanisms.

https://learn.microsoft.com/en-us/azure/aks/hybrid/reference/ps/new-akshcinetworksetting#deploy-with-a-dhcp-environment-and-a-vlan

Elektronenvolt commented 1 day ago

Ok, I'll try the offline mode on one of my test setups - I can combine it with a few other topics. But do try offline and offsite: https://learn.microsoft.com/en-us/azure/aks/hybrid/offline-download-22h2#use-offline-download-to-install-offsite You can download the images somewhere else and then copy them to the staging share. It would be interesting to see whether this works.