Azure / cyclecloud-hpcpack

CycleCloud project to enable use of the Microsoft HPC Pack job scheduler in Azure CycleCloud HPC clusters.
MIT License

Nodes not being created, and not able to submit jobs from web portal #22

Open jpROC1 opened 8 months ago

jpROC1 commented 8 months ago

I deployed a fresh CycleCloud 8.6 machine and used the built-in template to deploy a Windows-based head node. The head node comes up successfully, but it is listed as "offline". After I manually bring the node online and submit a job from the job manager, no node is ever spun up in CycleCloud.

The web portal has all the options for submitting jobs greyed out.

The only error I have seen comes from the hpcpack CLI, which fails because it cannot find a Python module for the HPC Pack autoscaler.

coin8086 commented 8 months ago

Hi @jpROC1, which web portal were you using when job submission failed? What was the complete command line you ran, and what was the complete error message? Could you describe the problem more clearly, optionally with screenshots of your portal, the error, etc.? It seems to me that you created a cluster with only one head node and expected CycleCloud autoscaling to grow your HPC Pack cluster, right? If so, did you enable that option in CycleCloud when creating the HPC Pack cluster? Could you provide the options you used to create the cluster?

jpROC1 commented 8 months ago

Hi @coin8086, we deployed a CycleCloud 8.6 machine and then used the built-in template to create an HPC Pack cluster with the latest version of HPC Pack. I accessed the head node through RDP and used the job manager software there, and I accessed the web portal from a local machine. We did enable autoscaling, and the job I submitted through the job manager simply ran dir on a target machine.

The web portal had all the options for job submission greyed out.

I have torn the cluster back down, but will put it back up to get some screenshots.

bvandenbogaard commented 5 months ago

I am having a similar experience. It seems that the scheduled task is not running at all because of a missing-module error:

python.exe : C:\cycle\hpcpack-autoscaler\.venvs\cyclecloud-hpcpack\Scripts\python.exe: Error while
finding module specification for 'cyclecloud-hpcpack.cli' (ModuleNotFoundError: No module named
'cyclecloud-hpcpack')
At C:\cycle\hpcpack-autoscaler\bin\azhpcpack.ps1:6 char:1
+ & python -m cyclecloud-hpcpack.cli @args *> c:\output.txt
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (C:\cycle\hpcpac...cloud-hpcpack'):String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

I got this info after changing the PowerShell script line calling Python into this:

& python -m cyclecloud-hpcpack.cli @args *> c:\output.txt

I have enabled autoscaling when creating the HPC Pack cluster through CycleCloud.
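
A quick way to confirm this kind of failure without editing the PowerShell wrapper is to ask the virtual environment's interpreter whether it can locate the package at all. The snippet below is a minimal sketch, assuming it is run with the python.exe from C:\cycle\hpcpack-autoscaler\.venvs\cyclecloud-hpcpack; the helper name `module_available` is hypothetical and not part of the project.

```python
import importlib.util


def module_available(name):
    """Return True if this interpreter can locate the top-level module `name`."""
    # find_spec returns None when no finder can locate the module,
    # which is the same condition that makes `python -m <name>` fail.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Module name taken from the error message above.
    print("cyclecloud-hpcpack importable:", module_available("cyclecloud-hpcpack"))
```

If this prints False, the package was never installed into the virtual environment, which points at the installer rather than at the scheduled task.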

bvandenbogaard commented 5 months ago

After some digging on the HPC Pack headnode I found this:

PS C:\cycle\jetpack\system\bootstrap\hpcpack-autoscaler-installer> .\install.ps1

    Directory: C:\cycle

Mode                LastWriteTime         Length Name                                                                                                                                         
----                -------------         ------ ----                                                                                                                                         
d-----        5/16/2024   1:07 PM                hpcpack-autoscaler                                                                                                                           

    Directory: C:\cycle\hpcpack-autoscaler

Mode                LastWriteTime         Length Name                                                                                                                                         
----                -------------         ------ ----                                                                                                                                         
d-----        5/16/2024   1:07 PM                bin                                                                                                                                          
Requirement already satisfied: pip in c:\cycle\hpcpack-autoscaler\.venvs\cyclecloud-hpcpack\lib\site-packages (24.0)
Processing c:\cycle\jetpack\system\bootstrap\hpcpack-autoscaler-installer\packages\argcomplete-1.12.2-py2.py3-none-any.whl
Processing c:\cycle\jetpack\system\bootstrap\hpcpack-autoscaler-installer\packages\certifi-2020.12.5-py2.py3-none-any.whl
Processing c:\cycle\jetpack\system\bootstrap\hpcpack-autoscaler-installer\packages\chardet-5.2.0-py3-none-any.whl
pip.exe : ERROR: charset_normalizer-3.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl is not a supported wheel on this platform.
At C:\cycle\jetpack\system\bootstrap\hpcpack-autoscaler-installer\install.ps1:27 char:1
+ & pip install -U (get-item $PSScriptRoot\packages\*)
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (ERROR: charset_... this platform.:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

Generating config at : C:\cycle\jetpack\config\autoscale.json
python.exe : C:\cycle\hpcpack-autoscaler\.venvs\cyclecloud-hpcpack\Scripts\python.exe: Error while finding module specification for 'cyclecloud-hpcpack.cli' (ModuleNotFoundError: No module 
named 'cyclecloud-hpcpack')
At C:\cycle\hpcpack-autoscaler\bin\azhpcpack.ps1:7 char:1
+ & python -m cyclecloud-hpcpack.cli @args
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (C:\cycle\hpcpac...cloud-hpcpack'):String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

It seems that the hpcpack-autoscaler-installer is trying to install a Linux wheel built for Python 3.10 (tagged cp310 and manylinux) on the Windows HPC Pack head node, which runs Python 3.8.8.
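
The mismatch can be spotted from the wheel filenames alone, because the last three dash-separated fields of a wheel name are the Python tag, the ABI tag, and the platform tag. The following is a rough sketch of such a check, assuming a CPython 3.8 interpreter on Windows; the function names are hypothetical, and the matching is deliberately simplified (it ignores abi3 wheels and other cases that pip itself handles).

```python
def parse_wheel_tags(filename):
    """Split a wheel filename into (python_tag, abi_tag, platform_tag)."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    if len(parts) < 5:
        raise ValueError("not a wheel filename: " + filename)
    # The last three fields are always python tag, abi tag, platform tag.
    return tuple(parts[-3:])


def looks_installable(filename, py_tag="cp38", platform_prefix="win"):
    """Simplified check: does the wheel target this interpreter and OS?"""
    python_tag, _abi_tag, platform_tag = parse_wheel_tags(filename)
    # Compressed tag sets are joined with dots, e.g. "py2.py3".
    python_ok = any(t in (py_tag, "py3") for t in python_tag.split("."))
    platform_ok = any(
        t == "any" or t.startswith(platform_prefix)
        for t in platform_tag.split(".")
    )
    return python_ok and platform_ok


def unsupported_wheels(filenames, py_tag="cp38", platform_prefix="win"):
    """Return the wheel filenames that fail the simplified check."""
    return [f for f in filenames if not looks_installable(f, py_tag, platform_prefix)]


if __name__ == "__main__":
    # Filenames taken from the installer output above.
    packages = [
        "argcomplete-1.12.2-py2.py3-none-any.whl",
        "certifi-2020.12.5-py2.py3-none-any.whl",
        "chardet-5.2.0-py3-none-any.whl",
        "charset_normalizer-3.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    ]
    print(unsupported_wheels(packages))
```

Run against the filenames from the installer log, this flags only the charset_normalizer wheel, matching the pip error.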

After replacing the whl file by hand, the installer script works correctly. To verify this, I used the wheel from https://files.pythonhosted.org/packages/db/fb/d29e343e7c57bbf1231275939f6e75eb740cd47a9d7cb2c52ffeb62ef869/charset_normalizer-3.3.2-cp38-cp38-win_amd64.whl.
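
As an alternative to copying the URL by hand, pip can fetch a wheel for a specific interpreter and platform with `pip download`. The sketch below only builds the argument list; the helper name `pip_download_args` and the `packages` destination are assumptions for illustration, and the version and platform defaults are the ones mentioned in this thread. It has not been verified against the installer script itself.

```python
import sys


def pip_download_args(requirement, dest, python_version="3.8", platform="win_amd64"):
    """Build a `pip download` command line for a wheel matching the target host."""
    return [
        sys.executable, "-m", "pip", "download",
        "--only-binary=:all:",   # required by pip when overriding the target platform
        "--no-deps",             # fetch only the named package
        "--python-version", python_version,
        "--platform", platform,
        "--dest", dest,
        requirement,
    ]


if __name__ == "__main__":
    args = pip_download_args("charset_normalizer==3.3.2", "packages")
    print(" ".join(args))
    # To actually download, pass the list to subprocess.run(args, check=True).
```

The Linux wheel still has to be removed from the installer's packages directory afterwards, since install.ps1 passes every file in that directory to pip.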