bsc-wdc / dislib

The Distributed Computing library for Python, implemented using the PyCOMPSs programming model for HPC.
Apache License 2.0

Running on single CPU node system #438

Open vineel96 opened 1 year ago

vineel96 commented 1 year ago

Hi @cTatu, is it possible to run dislib on a single-node system with an 8-core CPU? I am getting a "non-reachable nodes" error when running on a single node. Also, will the performance boost remain the same?

lezzidan commented 1 year ago

Could you please send the logs and details on how/where you are running?

vineel96 commented 1 year ago

Hi @lezzidan,

Hardware info:

  1. AWS c7g.4xlarge instance
  2. Architecture: ARM (aarch64)
  3. Number of CPUs: 1, CPU cores: 16, no hyperthreading (i.e. only one thread per core)
  4. RAM: 32GB

Installation: pycompss=3.1, dislib=0.8.0, python=3.9 (followed the installation steps mentioned in the docs)

Algorithm: dislib KMeans

Dataset size: 236930 x 14

Command 1: python kmeans_dislib.py

Observation: The program hangs for a long time.

htop output: only 1 core is being used.

Command 2:

export ComputingUnits=8
runcompss kmeans_dislib.py

Observation: "No task could be scheduled to any of the available resources, shutting down COMPSs"

Screenshot 2023-05-11 102243

htop output: different cores are in use at different moments, seemingly at random.
Screenshot (30)

cTatu commented 1 year ago

Hi, I suspect that the default ComputingUnits in the resources.xml of COMPSs is set to only 4 cores. Try looking into this file /opt/COMPSs//Runtime/configuration/xml/resources/default_resources.xml and change <ComputingUnits>4</ComputingUnits> to 16.
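To double-check what the file declares before and after editing it, something like the sketch below could help. It only assumes that the value sits in an element named ComputingUnits, as in the line quoted above; the sample XML and the function name `computing_units` are hypothetical, and the real default_resources.xml contains more fields than shown here.

```python
import os
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample; the real file has more fields.
SAMPLE_XML = """<ResourcesList>
  <ComputeNode Name="localhost">
    <Processor Name="MainProcessor">
      <ComputingUnits>4</ComputingUnits>
    </Processor>
  </ComputeNode>
</ResourcesList>"""


def computing_units(xml_text):
    """Return every ComputingUnits value found in the XML, as integers."""
    root = ET.fromstring(xml_text)
    values = []
    for elem in root.iter():
        # Strip any XML namespace prefix ("{ns}Tag" -> "Tag") before comparing.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "ComputingUnits" and elem.text:
            values.append(int(elem.text.strip()))
    return values


units = computing_units(SAMPLE_XML)
print("ComputingUnits declared:", units)
print("Cores reported by the OS:", os.cpu_count())
```

To inspect the real file, pass its contents to `computing_units` instead of `SAMPLE_XML` and compare the result with the core count of the machine.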

You can also try export ComputingUnits=1. You mentioned a dataset size of 236930 x 14, but what block size are you using? That determines the number of tasks that will be launched in parallel.

vineel96 commented 1 year ago

Hi @cTatu, I changed the ComputingUnits value to 16 in default_resources.xml. The error remained the same ("No task could be scheduled to any of the available resources, shutting down COMPSs"), or the program hangs for a long time. I also tried setting "export ComputingUnits=1", but the same issue persists. I have tried two block sizes: (229616, 7) and (2, 2). With both block sizes the behavior is the same: the program either hangs or reports that no task can be scheduled and shuts down COMPSs.
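For reference, here is a rough sketch of how the two block sizes above partition the 236930 x 14 dataset. It only counts blocks with plain arithmetic; the helper `block_grid` is hypothetical (not part of dislib), and how many tasks are actually launched per block depends on the algorithm.

```python
import math


def block_grid(shape, block_size):
    """Return (row_blocks, col_blocks) for a 2-D array split into blocks."""
    rows, cols = shape
    block_rows, block_cols = block_size
    # The last block in each dimension may be smaller, hence the ceiling.
    return math.ceil(rows / block_rows), math.ceil(cols / block_cols)


dataset_shape = (236930, 14)
for block_size in [(229616, 7), (2, 2)]:
    row_blocks, col_blocks = block_grid(dataset_shape, block_size)
    print(block_size, "->", row_blocks, "x", col_blocks,
          "=", row_blocks * col_blocks, "blocks")
```

The first block size gives a 2 x 2 grid (4 blocks), while the second gives 118465 x 7 (829255 blocks), so the two runs sit at opposite extremes: very little to parallelize in one case and a very large number of tiny blocks in the other.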

vineel96 commented 1 year ago

Hi @cTatu, @lezzidan, could I get any suggestions or help regarding the issue described above?

cTatu commented 1 year ago

Hey, sorry for the delay.

One possibility is that the SSH daemon is not running. COMPSs needs SSH access to the worker node (which in your case is the same machine). To check, try executing ssh localhost; it should connect without asking for a password (using RSA keys). Make sure the service is installed (sudo apt install openssh-server) and running (sudo service ssh start). For the passwordless configuration you can follow our guide: https://compss-doc.readthedocs.io/en/stable/Sections/01_Installation/05_Additional_configuration.html
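If it is easier to script the check, a minimal sketch along these lines could be used. It relies on the standard OpenSSH client options BatchMode and ConnectTimeout; the function names are hypothetical, and this is not something COMPSs itself provides.

```python
import subprocess


def ssh_check_command(host="localhost", timeout_s=5):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",               # never ask for a password
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly if sshd is down
        host,
        "true",                               # no-op remote command
    ]


def passwordless_ssh_works(host="localhost"):
    """Return True if a non-interactive ssh login to host succeeds."""
    try:
        result = subprocess.run(
            ssh_check_command(host),
            capture_output=True,
            timeout=15,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # ssh client missing, or the connection hung
        return False
    return result.returncode == 0
```

Calling `passwordless_ssh_works()` should return True once key-based login to localhost is set up. A False result can also mean that the host key has not been accepted yet, so it is worth running ssh localhost once interactively first.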

Hope this works.

Best regards