Kitware / HPCCloud

A Cloud/Web-Based Simulation Environment
https://kitware.github.io/HPCCloud/
Apache License 2.0

LSF scheduler integration #622

Open robertsawko opened 6 years ago

robertsawko commented 6 years ago

Hi,

My colleague and I are working on LSF integration. We have created and adapted the relevant files (lsf.py and lsf.sh), and LSF now shows up as a "Scheduler" option in the cluster settings. Unfortunately, we still get "Unsupported scheduler" when we press the "Save" button.

Looking at the code, we can see that the queue check in cluster.py still doesn't find the type, even though we added the relevant entry to the type_to_adapter dictionary in queue/__init__.py.

Any advice would be welcome, thanks.

carpemonf-zz commented 6 years ago

Just to add that we also modified the constants.py file for the cumulus plugin, and the hpccloud files (LSF.js, index.js and RunCluster.js), so that the LSF scheduler shows up.

cjh1 commented 6 years ago

That message is produced when is_valid_type returns False. Are you using the same string to identify the queue as the one you registered?

robertsawko commented 6 years ago

Can we just double check with you: where do we put the string to identify the queue, and where do we register it?

cjh1 commented 6 years ago

It's the key used when you add your adapter to the type_to_adapter dict.
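
For example, something along these lines in cumulus/cumulus/queue/__init__.py (just a sketch; LsfQueueAdapter is a placeholder name for whatever class your lsf.py defines):

    from cumulus.constants import QueueType
    from cumulus.queue.lsf import LsfQueueAdapter  # placeholder import for your new adapter

    # the key must match the scheduler type string sent by the client ('lsf')
    type_to_adapter[QueueType.LSF] = LsfQueueAdapter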

carpemonf-zz commented 6 years ago

I think we have the same string for the LSF queue. We have the following definition in /opt/hpccloud/cumulus/cumulus/constants.py:

class QueueType:
    SGE = 'sge'
    PBS = 'pbs'
    SLURM = 'slurm'
    LSF = 'lsf'
    NEWT = 'newt'

and in /opt/hpccloud/hpccloud/src/panels/SchedulerConfig/index.js:

            <option value="sge">Sun Grid Engin</option>
            <option value="pbs">PBS</option>
            <option value="slurm">SLURM</option>
            <option value="lsf">LSF</option>

cjh1 commented 6 years ago

@carpemonf In that case, are you sure you are running the updated code? Maybe add a quick print statement to ensure your changes are being loaded on the server.
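
For example, a quick sketch in cumulus/cumulus/queue/__init__.py (the output should end up in the Girder server's console output or log):

    def is_valid_type(type):
        # temporary debug output to confirm this version of the module is being loaded
        print('is_valid_type(%r), registered types: %s' % (type, list(type_to_adapter.keys())))
        return type in type_to_adapter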

robertsawko commented 6 years ago

@cjh1 , sorry, we are a bit new to Girder and the Python-to-web-UI flow. Could you tell us how to print out that data and where to find the standard output or log file? Thanks.

carpemonf-zz commented 6 years ago

Thanks, I have just added a modified message through the ValidationException in /opt/hpccloud/cumulus/girder/cumulus/server/models/cluster.py. For testing, it also raises a message when the queue is supported.

        if not queue.is_valid_type(scheduler_type):
            raise ValidationException('Unsupported scheduler: %s.' % scheduler_type, 'type')
        else:
            raise ValidationException('Supported scheduler: %s.' % scheduler_type, 'type')

The new messages appear, so at least this piece of code is updated.

Next, I modified the is_valid_type(type) function in /opt/hpccloud/cumulus/cumulus/queue/__init__.py to force the return value to False, to see if the function was checking the LSF queue properly, but it always gives me True for the default queues.

    def is_valid_type(type):
        # force the check to fail for every scheduler type (debugging only)
        valid = False
        return valid
        # return type in type_to_adapter

robertsawko commented 6 years ago

Sorry for bumping this thread, but we are still struggling with this! What we did just now was to comment out the exception raising, and we actually managed to move forward. This suggests the code is being picked up, but maybe not all of it? Any advice would be welcome.

cjh1 commented 6 years ago

@robertsawko Sorry, I have been away at a conference. Will try to take a look today.

cjh1 commented 6 years ago

@robertsawko Are your code changes pushed somewhere so I could take a look?

carpemonf-zz commented 6 years ago

@cjh1 We had some problems with the cumulus dependencies when building HPCCloud for development, so in the meantime we are working with the prebuilt VMs in HPCCloud-deploy/prebuilt/hpccloud-server/. I have attached a patch: lsf.txt. It should work for this VM.

Please let me know if it works for you.

cjh1 commented 6 years ago

So I was able to take your server changes (the new adapter) and apply them locally, and was then able to create an LSF cluster using the following POST (outside the web app):

{
  "config": {
    "scheduler": {"type": "lsf"},
    "ssh": {"user": "test"},
"host": "test"
  },
  "name": "test3",
  "type": "trad"
}
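
For reference, the equivalent request from Python could look roughly like this (a sketch only; the /clusters endpoint path and the Girder-Token value are assumptions about your deployment):

    import requests

    girder_url = 'http://localhost:8080/api/v1'  # adjust for your server
    headers = {'Girder-Token': '<your-girder-token>'}

    cluster = {
        'config': {
            'scheduler': {'type': 'lsf'},
            'ssh': {'user': 'test'},
            'host': 'test'
        },
        'name': 'test3',
        'type': 'trad'
    }

    r = requests.post(girder_url + '/clusters', json=cluster, headers=headers)
    print(r.status_code, r.json())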

@jourdain Can you check that the client side is constructing the appropriate JSON?

carpemonf-zz commented 6 years ago

@cjh1 Could you please confirm that, after applying the following change to cumulus/cumulus/queue/__init__.py, a cluster with PBS, SGE, SLURM or LSF cannot be created? I can still create them with these changes in place:

-    return type in type_to_adapter
+    valid = False
+    return valid
+    #return type in type_to_adapter

cjh1 commented 6 years ago

@carpemonf If I return False from is_valid_type, I am unable to create a cluster; I get "Unsupported scheduler." as expected.

carpemonf-zz commented 6 years ago

Thank you @cjh1, I probably did something wrong then.

robertsawko commented 5 years ago

We're back to working on this topic, and we now have our own Ansible deployment with the LSF integration added as a patch. When we confirm it's working as expected I will make sure we can share it with you, but for now I would like to ask a question without showing the full patch.

It seems that in the lsf.sh script, which we modelled on slurm.sh, some variables are not passed at all, e.g. the queue name and the number of slots. Whatever values we set in the web UI, the script cumulus generates contains no queue line and fixes #BSUB -n (number of slots) to 1.

Could you perhaps point us to the right place in the code where we may have missed something? Which file/component is responsible for extracting these variables from the web forms?

cjh1 commented 5 years ago

@robertsawko I am not super familiar with the frontend code, but I think the place you need to look is here. @jourdain can correct me if I am wrong :smile:

jourdain commented 5 years ago

Yes, you are right, and make sure you register it in the index.js.

carpemonf-zz commented 5 years ago

Hi @jourdain @cjh1

Sorry for bringing this back up again, but we are still having issues passing variables from the web interface to the queue, for example "Number of slots" or "Max runtime". These variables are empty when the corresponding lsf.sh is executed.

Since it was failing with our custom LSF implementation, I decided to test the prebuilt compute-node VMs provided in this repo, configured with an SGE scheduler. I modified /opt/hpccloud/cumulus/cumulus/templates/schedulers/sge.sh to print the following variables:

Taking a deeper look into the cumulus taskflows, I realised that in /opt/hpccloud/hpccloud/server/taskflows/hpccloud/taskflow/openfoam/tutorial.py the numberOfSlots seems to be hardcoded (the same for windtunnel.py):

    ## slots
    job['params']['numberOfSlots'] = 1

However, commenting out this line results in an empty numberOfSlots variable. Can you please provide any help with this?

jourdain commented 5 years ago

Can you check that all the info is properly sent to the server here?

If that's the case, I'm wondering what may be trimming down some of the information.

carpemonf-zz commented 5 years ago

Thanks @jourdain. If I check the payload in that function, all the parameters are right.

Coming back to the taskflow scripts tutorial.py and windtunnel.py: job['params'] = {} is initialised, but numberOfSlots is never linked to the value specified by the user in the web UI. I can see that for PyFR there's something like this:

number_of_procs = kwargs.get('numberOfSlots')

Doing the same for tutorial.py or windtunnel.py worked for me and allows numberOfSlots to be visible to the queue script:

-job['params']['numberOfSlots'] = 1
+job['params']['numberOfSlots'] = kwargs.get('numberOfSlots')

I still have the problem with the wall time. Does the same apply to maxWallTime? Should it be in job['params']?

jourdain commented 5 years ago

On the Python side, what else do you have in the kwargs?

carpemonf-zz commented 5 years ago

I'm using the default prebuilt VMs for HPCCloud server and compute-node. Iterating over kwargs, I get:

[17:26:37.973] INFO: numberOfGpusPerNode
[17:26:37.988] INFO: 0
[17:26:37.994] INFO: numberOfSlots
[17:26:38.000] INFO: 1
[17:26:38.008] INFO: image_spec
[17:26:38.014] INFO: {
  "owner": "695977956746",
  "tags": {
    "openfoam": "1612"
  }
}
[17:26:38.021] INFO: next
[17:26:38.027] INFO: {
  "args": [],
  "chord_size": null,
  "immutable": false,
  "kwargs": {},
  "options": {},
  "subtask_type": null,
  "task": "hpccloud.taskflow.openfoam.tutorial.create_openfoam_job"
}
[17:26:38.035] INFO: queue
[17:26:38.042] INFO: 
[17:26:38.049] INFO: maxWallTime
[17:26:38.055] INFO: {
  "hours": "1",
  "minutes": 0,
  "seconds": 0
}
[17:26:38.062] INFO: input

[17:26:38.068] INFO: {
  "folder": {
    "id": "5cc08e170640fd00e5065cd7"
  },
  "shFile": {
    "id": "5cc08e320640fd00e5065cf4"
  }
}
[17:26:38.075] INFO: output
[17:26:38.083] INFO: {
  "folder": {
    "id": "5cc08e170640fd00e5065cd6"
  }
}
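
(For reference, the dump above came from something like the loop below inside the taskflow task; task.logger here is an assumption about the logging object available at that point.)

    import json

    for key, value in kwargs.items():
        task.logger.info(key)
        if isinstance(value, dict):
            task.logger.info(json.dumps(value, indent=2))
        else:
            task.logger.info(value)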

So, does this need something like job['params']['maxWallTime'] = kwargs.get('maxWallTime')?

jourdain commented 5 years ago

Yes, that was my thought. All the info gets passed in the kwargs, and it's up to the job to pick them up and attach them to the job params.
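
In other words, in the taskflow scripts (tutorial.py / windtunnel.py) that would look something like the sketch below; whether the scheduler templates consume maxWallTime from job['params'] in exactly this form is an assumption:

    # pick the values chosen in the web UI out of the taskflow kwargs
    job['params']['numberOfSlots'] = kwargs.get('numberOfSlots', 1)

    max_wall_time = kwargs.get('maxWallTime')  # e.g. {"hours": "1", "minutes": 0, "seconds": 0}
    if max_wall_time is not None:
        job['params']['maxWallTime'] = max_wall_time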