Kitware / HPCCloud-deploy

VM Deploy for HPC-Cloud
Apache License 2.0

Create vagrant file for compute node #88

Closed jourdain closed 7 years ago

jourdain commented 7 years ago

WIP

jourdain commented 7 years ago

I guess that could be a good start

jourdain commented 7 years ago

Running that new vagrant from scratch to make sure it works without the hpccloud user.

jourdain commented 7 years ago

Hmm, I got...

TASK [pyfr : Install pip3] *****************************************************
task path: /Users/seb/Documents/code/HPCCloud/HPCCloud-deploy/demo/roles/pyfr/tasks/main.yml:1
failed: [hpccloud-compute-node-vm] (item=[u'python3-pip']) => {"failed": true, "item": ["python3-pip"], "msg": "Failed to lock apt for exclusive operation"}

jourdain commented 7 years ago

That role seems to be missing become: yes everywhere. Should I add it, @cjh1?

cjh1 commented 7 years ago

Yes, please.

jourdain commented 7 years ago

OK, thanks. I'll look around and add become: yes wherever I see become_user: root in any role within ./demo/roles.

jourdain commented 7 years ago

@cjh1, can you look at my last commit? I'm a bit worried that I had to change it in a lot of places. I just want to make sure the change actually makes sense to you. Thanks.

jourdain commented 7 years ago

Actually, the pyfr role comes from your repo here: https://github.com/cjh1/pyfr-ansible-role/blob/master/tasks/main.yml

cjh1 commented 7 years ago

They look good. I will update the pyfr role.

jourdain commented 7 years ago

Why don't we use the pyfr role that is in the ansible directory? I just noticed that we are missing 'pycuda' and that the numpy version is different.

cjh1 commented 7 years ago

The pyfr role in the ansible directory came first; it only installs enough for pyfr to be able to partition the mesh. The role I created installs the runtime as well. We can probably just use the one I created, as it is more complete.

jourdain commented 7 years ago

I'm open to any solution. I thought we should try to rely on code that lives in the same repo if we can. The differences between the two files are really small.

jourdain commented 7 years ago

Seems to be working: no more errors at deployment. Trying to use it as a compute node now.

cjh1 commented 7 years ago

The advantage of keeping it in a separate repo is that it can be used in multiple playbooks through Ansible Galaxy.

jourdain commented 7 years ago

Hmm, I'm wondering if something went wrong with my girder:

[10:34:09.698] ERROR: Exception raise by task.
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "cumulus/taskflow/__init__.py", line 117, in wrapped
    return func(celery_task, *args, **kwargs)
  File "/opt/hpccloud/hpccloud/server/taskflows/hpccloud/taskflow/openfoam/tutorial.py", line 201, in upload_output
    girder_token=task.taskflow.girder_token)
  File "cumulus/tasks/job.py", line 774, in upload_job_output_to_folder
    assetstore_id = get_assetstore_id(girder_token, cluster)
  File "cumulus/transport/files/__init__.py", line 54, in get_assetstore_id
    check_status(r)
  File "cumulus/common/__init__.py", line 32, in check_status
    request.raise_for_status()
  File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 844, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
HTTPError: 400 Client Error: Bad Request for url: http://10.160.1.108:8080/api/v1/sftp_assetstores

Swagger does not show any GET endpoint. Only POST?
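One way to double-check which methods an endpoint exposes is to read them straight from the API description instead of the Swagger UI. The sketch below assumes the description is a Swagger 2.0 document already loaded into a dict (for example with json.load); how to fetch it from the running girder instance is left out, and both the helper name methods_for_path and the spec fragment are hypothetical.

```python
HTTP_METHODS = ("get", "put", "post", "delete", "patch", "head", "options")


def methods_for_path(spec, path):
    """Return the HTTP methods a Swagger 2.0 document declares for `path`."""
    # In Swagger 2.0, "paths" maps each route to a dict keyed by method name.
    operations = spec.get("paths", {}).get(path, {})
    return sorted(m.upper() for m in operations if m.lower() in HTTP_METHODS)


# Hypothetical fragment, shaped like what the UI suggested (POST only).
spec = {"paths": {"/sftp_assetstores": {"post": {"summary": "..."}}}}
print(methods_for_path(spec, "/sftp_assetstores"))  # ['POST']
```

If GET is missing from the list while the client code issues a GET against that route, the server plugin and the client are likely out of sync.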

jourdain commented 7 years ago

I should try to pull your latest cumulus too...

jourdain commented 7 years ago

This seems fine now. Do you mind reviewing it, @cjh1?