galaxyproject / galaxy-helm

Minimal setup required to run Galaxy under Kubernetes
MIT License

Galaxy / Kubernetes / HTCondor implementation on Azure #4

Closed · rc-ms closed this issue 4 years ago

rc-ms commented 7 years ago

I'd like to use this thread to discuss and present issues around implementing the AnswerALS workflow on Azure using HTCondor, Kubernetes, and the Azure Container Service.

The current implementation breaks down into two core node roles: storage and compute. We are having trouble spinning up multiple nodes with jobs brokered between them by HTCondor.

The ultimate goal of this implementation is presented in this doc.

bgruening commented 7 years ago

@rc-ms can you tell me how you run the containers, which containers you use, and a few more details so I can reproduce it?

rc-ms commented 7 years ago

Hi @bgruening, apparently @abhi2cool got it running. I'm hoping he'll publish details of our setup here soon.

bgruening commented 7 years ago

@rc-ms that's good to know. My setup is also working as expected. Let me know if we need to change something.

rc-ms commented 6 years ago

Hello there.

Abhik has published a Helm chart version of our Galaxy cluster. We have run it on Azure but want to confirm that others can as well. Here's the Helm chart:

https://github.com/abhi2cool/galaxy-kubernetes-htc-condor/tree/helmv1/helm/galaxy

If you have an Azure account and have installed the azure-cli tools, enter the following in a terminal:

```
az acs create --orchestrator-type kubernetes --resource-group [your-resource-group] --name [k8s-cluster-name] --agent-count 1 --generate-ssh-keys
```

You should be connected to the cluster automatically after creation, but you can also connect to a pre-existing cluster with:

```
az acs kubernetes get-credentials --resource-group [your-resource-group] --name [your-k8s-cluster]
```
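A minimal sketch of what follows once the cluster is up, assuming kubectl and Helm 2 (current at the time of this thread) are installed locally, and using the branch and chart path from the URL above:

```bash
# Confirm kubectl is pointed at the new ACS cluster:
kubectl get nodes

# Install Tiller, the in-cluster half of Helm 2:
helm init

# Fetch the helmv1 branch and install the chart from its helm/galaxy path:
git clone -b helmv1 https://github.com/abhi2cool/galaxy-kubernetes-htc-condor.git
helm install galaxy-kubernetes-htc-condor/helm/galaxy
```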

pcm32 commented 6 years ago

Hi @rc-ms and @abhi2cool, did you first try the galaxy-stable Helm chart available on this other branch? Since most of the files in abhi2cool's repo are copies of our original chart (and not work from scratch), I don't think it makes sense to diverge here. I personally don't see the point of deploying a Condor cluster on top of Kubernetes when the latter can do the scheduling itself, but he can add it to that branch with the appropriate conditionals if he really needs it. @abhi2cool is in this case using files derived from my chart at galaxy 0.3.3, which is a few months out of date (for instance, it doesn't include the latest RBAC fixes, among others).

There is documentation here in particular. I'm currently making sure it works with my latest additions to the Galaxy k8s runner, which handle memory and CPU constraints for jobs; this is essential for running Galaxy in Kubernetes at production grade (otherwise you have a serious chance of choking your cluster).
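For context, the Kubernetes side of such constraints is expressed as resource requests and limits on the job's pod spec. A hedged sketch of the kind of manifest a runner would generate per job (all names, images, and figures here are illustrative, not taken from the runner):

```bash
# Illustrative only: a Kubernetes Job with the memory/CPU constraints
# described above; names, image, and figures are placeholders.
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: galaxy-tool-job            # hypothetical job name
spec:
  template:
    spec:
      containers:
      - name: tool
        image: busybox             # stand-in for a real tool image
        command: ["sh", "-c", "echo simulating a tool run"]
        resources:
          requests:                # reserved by the scheduler
            cpu: "1"
            memory: 2Gi
          limits:                  # hard caps enforced at runtime
            cpu: "2"
            memory: 4Gi
      restartPolicy: Never
EOF
```

Without such limits, a handful of heavy jobs can land on the same node and starve everything else, which is the "choking" risk mentioned above.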

rc-ms commented 6 years ago

Hi Pablo, thanks for the prompt response. We went with HTCondor because we were advised to; @afgane and @nuwang, can you advise here? Also, @abhi2cool pushed those files by mistake. I think he's updating them and will also open a PR.

nuwang commented 6 years ago

@pcm32 @rc-ms The overall plan was to have @abhi2cool's work be a PR against the work you've done, Pablo, and become part of the primary Galaxy Helm chart at: https://github.com/galaxyproject/galaxy-kubernetes. The plan was for the PR to include a few enhancements, among them support for proftpd and HTCondor. The reason we thought it would be good to have HTCondor as an option was to allow user choice and to hedge our bets until the Kubernetes job runner becomes more mature. The original plan was to add support for SLURM too, as we've used it extensively, but we fell back to HTCondor since it's a bit more cloud friendly when it comes to autoscaling etc. At chart install time, the desired job runner would ideally be a configurable option (and we are still hoping SLURM will be one too). This will allow a gradual transition to the Kubernetes job runner in the long term, while keeping more familiar, battle-tested job runners in the short term. Once Abhik's PR is ready, it would be great if you could review it and suggest enhancements.
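To make the install-time switch concrete, it could look something like the following; the `jobRunner` value name is purely hypothetical and not an existing option of any chart mentioned in this thread:

```bash
# Hypothetical: selecting a job runner when installing the chart.
# "jobRunner" is an illustrative value name, not a real chart option.
helm install ./galaxy-stable --set jobRunner=htcondor
# ...or --set jobRunner=k8s, or --set jobRunner=slurm
```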

pcm32 commented 6 years ago

@nuwang in that case, @abhi2cool should be looking at the branch I mentioned (https://github.com/galaxyproject/galaxy-kubernetes/tree/feature/sync_with_galaxy_stable/galaxy-stable), where all the work to support non-PhenoMeNal containers has been carried out. Support for proftpd is already there, and I'm more than happy to review a PR that adds Condor support with the adequate conditionals. The branch already includes a number of improvements over 0.3.3 (I bumped that series to 0.4.x, as many things changed, including the Helm variable nomenclature).

So please, @abhi2cool, work on the sync_with_galaxy_stable branch against the galaxy-stable chart (not the plain galaxy one), ideally through pull requests so that I can review commits. Thanks!

pcm32 commented 6 years ago

Regarding the maturity of Kubernetes (and the runner), that is of course for each to weigh against their own use case. What I can say is that we have used it in PhenoMeNal as the job scheduler/dispatcher for more than a year now, increasingly in high-load scenarios. But I don't oppose in any way having an optional setup for spinning up Condor containers inside k8s; I just didn't add it to the mentioned branch because we (as in PhenoMeNal) don't need it.

abhi2cool commented 6 years ago

Hi all. The Helm chart version of the Galaxy cluster is available at https://github.com/abhi2cool/galaxy-kubernetes-htc-condor. This work is basically an attempt to replicate the Galaxy Docker Compose/Swarm implementation on Kubernetes via Azure, using the identical containers and images (we went the HTCondor way, so no SLURM :P). We are not using PVCs; instead we have dedicated a particular node to storage and are using that node's local filesystem for all intents and purposes. The cluster comes with fully scalable HTCondor worker nodes, implemented through a replication controller. I would really appreciate it if you could go through this work and post your feedback. Thanks!
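Since the workers sit behind a replication controller, scaling them is a single command; a sketch, with a controller name that is illustrative only (check the chart's templates for the real one):

```bash
# Hypothetical controller name; substitute whatever the chart creates.
kubectl scale rc condor-worker --replicas=4

# Watch the extra worker pods come up:
kubectl get pods
```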

rc-ms commented 6 years ago

Hola. I wanted to respond with an update and, ideally, some context as to how we ended up in a different place with our version, @pcm32 and @bgruening. As written and referenced by @afgane and @nuwang, this doc outlines the core deliverables and implementation direction.

In addition, we wanted to set up a solution that enabled the following:

The biggest challenge was storage, specifically storage that could be accessed from Galaxy through some kind of POSIX filesystem and that would both scale to 60 TB of addressable storage and perform. When the work started, an FTP solution wasn't available (that we were aware of), so we ended up building our own FTP implementation. We cycled through a variety of options, eventually deciding to mount (or otherwise attach) NFS volumes directly to the clusters in post-configuration steps outside of a Helm chart.
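For anyone taking the same route, attaching pre-existing NFS storage outside the chart typically means registering it as a PersistentVolume by hand. A minimal sketch, assuming an NFS export already exists (server address, path, and capacity are placeholders):

```bash
# Illustrative only: register an existing NFS export as a PersistentVolume
# for the chart to bind against. Server, path, and capacity are placeholders.
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: galaxy-nfs-pv
spec:
  capacity:
    storage: 60Ti                  # the addressable-storage target above
  accessModes:
  - ReadWriteMany                  # shared POSIX-style access across nodes
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.4               # placeholder NFS server address
    path: /export/galaxy           # placeholder export path
EOF
```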

My hope is that we haven't strayed too far afield from your work, and that we can find an easy way to fold our changes back into your repo.

Thanks again,

rc

rc-ms commented 6 years ago

Hello Pablo @pcm32, I'm in the process of trying to get both @abhi2cool's and your implementations running, sadly with limited success on each. I think I successfully launched my cluster and deployed your chart, but I can't access the Galaxy instance. Could you elaborate on this section of your docs? https://github.com/galaxyproject/galaxy-kubernetes#sqlite-local-deploy-on-minikube. I couldn't get the direct URL to load (you mention the IP, normally 192.168.99.100, and port 30700). You can see in the output below that the port shows up in the statuses, but I can't get it to render in my browser. Any thoughts? Thank you, sir!

```
rc-cola:k8sGalaxy rc$ helm status coiled-newt
LAST DEPLOYED: Tue Jan 23 11:57:24 2018
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/PersistentVolume
NAME       CAPACITY  ACCESSMODES  RECLAIMPOLICY  STATUS  CLAIM               STORAGECLASS  REASON  AGE
galaxy-pv  20Gi      RWX          Retain         Bound   default/galaxy-pvc                        7h

==> v1/PersistentVolumeClaim
NAME        STATUS  VOLUME     CAPACITY  ACCESSMODES  STORAGECLASS  AGE
galaxy-pvc  Bound   galaxy-pv  20Gi      RWX                        7h

==> v1/Service
NAME            CLUSTER-IP    EXTERNAL-IP  PORT(S)         AGE
galaxy-svc-k8s  10.0.115.151               8080:30700/TCP  7h

==> v1/ReplicationController
NAME        DESIRED  CURRENT  READY  AGE
galaxy-k8s  1        1        1      7h
```

I ran these commands after checking the browser URLs. Still no dice:

```
rc-cola:Projects rc$ kubectl port-forward galaxy 8080:30700
Error from server (NotFound): pods "galaxy" not found
rc-cola:Projects rc$ kubectl port-forward galaxy-svc-k8s 8080:30700
Error from server (NotFound): pods "galaxy-svc-k8s" not found
```

pcm32 commented 6 years ago

Hi @rc-ms, you are probably using the branch that is only supposed to work with Galaxy containers that are "like" the PhenoMeNal one. Last time I checked @abhi2cool's work, it was based on that branch as well. You essentially have two choices here:

Unfortunately I'm super busy this week with our release process, a hackathon, and a family member in hospital, but I will try my best to resume work next week on the newer branch, which can use standard Galaxy containers.

pcm32 commented 6 years ago

You shouldn't need to do those port exposures... and Kubernetes nodes won't normally let you access 8080 directly; NodePort services are restricted to an allowed range (30000-32767 by default, which is where 30700 falls).
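In other words, the service should be reachable at `<node-ip>:30700` directly; `kubectl port-forward` against a name that isn't a pod will fail, as above. A quick way to check, using the service name from the `helm status` output:

```bash
# Find an address for one of the cluster nodes:
kubectl get nodes -o wide

# Confirm the NodePort mapping (8080:30700 in the output above):
kubectl describe svc galaxy-svc-k8s
```

Then browse to http://<node-ip>:30700; on a cloud provider you may also need a firewall/NSG rule opening that port to your machine.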

rc-ms commented 6 years ago

Thanks Pablo. I'm going to take a stab at your suggestions (FYI, I was trying to deploy your containers). My efforts to date had me successfully installing things from a deployment perspective but unable to connect to the Galaxy instance itself (I did get the nginx homepage, though). Have you deployed these containers on any cloud providers, or are you using your own local infrastructure?

pcm32 commented 6 years ago

Can you paste the helm install command that you're using? Thanks!

pcm32 commented 6 years ago

We deploy PhenoMeNal with the galaxy chart (not galaxy-stable) on GCE, AWS, Azure, and various OpenStack instances (EBI, de.NBI, Uppsala, etc.). The Helm chart should in any case be (and I think it is) completely cloud-provider agnostic. The only current requirements are a Kubernetes cluster (probably 1.4 or above; I haven't tested 1.9, but I see no reason for it not to work) and a shared filesystem that the k8s cluster can access (how is a decision independent of the Helm chart, which only expects to find a persistent volume).
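A quick pre-flight check against those two requirements, purely illustrative:

```bash
# Check the server version against the rough minimum mentioned above:
kubectl version

# Check that a persistent volume exists for the chart to find:
kubectl get pv
```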

pcm32 commented 6 years ago

@rc-ms just to update you, I tested things yesterday (https://github.com/galaxyproject/galaxy-kubernetes/tree/feature/sync_with_galaxy_stable/galaxy-stable) and the deployment is working fine. I have improved a few things; I probably need to make sure the documentation covers everything, which is likely the main missing bit.

nuwang commented 4 years ago

I'll close this since it's stale now.