
EOEPCA Deployment Guide
https://deployment-guide.docs.eoepca.org/
Apache License 2.0

EOEPCA+ Infrastructure Deployment Guide should focus on pre-requisites #21

Open spinto opened 1 week ago

spinto commented 1 week ago

I appreciate the effort to put information on how to create a cluster satisfying EOEPCA+ needs, but I think what is in the Infrastructure Deployment guide is a bit too much and risks diverting people's attention from the peculiarities of EOEPCA.

So my fear is twofold: one, it will be hard and a bit pointless to maintain a guide on how to install Kubernetes and set up a k8s cluster when there are several on the internet we can point to (like the Rancher k8s non-production environment installation); two, people may just skip that part and assume their existing Kubernetes cluster is enough, while there are some peculiarities of EOEPCA, like the need to run containers as root, the ReadWriteMany storage, and the specific storage class names for persistence, which may get lost.
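For example, a quick way to see whether a cluster meets the ReadWriteMany requirement is to apply a small RWX claim and check that it binds; a minimal sketch, assuming a placeholder storage class name (`standard-rwx`, not the class name the guide actually defines):

```bash
# RWX probe: apply a ReadWriteMany PVC and check whether any provisioner binds it.
# "standard-rwx" is a placeholder storage class name, not the one EOEPCA expects.
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: eoepca-rwx-probe
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: standard-rwx
  resources:
    requests:
      storage: 1Gi
EOF
kubectl get pvc eoepca-rwx-probe   # stays Pending on clusters without an RWX-capable provisioner
kubectl delete pvc eoepca-rwx-probe
```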

So my proposal would be to rename "Infrastructure Deployment" to "EOEPCA pre-requisites" and have there the following sections:

Plus, the check-prerequisite script should be more "invasive" and run some tests in the cluster, e.g. running a pod as root, starting a pod service with an ingress and checking if the pod is accessible, checking if the certificate for that pod is correct, etc...
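As an illustration only (this is not the current check-prerequisite script, and the hostname is a placeholder), such tests could look roughly like this:

```bash
# Illustrative sketch of more "invasive" prerequisite checks.

# 1) Can the cluster run a container as root?
kubectl run root-check --image=busybox --restart=Never --command -- id -u
sleep 15
kubectl logs root-check                       # prints "0" if the container ran as UID 0
kubectl delete pod root-check --ignore-not-found

# 2) Is a service exposed through the ingress reachable with a valid certificate?
#    Assumes a test deployment/service/ingress has already been created for HOST
#    and DNS points at the cluster's ingress controller.
HOST=test.deploy.example.org                  # placeholder hostname
curl --silent --show-error --fail "https://${HOST}/" > /dev/null \
  && echo "OK: ingress reachable and certificate valid" \
  || echo "WARN: ingress unreachable or certificate invalid"
```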

spinto commented 1 week ago

As a note from the discussions in #23 and #14: in the pre-requisites page we should consider putting info about what is recommended for production and what is recommended for development. This is valid for all three areas: the K8S cluster, the EBS storage and the Object Storage.

I would imagine that, for the K8S cluster, for production we would recommend an external IP address, cert-manager with Let's Encrypt and Rancher (production install), while for development/internal testing/demos we would recommend Rancher (single-node install) and manual TLS.
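To illustrate the production TLS part, a cert-manager ClusterIssuer for Let's Encrypt typically looks something like the sketch below; the issuer name, e-mail address and ingress class are placeholders, not values from the guide:

```bash
# Rough sketch of a Let's Encrypt ClusterIssuer for cert-manager (production TLS case).
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.org                  # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx                     # placeholder ingress class
EOF
```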

For the EBS, I have run several solutions in the past; in production, IBM Spectrum Scale (proprietary) and GlusterFS (open source) work quite well, while for development Longhorn and OpenEBS are supposed to be much simpler to set up.
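For the development case, a Longhorn install is indeed only a couple of commands; a rough sketch (default chart values, nothing EOEPCA-specific):

```bash
# Illustrative Longhorn install for a development cluster, using the upstream chart defaults.
helm repo add longhorn https://charts.longhorn.io
helm repo update
helm install longhorn longhorn/longhorn \
  --namespace longhorn-system --create-namespace
# Longhorn registers a "longhorn" storage class that can serve ReadWriteMany volumes.
kubectl get storageclass
```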

For Object Storage also, the EOEPCA MinIO helm chart is good for development/testing/demos, but probably a standalone MinIO installation or something like the EMC Object Storage solution (or Amazon S3) is a better option.
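Whatever option is chosen, a quick sanity check of the object storage endpoint can be done with the MinIO client; the endpoint and credentials below are placeholders:

```bash
# Object storage sanity check with the MinIO client (mc); endpoint/credentials are placeholders.
mc alias set eoepca https://s3.example.org ACCESS_KEY SECRET_KEY
mc ls eoepca                     # list buckets visible with these credentials
mc mb eoepca/eoepca-smoke-test   # create and remove a throwaway test bucket
mc rb eoepca/eoepca-smoke-test
```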

jdries commented 1 week ago

So you are saying that all operational platforms should operate GlusterFS or else a proprietary solution, right? (Unless RWX volumes are offered by the cloud provider?)

spinto commented 1 week ago

No, I am not saying that. I am saying that there are several solutions which are proven to be operationally ready; GlusterFS is one of them, but there are others. OpenEBS may be one of them: I have not used it personally, but Fabrice was saying that it is used in operations on different platforms.

jdries commented 1 week ago

Thanks for all the explanation, it's already helpful! Anyway, the main concern for operational platforms is to get an idea of what the operational cost will be, and how complex it is to run something like that on an autoscaling cluster in a cloud environment where VMs are ephemeral. From my own experience in running a data storage cluster, it does require significantly more experience and work, but perhaps something modern like OpenEBS solves that. (Even though I hear that cloud providers themselves are also struggling, or have struggled, with providing RWX volumes.) The other interesting option to explore is CWL runners that avoid the shared storage requirement altogether, but again, I would hope that this has all been researched in the past.

spinto commented 1 week ago

About cost/complexity vs advantages, I think it mostly depends on which kind of applications you want to support. CWL is mostly used in HTC/HPC, so it feels "natural" that a CWL runner would assume, or be configured by default with, shared storage across your nodes... but CWL, even if born in HPC, is just a workflow language and does not per se require distributed storage.

And yes, this was explored in the past; CWL (or the OGC AppPackage, BTW) does not mean Calrissian. That is what we have in one of the EOEPCA processing BB "engines", but we already have Toil as a CWL runner for another "engine", and Toil, for example, should not require ReadWriteMany if configured with HTCondor as the scheduler. Also, for OpenEO UDFs, since the use case is not really HTC, you could just have a simple execution via cwltool (see the sketch below). We can chat more about what is best.
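For reference, the kind of "simple execution" meant here is just invoking cwltool directly on an application package; a minimal sketch with placeholder file names:

```bash
# Plain local CWL execution with cwltool, no shared ReadWriteMany storage involved.
# app-package.cwl and params.yml are placeholder file names.
pip install cwltool
cwltool --outdir ./results app-package.cwl params.yml
```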

NOTE: we are digressing outside the scope of this ticket. For that, as I said before, what we need to ensure is that the documentation is also clear in stating that OpenEBS or other ReadWriteMany solutions are required only by some of the EOEPCA BBs (and we should specify which ones).