Scalability of CLARIAH tools & infrastructure

We had a meeting between WP3 and WP6 today about certain use cases where a high(er) degree of scalability is needed; specifically the need to invoke certain processing tasks in parallel so the output can be obtained in a more reasonable time.

As this is of course a central theme in any large infrastructure, I wanted to open up this issue to track any progress, solutions and discussion on this, from a generic perspective.

There are different aspects to the need to scale that we need to distinguish:

1. Multithreading, parallel execution to use multiple cores (this is a software matter rather than an infrastructure matter). This comes down to efficient software design that fits contemporary hardware, but is definitely not trivial. Aside from CPUs, the role of GPUs should also be considered here.
2. Distributed computing, i.e. the ability for a single user to dedicate parallel computing resources working together towards a single task, reducing the time in which it runs and results can be obtained. Here we can also distinguish parallelisation on a single computing node (multiple processes) from distribution over a larger computing cluster. The common solution here is to partition the input into n splits (if feasible of course) and run one process for each.
3. Concurrency: scaling up deployments when there are more users at the same time (which is what our Infrastructure Requirements covers in point 23), and scaling down deployments as the number of users shrinks again.
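
To make the "partition the input into n splits and run one process for each" idea from point 2 concrete, here is a minimal single-node sketch in Python. It is only an illustration under assumptions: `partition`, `process_split` and `run_in_parallel` are hypothetical names, and the word-counting task is a stand-in for whatever expensive per-document processing a real tool would do. It also assumes the input items can be processed independently of each other.

```python
from concurrent.futures import ProcessPoolExecutor


def partition(items, n):
    """Split items into at most n contiguous, near-equal chunks."""
    if not items:
        return []
    n = max(1, min(n, len(items)))  # never create more chunks than items
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        # the first `extra` chunks each take one additional item
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_split(split):
    """Hypothetical stand-in for an expensive task: count tokens per document."""
    return [len(doc.split()) for doc in split]


def run_in_parallel(items, n_workers):
    """Run one worker process per split and merge the results in input order."""
    splits = partition(items, n_workers)
    if not splits:
        return []
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # map() yields results in the order of the splits, so the merged
        # output keeps the order of the original input
        results = pool.map(process_split, splits)
        return [value for part in results for value in part]
```

A call such as `run_in_parallel(documents, n_workers=4)` should be made from under an `if __name__ == "__main__":` guard, since worker processes may re-import the main module on some platforms. Distribution over a cluster follows the same split/process/merge shape, with the job scheduler taking the role of the process pool.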

For 1 we need robust software design (and algorithmic design in particular). This is something we need to encourage if the problem can be solved on this level. For 3 we need load balancing and container orchestration, which should be handled by the infrastructure and is viable with solutions like Kubernetes. Point 2 is typically addressed in high-performance clusters using job schedulers like SLURM or complete workflow management solutions (e.g. DANE, Nextflow, Airflow, Luigi). Solutions like kube-scheduler may also be fitting for our service-oriented architecture.

These three are not mutually exclusive; in real situations there may be demand for all three at the same time, which complicates matters.

Any views on this or ongoing efforts that address this?

@mmisworking: to what extent is work being done on 2 and 3 currently in the KNAW HuC kubernetes cluster for CLARIAH? I think 3 is probably the 'lowest hanging fruit' or 'most minimal viable solution'.
There may be certain CLARIAH software that forms a performance bottleneck for certain use cases now and may require extra attention; WP6 identified Alpino as one such tool.

(Poking all participants in the WP3/WP6 meeting (who are on github): @JanOdijk @jorisvanzundert @karinavdo @JuliaNeugarten)
