Open · innovate-invent opened 3 years ago
Replacement of loc files: while the PR states a different goal, https://github.com/galaxyproject/galaxy/pull/9875 does help with solving the problem described here. It does not go all the way to where this issue hopes to end up, since that depends on how the people maintaining the data managers implement them, but it does set the first steps toward accomplishing this.
Despite popular opinion, Galaxy is not yet compatible with Cloud infrastructure. I have spent several months working to get Galaxy running in the Cloud but have encountered some major barriers. Discussion at the last dev round table about work offloading and message queues has prompted me to compile all of the issues preventing Cloud deployment.
I think there are subtly varying definitions of "Cloud infrastructure" floating around, so I should begin by defining what I mean when I use that term. Cloud resources are ephemeral and not necessarily co-located. For an application to be Cloud compatible, it needs to access its resources in a way that tolerates high latency and racing access patterns. The application processes need to be able to accept termination with little to no warning without losing work, and replica processes need to be able to recover that work. I also need to point out the distinction between Cloud infrastructure and deploying a fixed compute cluster at a "Cloud provider" such as AWS: even if the cluster can autoscale, its nodes are expected to be long-lived, and this does not meet my definition of Cloud infrastructure.
Several aspects of Galaxy do not meet these requirements:
Dependency on an NFS - While object store support has been added for user data, many resources still require all replicas to share a network file system. NFS is expected to support a range of access patterns that do not scale well or permit high latency, and replicating an NFS across data centers is a massive undertaking that usually requires unwelcome compromises. NFS is also strongly tied to the mounting device's permission domain, which forces fixed uid/gids for all processes when not all hosts share the same domain. Job data, tool data, managed configuration, and data table loc files all have to be shared between application replicas via an NFS. Bridging the filesystem to a more Cloud compatible data store with something such as s3fs adds a layer of complexity and is prone to failure whenever the resource is accessed under the assumption of a full filesystem feature set. Pulsar attempts to remedy the issue for job data, and I am still evaluating it as a solution; I have yet to confirm whether it supports process replication and scaling. It also appears that support for removing the need for an NFS between the Pulsar daemon and the jobs is still in its infancy and needs to be reworked.
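As a rough illustration of the access pattern that avoids the shared filesystem, a data table .loc file could be fetched from an object store on demand instead of being read from an NFS mount. This is only a sketch using boto3; the bucket and key layout are hypothetical, and this is not how Galaxy currently resolves tool data:

```python
# Sketch only: resolve a data table .loc file from S3 instead of an NFS path.
# The bucket/key names are assumptions for illustration.
import boto3

s3 = boto3.client("s3")

def read_loc_file(bucket: str, key: str) -> list[list[str]]:
    """Fetch a tab-separated .loc file from the object store and parse its rows."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = []
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments, as the .loc format allows
        rows.append(line.split("\t"))
    return rows

# Hypothetical usage: each replica resolves the table on demand rather than
# reading a shared tool-data directory from an NFS mount.
# entries = read_loc_file("galaxy-tool-data", "bwa_mem_index.loc")
```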
Intolerant of process termination - Currently in 20.09, the Galaxy app process (apart from the workflow scheduler and job scheduler) attempts to do large operations such as history deletion "in process". A sudden termination of the replica serving the request loses the deletion request and leaves a partially deleted history. Schedulers attempt to synchronize their work by acquiring a lock on a database resource, but there is no mechanism to release this lock when a worker terminates, so all tasks acquired by that scheduler are orphaned and need to be manually killed. These limitations prevent Galaxy from being auto-scalable.
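For contrast, here is a minimal sketch of the pattern that survives sudden termination: run operations like history deletion as broker-backed tasks with late acknowledgement, so work acquired by a killed replica is redelivered instead of orphaned. Celery is used purely as an example here; the task name, helpers, and broker URL are hypothetical and this is not Galaxy's current code:

```python
from celery import Celery

# Broker URL is an assumption for the example; any AMQP/Redis broker would do.
app = Celery("galaxy_tasks", broker="amqp://guest@rabbitmq//")

@app.task(bind=True, acks_late=True, reject_on_worker_lost=True)
def purge_history(self, history_id: int) -> None:
    """Purge a history's datasets in small, idempotent steps.

    acks_late + reject_on_worker_lost means the message is only acknowledged
    once the task finishes, so if the replica running it is terminated the
    broker redelivers the task to another worker instead of orphaning it.
    """
    # Each step must be safe to repeat so a redelivered task can resume
    # after a mid-way termination.
    for dataset_id in fetch_pending_dataset_ids(history_id):
        purge_dataset(dataset_id)

def fetch_pending_dataset_ids(history_id: int) -> list[int]:
    # Placeholder standing in for a database query (hypothetical).
    return []

def purge_dataset(dataset_id: int) -> None:
    # Placeholder standing in for the actual purge work (hypothetical).
    pass
```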
CPU-bound auto-scaling - This is the issue I was discussing with @dannon in the Gitter channel. Galaxy maintains internal queues that are opaque to the infrastructure. This means the only ways to trigger scaling of resources are to watch the CPU load of the processes serving the queues or to count incoming web requests. These metrics do not allow preemptive scaling of resources, and there is no way to inform the infrastructure how much to scale: a queue can hold thousands of microsecond tasks or a few very large ones, and the infrastructure needs to understand the weighted volume of pending work in order to scale appropriately. The job queue is able to do this because it leverages the queue provided by the underlying infrastructure. Ideally, all work tasks should have to pass through an interface provided by the underlying infrastructure to be executed.
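A sketch of the kind of signal this asks for, assuming a Prometheus-style metrics endpoint that an autoscaler (HPA/KEDA or similar) could consume. The metric name, queue names, and cost weighting are illustrative assumptions, not an existing Galaxy endpoint:

```python
from prometheus_client import Gauge, start_http_server

pending_work = Gauge(
    "galaxy_queue_pending_work",
    "Estimated cost of work waiting in an internal queue",
    ["queue"],
)

def estimate_cost(task) -> float:
    # Hypothetical weighting: fall back to 1.0 per task if nothing better exists.
    return getattr(task, "estimated_cost", 1.0)

def publish_queue_metrics(queues: dict[str, list]) -> None:
    # Weight by estimated cost rather than raw count, so a few large tasks and
    # thousands of microsecond tasks produce different scaling signals.
    for name, tasks in queues.items():
        pending_work.labels(queue=name).set(sum(estimate_cost(t) for t in tasks))

if __name__ == "__main__":
    start_http_server(9101)  # scrape target for the autoscaler's metrics adapter
    publish_queue_metrics({"workflow_scheduler": [], "celery_tasks": []})
```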
Locked to quay.io - Currently there is no way to implement container caching. Galaxy is hardcoded to quay.io/biocontainers to resolve its dependencies. I attempted to spoof quay.io via a pull-through proxy, but Galaxy still uses v1 of the registry API, which made this impossible. Compute nodes scale to zero when no work is pending, meaning that every time a large job comes in, every newly spawned compute node has to re-download the tool containers from quay.io. This is very problematic when serving bursty work across over a hundred new compute nodes (which are then deleted when the queue drains).
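To show the flexibility being asked for, a sketch in which the registry prefix is configurable so a pull-through cache or internal mirror can stand in for quay.io. The environment variable, function, and example tag are hypothetical, not current Galaxy behaviour:

```python
import os

DEFAULT_REGISTRY = "quay.io/biocontainers"

def resolve_container(requirement: str, version: str) -> str:
    """Build an image reference from a configurable registry prefix.

    CONTAINER_REGISTRY is a hypothetical setting, e.g.
    registry.example.internal:5000/biocontainers for a pull-through cache.
    """
    registry = os.environ.get("CONTAINER_REGISTRY", DEFAULT_REGISTRY)
    return f"{registry}/{requirement}:{version}"

# With CONTAINER_REGISTRY unset this falls back to quay.io/biocontainers;
# with it set, freshly spawned compute nodes pull from the local mirror instead.
# resolve_container("bwa", "0.7.17--hed695b0_7")
```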
Proposals
Stretch goals
References:
https://github.com/galaxyproject/galaxy/issues/10243
https://github.com/galaxyproject/galaxy/issues/10894
https://github.com/galaxyproject/galaxy/issues/10699
https://github.com/galaxyproject/galaxy/issues/10686
https://github.com/galaxyproject/galaxy/issues/10646
https://github.com/galaxyproject/galaxy/issues/10576
https://github.com/galaxyproject/galaxy/issues/10536
https://github.com/galaxyproject/galaxy/issues/10414
https://github.com/galaxyproject/galaxy/issues/8392
https://github.com/galaxyproject/galaxy/issues/11334
https://github.com/galaxyproject/galaxy/pull/10436
https://github.com/galaxyproject/galaxy/issues/11721