Objective
Speed up the processing of CLI-based pipelines (use case: neuroimaging) running on HPC by incorporating Big Data strategies (in-memory + data locality) into them in the background.
Why not use existing Big Data Engines?
Many common HPC schedulers do not natively work with existing Big Data engines (e.g. Apache Spark), and as a result there is no straightforward way to execute the pipelines on HPC clusters (all necessary resources must be requested at once and the exact duration of the pipeline must be known in advance).
CLIs cannot interface with the in-memory objects created by Big Data engines (BDEs), so the data must be read and written at each pipeline step on a filesystem accessible to the CLI. As most, if not all, of the data is required by the user after completion, it must be stored on a shared filesystem such as Lustre, further degrading engine performance.
Filesystems designed for Big Data engines do exist (e.g. HDFS, Alluxio), but existing CLIs would need to be modified to leverage them. Moreover, on an HPC cluster these filesystems would only exist for the duration of the pipeline execution, and their setup and teardown overhead further degrades performance.
Design
**Note: will be Python-based for now
File System with Hierarchical storage layers
Be a Linux-based filesystem that exists in userspace (i.e. FUSE) such that any user can mount it at runtime
Be tiered such that any created file goes through a hierarchy of storage devices (e.g. tmpfs, SSD, HDD, Lustre) and is placed on the fastest storage that has enough available space (see the placement/eviction sketch after this list).
Use asynchronous eviction policies in order to free up space at the top of the hierarchy for other files. The eviction policy may be "pipeline-aware", i.e. it only evicts a file to Lustre if the file will never be required again (for example, when UDFs are included in the pipeline code).
Avoid out-of-memory errors at the top of the hierarchy by moving contents further down
If calls are made to files that do not yet exist in the filesystem but are produced by the pipeline, proceed with execution of the pipeline
Ensure that all files are written to the slowest storage tier (i.e. Lustre) at pipeline termination
When evicting, copy then remove, and use a temporary filename during the copy process. Compare checksums to ensure the file was successfully copied (see the sketch after this list).
Render all written files as read-only
Keep track of which files are located locally on a given node via a registry file on the shared filesystem (see the registry sketch below)
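A minimal sketch of the tier placement and checksum-verified eviction described above, assuming hypothetical mount points for each tier; the paths, function names, and serial copy logic are illustrative only, not the actual design.

```python
import hashlib
import os
import shutil

# Hypothetical storage hierarchy, ordered fastest to slowest; real mount
# points would come from user configuration.
TIERS = ["/dev/shm/pipeline", "/local-ssd/pipeline",
         "/local-hdd/pipeline", "/lustre/project/pipeline"]

def pick_tier(size_bytes):
    """Return the fastest tier that has enough free space for a new file."""
    for tier in TIERS:
        st = os.statvfs(tier)
        if st.f_bavail * st.f_frsize >= size_bytes:
            return tier
    raise OSError("no tier has enough free space")

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def evict(path, dst_tier):
    """Copy-then-remove eviction: copy under a temporary name on the slower
    tier, verify the checksum, rename, make it read-only, then delete the
    original copy."""
    dst = os.path.join(dst_tier, os.path.basename(path))
    tmp = dst + ".tmp"
    shutil.copy2(path, tmp)
    if sha256(tmp) != sha256(path):
        os.remove(tmp)
        raise IOError("checksum mismatch while evicting %s" % path)
    os.rename(tmp, dst)
    os.chmod(dst, 0o444)  # written files are rendered read-only
    os.remove(path)
    return dst
```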
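Likewise, a sketch of how file locations could be tracked on the shared filesystem, assuming a hypothetical JSON registry file and path; note that flock support on Lustre depends on its mount options.

```python
import fcntl
import json
import os
import socket

# Hypothetical registry path on the shared filesystem.
REGISTRY = "/lustre/project/pipeline/.file_locations.json"

def register_local_file(path):
    """Record on the shared filesystem which node holds a local copy of `path`.
    An exclusive lock serializes concurrent updates from different nodes
    (Lustre only honours flock when mounted with the flock option)."""
    fd = os.open(REGISTRY, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        raw = f.read()
        locations = json.loads(raw) if raw else {}
        locations[path] = socket.gethostname()
        f.seek(0)
        f.truncate()
        json.dump(locations, f)
        fcntl.flock(f, fcntl.LOCK_UN)
```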
Possible limitations:
The filesystem exists in userspace, making it slower than a kernel-space filesystem
Copies will potentially slow down performance
There may be a limit on the number of files that can be created on a node
Scheduler
Starts a cluster (using the HPC scheduler) based on user requirements
Maintains a graph of tasks
Tasks that are ready to be executed (i.e. their dependencies are met) are sent to the node where the data (or most of it) is available locally. If no data is available locally on any node, any arbitrary node can be used. If the node holding the data stays busy for a certain amount of time, execute the task on another node (scp the data over?). A dispatch sketch follows this list.
Tasks will be added to a local folder on the node. A script on each node will go through and execute each task file in the folder, deleting it once the task completes successfully.
Cluster will terminate at walltime expiration or when there are no more tasks to execute
Cluster should be allowed to expand when additional nodes are added
The user should also be able to terminate the cluster if desired. This should trigger an automatic flush to Lustre, as with all other forms of termination.
Should the pipeline need to be re-executed, the scheduler should not reschedule tasks that have already produced the desired output (i.e. verify output file checksums; see the sketch after this list)
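A rough sketch of the locality-aware dispatch and the per-node worker loop described above; the folder layout, data structures, and busy-node timeout handling are simplified assumptions, not the actual implementation.

```python
import os
import subprocess
import time

TASK_DIR = "/local-ssd/pipeline/tasks"  # hypothetical per-node task folder

def choose_node(task, file_locations, idle_nodes):
    """Send a ready task to the node holding most of its input data; fall
    back to an arbitrary idle node when no input is local anywhere.
    (The busy-node timeout / scp fallback is omitted here.)"""
    counts = {}
    for f in task["inputs"]:
        node = file_locations.get(f)
        if node is not None:
            counts[node] = counts.get(node, 0) + 1
    if counts:
        best = max(counts, key=counts.get)
        if best in idle_nodes:
            return best
    return next(iter(idle_nodes))  # assumes at least one idle node

def worker_loop(poll_interval=1.0):
    """Per-node worker: execute every task script dropped into the local
    folder and delete it once it completes successfully."""
    while True:
        for name in sorted(os.listdir(TASK_DIR)):
            script = os.path.join(TASK_DIR, name)
            if subprocess.call(["bash", script]) == 0:
                os.remove(script)
        time.sleep(poll_interval)
```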
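And a small sketch of how re-execution could skip tasks whose outputs already exist with the expected checksums; the manifest mapping output paths to checksums is an assumed format.

```python
import hashlib
import os

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_rescheduling(task, manifest):
    """Reschedule a task only if an expected output is missing or its
    checksum differs from the value recorded on a previous run
    (`manifest` maps output path -> checksum)."""
    for out in task["outputs"]:
        if not os.path.exists(out) or sha256(out) != manifest.get(out):
            return True
    return False
```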
Possible limitations:
This might not work very well, particularly if there is a large number of tasks and a limit on the number of files that can be created on a node
Interface with existing pipeline engines
Will accept a task graph directly or intercept calls to the HPC scheduler. This will allow users to build workflows the way they are used to and still benefit from improved performance
To intercept calls, the tool must be started prior to script execution; otherwise it will not know to intercept the calls
Pipeline execution should be started either through the tool's CLI or within the user's pipeline code with a command like compute() (see the sketch after this list).
User-defined functions that access pipeline-produced files will also trigger computation.
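A toy sketch of the deferred-execution interface mentioned above, assuming a hypothetical Graph class; it runs tasks serially in the current process rather than dispatching them to a cluster, and the bet/fast commands are just example neuroimaging CLIs.

```python
import subprocess

class Graph:
    """Tiny stand-in for the task-graph interface: tasks are shell commands
    with dependencies, and nothing runs until compute() is called."""
    def __init__(self):
        self.tasks = []  # list of (command, list of dependency ids)

    def add_task(self, command, depends_on=()):
        self.tasks.append((command, list(depends_on)))
        return len(self.tasks) - 1  # task id

    def compute(self):
        done = set()
        while len(done) < len(self.tasks):
            for tid, (cmd, deps) in enumerate(self.tasks):
                if tid not in done and all(d in done for d in deps):
                    subprocess.run(cmd, shell=True, check=True)
                    done.add(tid)

# Example: build the graph up front, then trigger execution with compute().
g = Graph()
t1 = g.add_task("bet sub-01_T1w.nii.gz brain.nii.gz")
t2 = g.add_task("fast brain.nii.gz", depends_on=[t1])
g.compute()  # execution is deferred until this point
```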