Open masonlr opened 6 years ago
I'd suggest we make this part of the "Job manager" configuration as this will already need to handle different communication and execution mechanisms for different compute platforms. That said, we may want to allow specification of cloud or other "off cluster" persistent storage for cluster compute platforms also. I'd suggest defining a storage interface we can then plug into a job manager instance.
Sounds good. I'm testing with the Python Azure SDK for the moment and plan to inject a generic storage handler into the job manager, using an Azure-specific implementation as a first example. The actual code will call the azure.storage.blob Python library, for example:
#!/usr/bin/env python
from azure.storage.blob import BlockBlobService
from azure.storage.blob import ContentSettings

# set credentials
block_blob_service = BlockBlobService(
    account_name='SECRET',
    account_key='SECRET')

# show existing blobs
generator = block_blob_service.list_blobs('blue')
for blob in generator:
    print(blob.name)

# upload a new blob
block_blob_service.create_blob_from_path(
    'blue',
    'test.png',
    'content.png',
    content_settings=ContentSettings(content_type='image/png')
)
At least two different approaches are possible (these apply independently of whether we are using an Azure VM or a local cluster for the simulator computation):
1. Transfer the file directly from the simulator to Azure storage (would require the Python azure module to be installed on the VM).
   con: we have to manage simulator-side dependencies. (We are already doing this though: for example, we require the Python vtk module to be present, hence it is not much of a stretch to also require the Python azure module to be present.)
2. Copy the file to the middleware (via paramiko scp) and allow the middleware to relay the file to Azure storage (a sketch of this relay follows below).
   con: does this mean we're making two transfers? 1. simulator to middleware, 2. middleware to storage. (The middleware-to-storage transfer may be bandwidth cheap if both are hosted on Azure, but this doesn't generalise to other App services, i.e. bring your own Azure/AWS/GCE compute account mixed with bring your own Azure/AWS/GCE storage account.)
I will start on making a prototype for option 1 above.
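For reference, a minimal sketch of option 2, assuming paramiko's SFTP client stands in for scp; the hostname, credentials, container name and file paths below are placeholders, not project settings:

import paramiko
from azure.storage.blob import BlockBlobService, ContentSettings

# step 1: pull a result file from the simulator host onto the middleware host (placeholders throughout)
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect('simulator-host', username='user', key_filename='/path/to/key')
sftp = ssh.open_sftp()
sftp.get('output/summary.png', '/tmp/summary.png')
sftp.close()
ssh.close()

# step 2: relay the file from the middleware to Azure blob storage
block_blob_service = BlockBlobService(account_name='SECRET', account_key='SECRET')
block_blob_service.create_blob_from_path(
    'blue',
    'summary.png',
    '/tmp/summary.png',
    content_settings=ContentSettings(content_type='image/png'))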
In the Case structure we could support an 'outputs' field, for example:
{
    "outputs": [
        {
            "source_uri": "output/summary.png",
            "destination_path": ".",
            "content_type": "image/png"
        },
        {
            "source_uri": "output/output_01.vtk",
            "destination_path": ".",
            "content_type": "application/octet-stream"
        },
        {
            "source_uri": "output/output_02.vtk",
            "destination_path": ".",
            "content_type": "application/octet-stream"
        }
    ]
}
This structure would be parsed to trigger file transfers:
import os
from azure.storage.blob import ContentSettings

# `outputs` is the parsed "outputs" list from the Case structure above;
# block_blob_service is configured with account credentials as in the earlier example
job_id = 'UUID'
for f in outputs:
    basename = os.path.basename(f['source_uri'])
    destination_path = os.path.join(f['destination_path'], basename)
    block_blob_service.create_blob_from_path(
        job_id,             # one container per job, labelled by the job id
        destination_path,   # blob name: destination_path plus the source filename
        f['source_uri'],    # local path of the file to upload
        content_settings=ContentSettings(content_type=f['content_type']))
In this way, we could assign a block blob for each job (the block blob would be labelled by the job id).
Best practice for managing access to Azure storage is to generate SAS tokens. I'd suggest we do something like the az-storage.py script in our azure-batch-tools repo. I'd prefer we used the raw Python SDK rather than the Azure CLI client, but am happy for us to just tweak the current script to use a different location for the SAS token to start with to get things working. Note that blob storage is key-value pair storage, with blob names as keys and blob contents as values. However, "/"s in blob names will act as virtual directories in the portal UI and the file explorer app.

@masonlr What are your thoughts on directory structure for persistent storage? Let's assume for now that an organisation or team might assign a single account for all their simulations.
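For the SAS token approach, a minimal sketch using the same legacy azure.storage.blob SDK as the examples above; the container name, blob name and one-hour expiry are illustrative placeholders:

from datetime import datetime, timedelta
from azure.storage.blob import BlockBlobService, ContainerPermissions

block_blob_service = BlockBlobService(account_name='SECRET', account_key='SECRET')

# generate a read-only SAS token for a container, valid for one hour (placeholder policy)
sas_token = block_blob_service.generate_container_shared_access_signature(
    'blue',
    permission=ContainerPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1))

# build a URL a client can use without the account key
print(block_blob_service.make_blob_url('blue', 'test.png', sas_token=sas_token))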
With reproducibility / verifiability in mind, would we want to store a bunch of input data on persistent storage too (or at least offer the option)? It would be quite cool if someone could copy the directory for a job and re-run it on another compute resource. The main issue I can think of would be how to ensure the same version of the simulation software runs in cases where we don't want to export the simulator to persistent storage and/or the simulator is a platform specific binary. [Edit]: Maybe doable if each run is a docker container? However, that might not play nice with proper clusters.
Yes, definitely keen to store input files. We will need to experiment with this, i.e. do we store the patched input files or do we store the templates and patches together (to allow tweaking of parameters on re-run, for example).
This structure would be parsed to trigger file transfers:
import os
job_id = 'UUID'
for f in outputs:
    basename = os.path.basename(f['source_uri'])
    destination_path = os.path.join(f['destination_path'], basename)
    block_blob_service.create_blob_from_path(
        job_id,
        destination_path,
        f['source_uri'],
        content_settings=ContentSettings(content_type=f['content_type']))
In this way, we could assign a block blob for each job (the block blob would be labelled by the job id).
So my reading of the above is that each job would have a container and the blob name would be a concatenation of destination_path and source_uri from the config file. Is this correct?
"/" characters in the blob name are treated as virtual directories in the Azure web portal and Azure storage explorer app.
destination_path should really be a path (i.e. including filename) relative to the root of the storage account / container set for the job.
job_id sets the blob name here. A blob can hold many files. Files can also be added to an existing blob.
So in this model, we'd have a container for each user (?) and a blob for each job (?)
I don't think that's true. A single blob == a single file. From the Azure Python SDK docs:
create_blob_from_path(container_name, blob_name, file_path, content_settings=None, ...)
Ah sorry, multitasking and confused. Yes, the file explorer shows this.
I think that destination_path should really be a path (i.e. including filename) relative to the root of the storage account / container set for the job.
One complication is if we want to support wildcards. For example, we may not know the names of the output files a priori.
So does this mean we can use virtual directories arranged by job id?
So my reading of the above is that each job would have a container and the blob name would be a concatenation of destination_path and source_uri from the config file. Is this correct?
I agree that the notation needs to change. At the moment, we're assuming that _uri is a path (relative or full) to a file and _path is a path to a directory. In the model above we're pulling the filename (output_01.vtk) from the _uri component, and then concatenating this to the _path directory.
So in this model, we'd have a container for each user (?) and a blob for each job (?)
I'd be inclined to just let the user pick either an Account or a Container as the "root" storage location and then create blobs under these, using prefixes for lower-level grouping. Azure supports unlimited containers per account, but AWS only supports 100 buckets (the AWS container equivalent) per account by default. This can be increased with a service request, but we shouldn't rely on being able to create an arbitrary number of containers.
In terms of prefix patterns, given that "moving" a blob from one "virtual directory" to another by changing its prefix is non-trivial, I'd strongly suggest we stick with a single level job-id/within/job/folder/structure/filename naming convention for blobs and manage groupings by user / organisation / team / case as metadata stored in the middleware database.
One complication is if we want to support wildcards. For example, we may not know the names of the output files a priori.
If the filename is in the source_uri, then we know it. If we want to support a model where the source_uri is source/directory/structure/*.ext using pattern matching, we should support the same pattern matching in the destination. I can see the benefit in a "same as source_uri" shortcut, but would suggest we copy the pattern matching used by the linux cp or scp commands.
suggest we copy the pattern matching used by the linux cp or scp commands.
"."
wouldn't quite mean "same as source_uri
" in the current model, it would mean put the file in the root directory (as we're using basename above to extract the filename).
Maybe duplicate the syntax for cp with the globstar shell option. From the same link as above:
If you're using Bash, you can turn on the globstar shell option to match files and directories recursively:
shopt -s globstar
cp src/**/*.so dst
I'd strongly suggest we stick with a single level job-id/within/job/folder/structure/filename naming convention for blobs and manage groupings by user / organisation / team / case as metadata stored in the middleware database.
Okay. If we're trying to retrieve all files associated with a job (for the reproduction idea above), can we easily query by the job-id stem in the virtual storage path?
We could extract a subset of files from a container:
# get all blobs in container
generator = block_blob_service.list_blobs('container')
for blob in generator:
    # here we add logic to filter the retrieved list by job-id prefix
    print(blob.name)
It's easy to filter a query for blobs by a prefix, but wildcards within the prefix aren't supported. I'm thinking something like job-id/outputs/summary.png. We could also support subfolders (e.g. images, videos, raw-data, etc.) if required.
In pseudocode:
get_blobs(account, container, filter='job_id') would get all files for a job
get_blobs(account, container, filter='job_id/outputs') would get all output files for a job
get_blobs(account, container, filter='job_id/outputs/videos') would get all output files in the videos subdirectory for a job
In real code, we would specify the prefix parameter in the call to list_blobs:
block_blob_service.list_blobs(container_name, prefix=None, ...)
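For example, a minimal sketch of the prefix-filtered query; the container name and job id are placeholders:

from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(account_name='SECRET', account_key='SECRET')

# list all output blobs for a given job by filtering on the job-id prefix
job_id = 'UUID'  # placeholder
for blob in block_blob_service.list_blobs('container', prefix=job_id + '/outputs/'):
    print(blob.name)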
We could also support subfolders
Yes, definitely required. E.g. OpenFOAM creates a directory for each time step.
To make things concrete, can we look at the structure for a real job (all files) and the structure you would like this to be in persistent storage?
Okay got it, list_blobs() takes a prefix.
It also takes a delimiter, which I think means that the behaviour changes from getting all blobs with a particular prefix (i.e. in any virtual subdirectory) to only returning blobs where the delimiter is not in the blob name after the prefix, plus also returning unique prefixes representing the immediate child virtual directories (i.e. the behaviour you'd expect from running ls locally with real directories). You should sanity check this though, as I haven't actually tried it myself.
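A minimal sketch of the delimiter behaviour described above (also untested, per the caveat; container name and prefix are placeholders):

from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(account_name='SECRET', account_key='SECRET')

# with a delimiter, the listing behaves like ls: blobs directly under the prefix are
# returned as blobs, and deeper virtual subdirectories come back as BlobPrefix entries
for item in block_blob_service.list_blobs('container', prefix='job-id/', delimiter='/'):
    print(item.name)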
Do you still have access to the science-gateway-cluster VM? I just turned it on.
Head to /home/vm-admin/OpenFOAM/vm-admin-5.0/run/cavity
vm-admin@science-gateway-cluster:~/OpenFOAM/vm-admin-5.0/run/cavity$ ls *
0:
p U
0.1:
p phi U uniform
0.2:
p phi U uniform
0.3:
p phi U uniform
0.4:
p phi U uniform
0.5:
p phi U uniform
constant:
polyMesh transportProperties
system:
blockMeshDict controlDict fvSchemes fvSolution
VTK:
cavity_0.vtk cavity_20.vtk cavity_60.vtk fixedWalls movingWall
cavity_100.vtk cavity_40.vtk cavity_80.vtk frontAndBack
vm-admin@science-gateway-cluster:~/OpenFOAM/vm-admin-5.0/run/cavity$ find .
.
./0
./0/U
./0/p
./0.5
./0.5/U
./0.5/p
./0.5/phi
./0.5/uniform
./0.5/uniform/time
./system
./system/fvSchemes
./system/blockMeshDict
./system/fvSolution
./system/controlDict
./0.3
./0.3/U
./0.3/p
./0.3/phi
./0.3/uniform
./0.3/uniform/time
./constant
./constant/polyMesh
./constant/polyMesh/neighbour
./constant/polyMesh/faces
./constant/polyMesh/boundary
./constant/polyMesh/owner
./constant/polyMesh/points
./constant/transportProperties
./0.1
./0.1/U
./0.1/p
./0.1/phi
./0.1/uniform
./0.1/uniform/time
./0.2
./0.2/U
./0.2/p
./0.2/phi
./0.2/uniform
./0.2/uniform/time
./0.4
./0.4/U
./0.4/p
./0.4/phi
./0.4/uniform
./0.4/uniform/time
./VTK
./VTK/cavity_40.vtk
./VTK/fixedWalls
./VTK/fixedWalls/fixedWalls_60.vtk
./VTK/fixedWalls/fixedWalls_0.vtk
./VTK/fixedWalls/fixedWalls_20.vtk
./VTK/fixedWalls/fixedWalls_100.vtk
./VTK/fixedWalls/fixedWalls_40.vtk
./VTK/fixedWalls/fixedWalls_80.vtk
./VTK/movingWall
./VTK/movingWall/movingWall_40.vtk
./VTK/movingWall/movingWall_80.vtk
./VTK/movingWall/movingWall_0.vtk
./VTK/movingWall/movingWall_20.vtk
./VTK/movingWall/movingWall_60.vtk
./VTK/movingWall/movingWall_100.vtk
./VTK/cavity_60.vtk
./VTK/cavity_80.vtk
./VTK/cavity_100.vtk
./VTK/cavity_0.vtk
./VTK/frontAndBack
./VTK/frontAndBack/frontAndBack_80.vtk
./VTK/frontAndBack/frontAndBack_40.vtk
./VTK/frontAndBack/frontAndBack_60.vtk
./VTK/frontAndBack/frontAndBack_20.vtk
./VTK/frontAndBack/frontAndBack_100.vtk
./VTK/frontAndBack/frontAndBack_0.vtk
./VTK/cavity_20.vtk
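To make the job-id/within/job/folder/structure/filename convention concrete against this case, a minimal sketch that just prints the blob names we would end up with; the job id and case directory are placeholders:

import os

job_id = 'UUID'  # placeholder
case_dir = '.'   # e.g. the cavity run directory listed above

for root, dirs, files in os.walk(case_dir):
    for name in files:
        local_path = os.path.join(root, name)
        # blob name: job-id prefix plus the path relative to the case directory,
        # e.g. UUID/0.1/p or UUID/VTK/cavity_20.vtk
        blob_name = '/'.join([job_id, os.path.relpath(local_path, case_dir)])
        print(blob_name)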
Cloud storage could be triggered from a STORE action word and an accompanying simulator-side storage script. To begin with, all jobs could call the STORE action on completion. This could be made user configurable in the future.
Thinking out loud about the details. For a byo cluster, the transfer could be triggered from:
- run.sh (this is a short-running script as it only runs a qsub pbs.sh command)
- the pbs.sh script. Complication here is that a compute node would be running a data transfer process (will need to check that this is possible), i.e. we're spending compute node time on a process that could be handled by the login node.

We could enforce that the compute node transfer all outputs to the login node. This way, when we later call a STORE action, we can source content from the login node for transfer to cloud storage. The issue is that the compute node and its mounted scratch space are ephemeral.
We likely need to support a few different configurations.
Maybe we have the compute node passing output data directly to 'Storage' and, at the same time, sending a request to the 'Job manager' to mark the output file as "stored": true or similar. This way we're not burdening the middleware file transfer bandwidth.
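For example, the middleware could record something like a stored flag against each entry in the Case outputs list (illustrative only, not an agreed schema):

{
    "source_uri": "output/summary.png",
    "destination_path": ".",
    "content_type": "image/png",
    "stored": true
}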
One hard part will be getting file transfers to occur while the job is still running. We need a background process in the pbs.sh script which will periodically check for output. This assumes that we don't want to write file transfer code into the job's conventional simulation code (i.e. Blue, OpenFOAM, etc.).
For direct transfer from the compute node to Azure blob storage, we should look into https://rclone.org/ and similar command line tools. rclone can transfer only files that have changed, hence we can call rclone periodically without having to track which files have already been transferred.
I'd strongly suggest we stick with a single level job-id/within/job/folder/structure/filename naming convention for blobs and manage groupings by user / organisation / team / case as metadata stored in the middleware database.
rclone supports this, i.e.
rclone copy local_data remote:blue/jobid
will create a virtual directory in the 'blue' container called 'jobid' and place the contents of local_data/ there.
We end up with URLs similar to: https://sgmiddleware.blob.core.windows.net/blue/jobid/1/output_03.vtk
Filtering will be important (https://rclone.org/filtering/) for synching the entire output structure and having the ability to ignore intermediate log files, intermediate csv files, etc.
@martintoreilly we could go with a daemon cloud storage transfer mechanism. There's a rough prototype here: https://github.com/alan-turing-institute/science-gateway-middleware/tree/sandbox/rclone_daemon
We might need something like this if we want the users to be able to see intermediate output. The user story I have in mind is someone wanting to monitor intermediate VTK graphics when deciding to stop/cancel a running job.
This will relay a subset (configurable via include-file.txt) of intermediate data to Azure blob storage, but we should also support similar relaying to an internal organisation storage server (i.e. when the middleware and storage system are served from within the organisation). We'll need to work this out by dependency injection into the job manager system.
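On the dependency injection point, a minimal sketch of what a pluggable storage handler could look like; the class and method names are illustrative, not the actual middleware API:

from azure.storage.blob import BlockBlobService, ContentSettings


class AzureBlobStorageHandler(object):
    """Illustrative storage handler that could be injected into the job manager."""

    def __init__(self, account_name, account_key, container):
        self.service = BlockBlobService(account_name=account_name, account_key=account_key)
        self.container = container

    def store(self, job_id, local_path, remote_path, content_type=None):
        # blob names follow the job-id/within/job/folder/structure/filename convention
        blob_name = '/'.join([job_id, remote_path])
        settings = ContentSettings(content_type=content_type) if content_type else None
        self.service.create_blob_from_path(
            self.container, blob_name, local_path, content_settings=settings)


# the job manager would receive a handler instance rather than constructing one itself, e.g.
# manager = JobInformationManager(..., storage_handler=AzureBlobStorageHandler('SECRET', 'SECRET', 'blue'))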
4f5556e5130df84dbfca7cc152bf04b00d5501ac connects OpenFOAM to Azure storage. The full directory structure is copied to Azure: patched input files and the full set of output files.
One idea is to auto-generate include-file.txt using the JSON outputs list. This way we can have fine-grained control over which outputs are stored by tweaking the JSON outputs list. rclone has strong support for * wildcards, so we can potentially use these to ensure that filtered sets/sequences of output files are stored on Azure.
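A minimal sketch of the auto-generation idea, assuming the Case JSON lives in a file and that the sync script passes --include-from include-file.txt to rclone; the file names are placeholders:

import json

# load the Case structure containing the "outputs" list shown earlier in this thread
with open('case.json') as fh:
    case = json.load(fh)

# write one rclone include pattern per configured output; source_uri may contain * wildcards
with open('include-file.txt', 'w') as fh:
    for output in case['outputs']:
        fh.write(output['source_uri'] + '\n')

# the sync script would then call something like:
#   rclone copy . remote:openfoam/$JOB_ID --include-from include-file.txt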
Edit: this way links will be grouped by job_id (e.g. pressure field p stored as https://sgmiddleware.blob.core.windows.net/openfoam/1c43fc8d-5d1c-43b7-9e07-5c2ea76df0b1/0.45/p), which is relayed from the middleware to the simulator via a mako-templated job_id variable: https://github.com/alan-turing-institute/science-gateway-middleware/blob/4f5556e5130df84dbfca7cc152bf04b00d5501ac/middleware/job_information_manager.py#L114
A key idea in the PBS script is that we must ensure that the storage sync script completes a full iteration after the simulation finishes. This is achieved by capturing the storage daemon PID (STORAGE_DAEMON_PID=$!), killing the daemon, then running the storage script exactly once.
#!/bin/bash
#PBS -j oe
#PBS -o "TEST.out"
STORAGE_SCRIPT='./bin/storage_sync_azure.sh'
STORAGE_SYNC_FREQUENCY=10
echo "Start PBS"
# kill all child processes on exit
# note: this is likely handled automatically by the scheduler
# trap "exit" INT TERM
# trap "kill 0" EXIT
set -vx
# emulate running in TMPDIR as per Imperial College cx1
TMPDIR="/tmp/pbs.$PBS_JOBID"
echo $TMPDIR
mkdir -p $TMPDIR
cp -r $PBS_O_WORKDIR/* $TMPDIR # TODO explicitly copy only required files
cd $TMPDIR
# run storage daemon loop and collect its PID number
(while true; do $STORAGE_SCRIPT; sleep $STORAGE_SYNC_FREQUENCY; done) &
STORAGE_DAEMON_PID=$!
source /opt/openfoam5/etc/bashrc
./Allrun
# here we ensure cloud storage is complete, before local cluster storage
# STORAGE SYSTEM A: cloud provider storage
# kill storage daemon loop and ensure that it completes
# one full cycle
kill $STORAGE_DAEMON_PID
$STORAGE_SCRIPT
# STORAGE SYSTEM B: local cluster storage
# copy back timestep information
for timestep in $(foamListTimes); do
cp -r $TMPDIR/$timestep $PBS_O_WORKDIR
done
echo "End PBS"
We need to persist a subset of the simulation output data. The model should support two operation modes:
- cloud provider storage (e.g. Azure blob storage)
- local cluster storage: the simulation may run in /tmp of a worker node, but files may need to be transferred to /work/username/simulations