Long-term file storage - Githubissues

masonlr commented 6 years ago

We need to persist a subset of the simulation output data. The model should support two operation modes:

In "cloud" mode, files should be transferred to azure storage using the azure API
In "cluster" mode, the user should specify a target directory for long term storage. For example, the simulation may run on the /tmp of a worker node, but files may need to be transferred to /work/username/simulations.

This could be handled in the custom shell scripts, but it would be useful for the user to be able to specify the target storage directory through the web interface.

martintoreilly commented 6 years ago

I'd suggest we make this part of the "Job manager" configuration as this will already need to handle different communication and execution mechanisms for different compute platforms. That said, we may want to allow specification of cloud or other "off cluster" persistent storage for cluster compute platforms also. I'd suggest defining a storage interface we can then plug into a jab manager instance.

masonlr commented 6 years ago

Sounds good. Testing with python Azure for the moment and plan to inject a generic storage handler into the job manager using an Azure specific example for the moment. The actual code will call the azure.storage.blob python library, for example:

#!/usr/bin/env python

from azure.storage.blob import BlockBlobService
from azure.storage.blob import ContentSettings

# set credentials
block_blob_service = BlockBlobService(
    account_name='SECRET',
    account_key='SECRET')

# show existing blobs
generator = block_blob_service.list_blobs('blue')
for blob in generator:
    print(blob.name)

# upload a new blob
block_blob_service.create_blob_from_path(
    'blue',
    'test.png',
    'content.png',
    content_settings=ContentSettings(content_type='image/png')
            )

masonlr commented 6 years ago

At least two different approaches are possible (these apply independent of whether we are using an Azure VM or a local cluster for the simulator computation):

Transfer file directly from simulator to Azure storage (would require custom python Azure module installed on VM).

con: have to manage simulator-side dependencies. (We are already doing this though, for example we require the python vtk module to be present, hence it is not much of a stretch to require the python azure module to be present.)
Copy file to middleware (via paramiko scp) and allow middleware to relay file to Azure storage.

con: does this mean we're making two transfers? 1. simulator to middleware, 2. middleware to storage. (Middleware-to-storage transfer may be bandwidth cheap if both are hosted on Azure, but this doesn't generalise to other App services, i.e. bring your own Azure/AWS/GCE compute account mixed with bring your own Azure/AWS/GCE storage account)

masonlr commented 6 years ago

I will start on making a prototype for option 1 above.

masonlr commented 6 years ago

In the Case structure we could support an 'outputs' field, for example:

{
  "outputs": [
    {
      "source_uri": "output/summary.png",
      "destination_path": ".",
      "content_type": "image/png",
    },
    {
      "source_uri": "output/output_01.vtk",
      "destination_path": ".",
      "content_type": "application/octet-stream",
    },
    {
      "source_uri": "output/output_02.vtk",
      "destination_path": ".",
      "content_type": "application/octet-stream",
    }
  ]
}

This structure would be parsed to trigger file transfers:

import os
job_id = 'UUID'
for f in outputs:
    basename = os.path.basename(f['source_uri'])
    destination_path = os.path.join(f['destination_path'], basename)
    block_blob_service.create_blob_from_path(
        'job_id',
        f['source_uri'],
        destination_path,
        content_settings=ContentSettings(content_type=f['content_type']))

In this way, we could assign a block blob for each job (the block blob would be labelled by the job id).

martintoreilly commented 6 years ago

Best practice for managing access to Azure storage is to generate SAS tokens. I'd suggest we do something like the following:

Our middleware has an associated "Service Principal" user on the Azure subscription, which is granted access to any required ability to access / create / run Azure resources. For persistent storage, this would include read/write/create access to a Blob storage account (or potentially a container within a blob storage account).
The middleware uses its Service Principal privileges to:
- Create a container within the blob account if necessary
- Generate SAS tokens for use by the scripts running on the VM and uploads them to the VM on creation. I'd recommend we mint SAS tokens on a per-job basis and keep their lifetime short (say the expected execution time of the job + some buffer). To start with let's just make these 24 hour expiry.
The VM to use something similar to the az-storage.py script in our azure-batch-tools repo. I'd prefer we used the raw Python SDK rather than the Azure CLI client, but am happy for us to just tweak the current script to use a different location for the SAS token to start with to get things working. Note that bob storage is key-value pair storage, with blob names as keys and blob contents as values. However, "/"s in blob names will act as virtual directories in the portal UI and the file explorer app.

Resources

martintoreilly commented 6 years ago

@masonlr What are your thoughts on directory structure for persistent storage? Let's assume for now that an organisation or team might assign a single account for all their simulations.

martintoreilly commented 6 years ago

With reproducibility / verifiability in mind, would we want to store a bunch of input data on persistent storage too (or at least offer the option)? It would be quite cool if someone could copy the directory for a job and re-run it on another compute resource. The main issue I can think of would be how to ensure the same version of the simulation software runs in cases where we don't want to export the simulator to persistent storage and/or the simulator is a platform specific binary. [Edit]: Maybe doable if each run is a docker container? However, that might not play nice with proper clusters.

masonlr commented 6 years ago

Yes, definitely keen to store input files. We will need to experiment with this, i.e. do we store the patched input files or do we store the templates and patches together (to allow tweaking of parameters on re-run, for example).

martintoreilly commented 6 years ago

This structure would be parsed to trigger file transfers:
import os
job_id = 'UUID'
for f in outputs:
basename = os.path.basename(f['source_uri'])
destination_path = os.path.join(f['destination_path'], basename)
block_blob_service.create_blob_from_path(
'job_id',
f['source_uri'],
destination_path,
content_settings=ContentSettings(content_type=f['content_type']))
In this way, we could assign a block blob for each job (the block blob would be labelled by the job id).

So my reading of the above is that each job would have a container and the blob name would be a concatenation of destination_path and source_uri from the config file. Is this correct?

My feeling is that one container per job feels like overkill. Filtering on blob name prefix is easy and / characters in the blob name are treated as virtual directories in the Azure web portal and Azure storage explorer app.
I think that destination_path should really be a path (i.e. including filename) relative to the root of the storage account / container set for the job.

masonlr commented 6 years ago

job_id sets the blob name here. A blob can hold many files. Files can also be added to an existing blob.

masonlr commented 6 years ago

So in this model, we'd have a container for each user (?) and a blob for each job (?)

martintoreilly commented 6 years ago

I don't think that's true. A single bob == a single file. From the azure Python SDK docs

create_blob_from_path(container_name blob_name, file_path, content_settings=None,...)

masonlr commented 6 years ago

Ah sorry, multitasking and confused. Yes the file explorer shows this:

screen shot 2018-01-04 at 11 18 07 am

I think that destination_path should really be a path (i.e. including filename) relative to the root of the storage account / container set for the job.

On complication is if we want to support wildcards. For example, we may not know the names of the output files a priori.

masonlr commented 6 years ago

So does this mean we can use virtual directories arranged by job id?

masonlr commented 6 years ago

So my reading of the above is that each job would have a container and the blob name would be a concatenation of destination_path and source_uri from the config file. Is this correct?

I agree that the notation needs to change. At the moment, we're assuming that _uri is a path (relative or full) to a file and _path is a path to a directory. In the model above we're pulling the filename (output_01.vtk) from the _uri component, and then concatenating this to the _path directory.

martintoreilly commented 6 years ago

So in this model, we'd have a container for each user (?) and a blob for each job (?)

I'd be inclined to just let the user pick either an Account or a Container as the "root" storage location and then create blobs under these, using prefixes for lower level grouping. Azure supports unlimited containers per account, but AAWS only supports 100 buckets (AWS container equivalent) per account by default. This can be increased with a service request but we shouldn't rely on being able to create an arbitrary number of containers.

In terms of prefix patterns, given that "moving" a blob from one "virtual directory" to another by changing it's prefix is non-trivial, I'd strongly suggest we stick with a single level job-id/within/job/folder/structure/filename naming convention for blobs and manage groupings by user / organisation / team / case as metadata stored in the middleware database.

martintoreilly commented 6 years ago

On complication is if we want to support wildcards. For example, we may not know the names of the output files a priori.

If the filename is in the source_uri, then we know it. If we want to support a model where the source_uri is source/directory/structure/*.ext using pattern matching, we should support the same pattern matching in the destination. I can see the benefit in a "same as source_uri" shortcut, but would suggest we copy the pattern matching used by the linux cp or scp commands.

martintoreilly commented 6 years ago

suggest we copy the pattern matching used by the linux cp or scp commands.

Hmm. That might not be as straightforward as I thought

masonlr commented 6 years ago

"." wouldn't quite mean "same as source_uri" in the current model, it would mean put the file in the root directory (as we're using basename above to extract the filename).

martintoreilly commented 6 years ago

Maybe duplicate the syntax for cp with globstar shell option. From the same link as above

If you're using Bash, you can turn on the globstar shell option to match files and directories recursively:
shopt -s globstar
cp src/**/*.so dst

masonlr commented 6 years ago

I'd strongly suggest we stick with a single level job-id/within/job/folder/structure/filename naming convention for blobs and manage groupings by user / organisation / team / case as metadata stored in the middleware database.

Okay. If we're trying to retrieve all files associated with a job (for the reproduction idea above), can we easily query by the job-id stem in the virtual storage path?

masonlr commented 6 years ago

We could extract a subset of files from a container,

# get all blobs in container
generator = block_blob_service.list_blobs('container') 

for blob in generator:
    # here we add logic to filter the retrieved list by job-id prefix
    print(blob.name)

martintoreilly commented 6 years ago

It's easy to filter a query for blobs by a prefix, but wildcards within the prefix aren't supported. I'm thinking something like job-id/outputs/summary.png. We could also support subfolders (e.g. images, videos, raw-data etc) if required.

In pseudocode: get_blobs(account, container, filter = 'job_id' would get all files for a job get_blobs(account, container, filter = 'job_id/outputs' would get all output files for a job get_blobs(account, container, filter = 'job_id/outputs/videos' would get all output files in the videos subdirectory for a job

masonlr commented 6 years ago

For reference, https://stackoverflow.com/questions/14440506/how-to-query-cloud-blobs-on-windows-azure-storage

martintoreilly commented 6 years ago

In real code, we would specify the prefix parameter in the call to list_blobs

block_blob_service.list_blobs(container_name, prefix=None,...)

masonlr commented 6 years ago

We could also support subfolders

Yes, definitely required. Eg, openfoam creates a directory for each time step.

martintoreilly commented 6 years ago

To make things concrete, can we look at the structure for a real job (all files) and the structure you would like this to be in persistent storage?

masonlr commented 6 years ago

Okay got it, list_blobs() takes a prefix.

martintoreilly commented 6 years ago

It also takes a delimiter, which I think means that the behaviour changes from getting all blobs with a particular prefix (i.e. in any virtual subdirectory) to only returning blobs where the delimiter is not in the blob name after the prefix, plus also returning unique prefixes representing the immediate child virtual directories (i.e. the behaviour you'd expect from running ls locally if with real directories). You should sanity check this though, as I haven't actually tried it myself.

masonlr commented 6 years ago

Do you still have access to the science-gateway-cluster VM? I just turned in on.

masonlr commented 6 years ago

Head to /home/vm-admin/OpenFOAM/vm-admin-5.0/run/cavity

masonlr commented 6 years ago

vm-admin@science-gateway-cluster:~/OpenFOAM/vm-admin-5.0/run/cavity$ ls *
0:
p  U

0.1:
p  phi  U  uniform

0.2:
p  phi  U  uniform

0.3:
p  phi  U  uniform

0.4:
p  phi  U  uniform

0.5:
p  phi  U  uniform

constant:
polyMesh  transportProperties

system:
blockMeshDict  controlDict  fvSchemes  fvSolution

VTK:
cavity_0.vtk    cavity_20.vtk  cavity_60.vtk  fixedWalls    movingWall
cavity_100.vtk  cavity_40.vtk  cavity_80.vtk  frontAndBack

masonlr commented 6 years ago

vm-admin@science-gateway-cluster:~/OpenFOAM/vm-admin-5.0/run/cavity$ find .
.
./0
./0/U
./0/p
./0.5
./0.5/U
./0.5/p
./0.5/phi
./0.5/uniform
./0.5/uniform/time
./system
./system/fvSchemes
./system/blockMeshDict
./system/fvSolution
./system/controlDict
./0.3
./0.3/U
./0.3/p
./0.3/phi
./0.3/uniform
./0.3/uniform/time
./constant
./constant/polyMesh
./constant/polyMesh/neighbour
./constant/polyMesh/faces
./constant/polyMesh/boundary
./constant/polyMesh/owner
./constant/polyMesh/points
./constant/transportProperties
./0.1
./0.1/U
./0.1/p
./0.1/phi
./0.1/uniform
./0.1/uniform/time
./0.2
./0.2/U
./0.2/p
./0.2/phi
./0.2/uniform
./0.2/uniform/time
./0.4
./0.4/U
./0.4/p
./0.4/phi
./0.4/uniform
./0.4/uniform/time
./VTK
./VTK/cavity_40.vtk
./VTK/fixedWalls
./VTK/fixedWalls/fixedWalls_60.vtk
./VTK/fixedWalls/fixedWalls_0.vtk
./VTK/fixedWalls/fixedWalls_20.vtk
./VTK/fixedWalls/fixedWalls_100.vtk
./VTK/fixedWalls/fixedWalls_40.vtk
./VTK/fixedWalls/fixedWalls_80.vtk
./VTK/movingWall
./VTK/movingWall/movingWall_40.vtk
./VTK/movingWall/movingWall_80.vtk
./VTK/movingWall/movingWall_0.vtk
./VTK/movingWall/movingWall_20.vtk
./VTK/movingWall/movingWall_60.vtk
./VTK/movingWall/movingWall_100.vtk
./VTK/cavity_60.vtk
./VTK/cavity_80.vtk
./VTK/cavity_100.vtk
./VTK/cavity_0.vtk
./VTK/frontAndBack
./VTK/frontAndBack/frontAndBack_80.vtk
./VTK/frontAndBack/frontAndBack_40.vtk
./VTK/frontAndBack/frontAndBack_60.vtk
./VTK/frontAndBack/frontAndBack_20.vtk
./VTK/frontAndBack/frontAndBack_100.vtk
./VTK/frontAndBack/frontAndBack_0.vtk
./VTK/cavity_20.vtk

masonlr commented 6 years ago

Cloud storage could be triggered from a STORE action word and accompanying storage simulator-side script. To begin with, all jobs could call the STORE action on completion. This could be made user configurable in the future.

masonlr commented 6 years ago

Thinking out loud about the details. For byo cluster:

Job is submitted from login node via run.sh (this is a short running script as it only runs a qsub pbs.sh command)
We're not using an inverted push updates/notifications model yet (or alternatively we don't have a continuous running simulator-side polling script), so any storage commands will need to be included at the end of the pbs.sh script. Complication here is that a compute node would be running a data transfer process (will need to check that this is possible), i.e. we're spending compute node time on a process that could be handled by the login node.

masonlr commented 6 years ago

We could enforce that the compute node transfer all outputs to the login node. This way when we later call a STORE action, we can source content from the login node for transfer to cloud storage. The issue is that the compute node and its mounted scratch space are ephemeral.

masonlr commented 6 years ago

We likely need to support a few different configurations:

'Job manager' on Azure web app service, 'Storage' on Azure blob storage. Direct communication between simulator and Azure blob storage to minimise load on middleware (?).
'Job manager' on internal organisation server, 'Storage' on internal organisation server (possibly co-located with 'Job manager')

masonlr commented 6 years ago

Maybe we have the compute node passing output data directly to 'Storage' and, at the same time, sending a request to the 'Job manager' to mark the output file as "stored": true or similar. This way we're not burdening the middleware file transfer bandwidth.

masonlr commented 6 years ago

One hard part will be getting file transfers to occur while the job is still running. We need a background process in the pbs.sh script which will periodically check for output. This assumes that we don't want to write file transfer code into the job's conventional simulation code (i.e. Blue, openfoam, etc.).

masonlr commented 6 years ago

For direct transfer from compute node to Azure blob storage, we should look into https://rclone.org/ and similar command line tools.

rclone can transfer only files that have changed. Hence, we periodically call rclone and without having track which files have already been transferred.

masonlr commented 6 years ago

I'd strongly suggest we stick with a single level job-id/within/job/folder/structure/filename naming convention for blobs and manage groupings by user / organisation / team / case as metadata stored in the middleware database.

rclone supports this: i.e.

rclone local_data remote:blue/jobid

will create a virtual directory in the 'blue' container called 'jobid' and places the contents of local_data/ there.

We end up with URLs similar to: https://sgmiddleware.blob.core.windows.net/blue/jobid/1/output_03.vtk

masonlr commented 6 years ago

Filtering will be important (https://rclone.org/filtering/) for synching the entire output structure and having the ability to ignore intermediate log files, intermediate csv files, etc.

masonlr commented 6 years ago

@martintoreilly we could go with a daemon cloud storage transfer mechanism. There's a rough prototype here: https://github.com/alan-turing-institute/science-gateway-middleware/tree/sandbox/rclone_daemon

We might need something like this if we want the users to be able to see intermediate output. The user story I have in mind is someone wanting to monitor intermediate VTK graphics when deciding to stop/cancel a running job.

This will relay a subset (configurable via include-file.txt) of intermediate data to Azure blob storage, but we should also support similar relaying to an internal organisation storage server (i.e. when the middleware and storage system are served from within the organisation). We'll need to work this out by dependency injection into the job manager system.

masonlr commented 6 years ago

4f5556e5130df84dbfca7cc152bf04b00d5501ac connects openfoam to Azure storage. Full directory structure is copied to Azure: patched input files and full set of output files.

One idea is to auto-generate include-file.txt using the JSON outputs list. This way we can have fine-grained control over which outputs are stored by tweaking the JSON outputs list. rclone has strong support for * wildcards so we can potentially use these to ensure that filtered sets/sequences of output files are stored on Azure.

Edit: this way links will be grouped by job_id (eg, pressure field p stored as https://sgmiddleware.blob.core.windows.net/openfoam/1c43fc8d-5d1c-43b7-9e07-5c2ea76df0b1/0.45/p) which is relayed from the middleware to the simulator via a mako-templated job_id variable: https://github.com/alan-turing-institute/science-gateway-middleware/blob/4f5556e5130df84dbfca7cc152bf04b00d5501ac/middleware/job_information_manager.py#L114

screen shot 2018-01-08 at 3 15 08 pm

masonlr commented 6 years ago

A key idea in the PBS script is that we must ensure that the storage sync script completes a full iteration after the simulation finishes. This is achieved by capturing the storage daemon PID (STORAGE_DAEMON_PID=$!), killing the daemon, then running the daemon script exactly once.

#!/bin/bash
#PBS -j oe
#PBS -o "TEST.out"

STORAGE_SCRIPT='./bin/storage_sync_azure.sh'
STORAGE_SYNC_FREQUENCY=10

echo "Start PBS"

# kill all child processes on exit
# note: this is likely handled automatically by the schedular
# trap "exit" INT TERM
# trap "kill 0" EXIT

set -vx

# emulate running in TMPDIR as per Imperial College cx1
TMPDIR="/tmp/pbs.$PBS_JOBID"

echo $TMPDIR

mkdir -p $TMPDIR

cp -r $PBS_O_WORKDIR/* $TMPDIR  # TODO explicitly copy only required files
cd $TMPDIR

# run storage daemon loop and collect its PID number
(while `true`; do $STORAGE_SCRIPT; sleep $STORAGE_SYNC_FREQUENCY; done) &
STORAGE_DAEMON_PID=$!

source /opt/openfoam5/etc/bashrc
./Allrun

# here we ensure cloud storage is complete, before local cluster storage

# STORAGE SYSTEM A: cloud provider storage
# kill storage daemon loop and ensure that it completes
# one full cycle
kill STORAGE_DAEMON_PID
$STORAGE_SCRIPT

# STORAGE SYSTEM B: local cluster storage
# copy back timestep information
for timestep in $(foamListTimes); do
  cp -r $TMPDIR/$timestep $PBS_O_WORKDIR
done

echo "End PBS"

alan-turing-institute / science-gateway-middleware

Long-term file storage #53

Resources