Light Hadoop-Galaxy integration.
Did you ever want to use Hadoop-based programs in your Galaxy workflows? This project can help you out with that. Hadoop-Galaxy provides a light integration between Hadoop-based tools and Galaxy, allowing you to run Hadoop tools from Galaxy and mix them with regular tools in your workflows.
Install the hadoop_galaxy Tool Shed repository on your Galaxy installation. It will add a new pathset data type and a few tools to your Galaxy menu which you may need in your workflows. Finally, it will install a Python executable, hadoop_galaxy (together with its dependencies), which is the adapter you need to run Hadoop-based programs from Galaxy.
You'll want to configure Hadoop-Galaxy to work appropriately for your computing environment. There are only a few configuration values and you can specify them via environment variables.
HADOOP_GALAXY_DATA_DIR: path where the hadoop_galaxy adapter will have the Hadoop jobs write their output data. To have your Hadoop jobs write to HDFS, set this variable to something like hdfs://your.name.node:9000/user/galaxy/workspace. If it is not set, the output goes to a hadoop_output directory in the Galaxy workspace; note that, as a result, by default the Hadoop jobs will not write to HDFS.
HADOOP_GALAXY_PUT_DIR: the Hadoop-accessible directory to which the put_dataset tool (see below) will copy Galaxy datasets.
HADOOP_GALAXY_LOG_LEVEL: log level for the tools and the adapter. Must be one of 'debug', 'info', 'warn', 'error', 'critical'. This variable is not yet respected by all Hadoop-Galaxy components (work in progress).
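For instance, you might set these variables in the environment from which Galaxy is started; the values below are only illustrative placeholders for a hypothetical site:

# Example Hadoop-Galaxy configuration (illustrative values; adjust for your site)
export HADOOP_GALAXY_DATA_DIR=hdfs://your.name.node:9000/user/galaxy/workspace
export HADOOP_GALAXY_PUT_DIR=hdfs://your.name.node:9000/user/galaxy/put_data
export HADOOP_GALAXY_LOG_LEVEL=info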
You'll need to write a Galaxy wrapper for your tool. For example, see the wrapper for dist_text_zipper:
<tool id="hg_dtz" name="Dist TextZipper">
  <description>Compress lots of text</description>
  <command>
    hadoop_galaxy
    --input $input
    --output $output
    --executable dist_text_zipper
  </command>
  <inputs>
    <param name="input" type="data" format="pathset"/>
  </inputs>
  <outputs>
    <data name="output" format="pathset" />
  </outputs>
  <stdio>
    <exit_code range="1:" level="fatal" />
  </stdio>
</tool>
dist_text_zipper is a Hadoop program for compressing text data. It is bundled with Hadoop-Galaxy and its executable is found in the PATH. It takes two arguments: one or more input paths and an output path. To use it through Galaxy, we call dist_text_zipper through the hadoop_galaxy adapter, specifying the input pathset ($input), the output pathset ($output), and the executable to run (dist_text_zipper).
When writing your own wrapper for another Hadoop program, just replace dist_text_zipper with whatever the executable name is. If your program takes more arguments, append them at the end of the command, after the arguments you see in the example above; here is an example from seal-galaxy:
hadoop_galaxy --input $input_data --output $output1 --executable seal bcl2qseq --ignore-missing-bcl
The adapter will take care of managing the pathset files provided by Galaxy and will pass appropriate paths to the Hadoop program.
Here's an example pathset file:
# Pathset Version:0.0 DataType:Unknown
hdfs://hdfs.crs4.int:9000/data/sample-cs3813-fastq-r1
hdfs://hdfs.crs4.int:9000/data/sample-cs3813-fastq-r2
Pathsets are a list of paths. Each path represents a part of the entire dataset. Directories include all files under their entire tree, in alphabetical order. You can also use the shell-like wildcard patterns ?, * and []. Order is important: the order of the parts determines the order of the data in the overall dataset.
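For instance, a single pathset can mix directories and wildcard patterns; the paths below are purely illustrative:

# Pathset Version:0.0 DataType:Unknown
hdfs://hdfs.crs4.int:9000/data/sample-cs3813-fastq-r1
hdfs://hdfs.crs4.int:9000/data/run2/part-r-*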
In your workflows you can mix Hadoop-based and conventional programs, though not automatically. You'll need to keep track of when the data is on Galaxy storage/format and when it's on Hadoop storage/format -- and use the Hadoop-Galaxy utilities to shuffle between them.
Consider the following simple (admittedly contrived) workflow to acquire some DNA data from a service (e.g., EBI ENA) and compress it using your Hadoop cluster and the dist_text_zipper tool (run through the hadoop_galaxy adapter) provided with Hadoop-Galaxy.
 Conventional access    |    Hadoop access
                        |
 +-----------------+    |
 | 1. Download DNA |    |
 +-----------------+    |
            \           |
             +-----------------+
             | 2. make_pathset |
             +-----------------+
                        |   \
                        |    +---------------------+
                        |    | 3. dist_text_zipper |
                        |    +---------------------+
                        |   /
               +---------------+
 compressed -->| 4. cat_paths  |
 DNA file      +---------------+
                        |
                        |
Dataset 1 is a conventional Galaxy dataset. In step 2, make_pathset creates a pathset that references it so that it can be accessed by the Hadoop program in step 3. If the Hadoop nodes cannot access Galaxy's storage space directly, you'll have to insert an intermediate put_dataset step to copy the data from one storage to the other. The last step concatenates the output part files produced by the Hadoop program into a single file, simultaneously copying them to the Galaxy workspace.
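For instance, after step 2 the pathset produced by make_pathset simply references the regular Galaxy dataset file from step 1; it could look something like this (the dataset path is hypothetical):

# Pathset Version:0.0 DataType:Unknown
/var/galaxy/database/files/000/dataset_1.dat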
make_pathset: create a new pathset that references files or directories provided as input. The input can be provided by connecting the output of another Galaxy tool, thus providing a bridge to take a "regular" Galaxy dataset and pass it to a Hadoop-based tool. Alternatively, the input can be given as a direct parameter, which can be useful, for instance, for creating workflows where the user specifies the input path as an argument.
cat_paths: take the list of part files that make up a dataset and concatenate them into a single file.
This tool effectively provides the operation inverse to make_pathset: while make_pathset creates a level of indirection by writing a new pathset that references data files, cat_paths copies the referenced data into a single file. The new destination file exists within Galaxy's workspace and can therefore be used by other standard Galaxy tools.
cat_paths also provides a distributed mode that uses the entire Hadoop cluster to copy data chunks to the same file in parallel. For this feature to work, the Galaxy workspace must be on a parallel shared file system accessible by all Hadoop nodes.
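Conceptually, in its basic (non-distributed) mode cat_paths does something akin to the following; this is only an illustrative sketch with made-up paths, not the actual implementation:

# Concatenate the Hadoop output part files into a single file in the Galaxy workspace
hadoop fs -cat hdfs://your.name.node:9000/user/galaxy/workspace/job_out/part-* \
    > /var/galaxy/database/files/000/dataset_42.dat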
put_dataset: copy data from the Galaxy workspace to Hadoop storage. If your configuration has the Galaxy workspace on a storage volume that is not directly accessible by the Hadoop cluster, you'll need to copy your data to the Hadoop storage before feeding it to Hadoop programs. For this task Hadoop-Galaxy provides the put_dataset tool, which works in a way analogous to the command hadoop dfs -put <local file> <remote file>.
The put_dataset operation receives a pathset as input, copies the referenced data to the new location and writes the new path to the output pathset. It is important to keep in mind that this copy operation needs to run directly on the server that has access to the Galaxy workspace, so it will be time-consuming for large files. If large files need to be passed between the Galaxy and Hadoop workspaces, a much faster solution is to have the Galaxy workspace on a parallel shared file system that can be accessed directly by the Hadoop cluster, thus eliminating the need for put_dataset.
split_pathset: split a pathset into two parts based on path or filename. It lets you define a regular expression as a test: the path elements are then placed accordingly in the "match" or "mismatch" output pathset. This type of tool is handy in cases where the Hadoop tool associates a meaning with the output paths. For instance, some tools in the Seal suite for processing sequencing data can separate DNA fragments based on whether they were produced in the first or second sequencing phase (i.e., read 1 or read 2), putting them each under a separate directory; with this tool you can split the Galaxy dataset into two parts and process them separately (see the sketch below).
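As a purely hypothetical illustration (the paths and the regular expression are made up), splitting such a dataset could work like this:

# Input pathset entries
#   hdfs://hdfs.crs4.int:9000/output/read_1
#   hdfs://hdfs.crs4.int:9000/output/read_2
# With the regular expression  .*read_1$  split_pathset would route them as:
#   match pathset:     hdfs://hdfs.crs4.int:9000/output/read_1
#   mismatch pathset:  hdfs://hdfs.crs4.int:9000/output/read_2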
dist_text_zipper: a tool for parallel (Hadoop-based) compression of text files. Although this tool is not required for the use of Hadoop-Galaxy in user workflows, it is a generally useful utility that doubles as an example illustrating how to integrate Hadoop-based tools with Galaxy using our adapter.
In most cases, Hadoop tools cannot be used directly from Galaxy for two reasons: first, Galaxy handles datasets as files on a mounted file system, while Hadoop programs typically read and write their data on storage that is not mounted, such as HDFS; second, Hadoop programs usually produce a set of output part files rather than the single file Galaxy expects for a dataset.
We've come up with a solution that works by adding a layer of indirection. We created a new Galaxy datatype, the pathset, to which we write the HDFS paths (or paths on any other non-mounted filesystem, such as S3). Pathsets are serialized as text files that are handled directly by Galaxy as datasets.
Then, Hadoop-Galaxy provides an adapter program through which you call your Hadoop-based program. Rather than passing the data paths directly, you pass the adapter the pathsets; it takes care of reading the pathset files and passing the real data paths to your program.
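To make the indirection concrete, here is a simplified sketch (not the real implementation; paths and names are made up) of what the adapter does for the dist_text_zipper wrapper shown earlier:

# 1. read the real data paths from the input pathset, skipping the header line
INPUT_PATHS=$(grep -v '^#' input.pathset)
# 2. choose an output location for the Hadoop job under HADOOP_GALAXY_DATA_DIR
OUT=$HADOOP_GALAXY_DATA_DIR/job_0001
# 3. run the actual Hadoop program on the real paths
dist_text_zipper $INPUT_PATHS $OUT
# 4. write a new pathset that references the job's output
printf '# Pathset Version:0.0 DataType:Unknown\n%s\n' "$OUT" > output.pathset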
An implication of this layer of indirection is that Galaxy knows nothing about your actual data. Because of this, removing a Galaxy dataset does not delete the files produced by your Hadoop runs, potentially wasting a lot of space. In addition, as is typical with pointers, you can end up in situations where multiple pathsets point to the same data, or where they point to data that you want to access from Hadoop but would not want to delete (e.g., your run directories).
A proper solution would include a garbage collection system to be run with Galaxy's clean up action, but we haven't implemented this yet. Instead, at the moment we handle this issue as follows. Since we only use Hadoop for intermediate steps in our pipeline, we don't permanently store any of its output. So, we write this data to a temporary storage space. From time to time, we stop the pipeline and remove the entire contents.
If you publish work that uses Hadoop-Galaxy, please cite the Hadoop-Galaxy article.