DmitryMezhensky / Hadoop-and-Swift-integration

API to run Hadoop MapReduce programs over Swift
20 stars 15 forks source link

Java library for Hadoop and Swift integration.

Documentation

Documentation and tutorial can be find on project Wiki

License

The use and distribution terms for this software are covered by Apache License 2.0

References

Reference to Hadoop issue HADOOP-8545

Abstract

This patch enables support of OpenStack Swift Object Storage. Swift storage hierarchy is account -> container -> object. Account contains containers (Amazon S3 buckets analogue), each container holds huge number of objects, which are stored as blob.

Hadoop can work with different Swift regions/installations, public or private.

It can also work with the same object stores using multiple login details.

Hadoop and Swift integration concepts

Containers are created by users with accounts on the Swift filestore, and hold objects.

Client applications talk to Swift over HTTP or HTTPS, reading, writing and deleting objects using standard HTTP operations (GET, PUT and DELETE, respectively). There is also a COPY operation, that creates a new object in the container, with a new name, containing the old data. There is no rename operation itself, objects need to be copied -then the original entry deleted.

The Swift Filesystem is eventually consistent: an operation on an object may not be immediately visible to that client, or other clients. This is a consequence of the goal of the filesystem: to span a set of machines, across multiple datacentres, in such a way that the data can still be available when many of them fail. (In contrast, the Hadoop HDFS filesystem is immediately consistent, but it does not span datacenters.)

Eventual consistency can cause surprises for client applications that expect immediate consistency: after an object is deleted or overwritten, the object may still be visible -or the old data still retrievable. The Swift Filesystem client for Apache Hadoop attempts to handle this, in conjunction with the MapReduce engine, but there may be still be occasions when eventual consistency causes suprises.

Warnings

  1. Do not share your login details with anyone, which means do not log the details, or check the XML configuration files into any revision control system to which you do not have exclusive access.
  2. Similarly, no use your real account details in any documentation.
  3. Do not use the public service endpoint from within an OpenStack cluster, as it will run up large bills.
  4. Remember: it's not a real filesystem or hierarchical directory structure. Some operations (directory rename and delete) take time and are not atomic or isolated from other operations taking place.
  5. Append is not supported.
  6. Unix-style permissions are not supported. All accounts with write access to a repository have unlimited access; the same goes for those with read access.
  7. In the public clouds, do not make the containers public unless you are happy with anyone reading your data, and are prepared to pay the costs of their downloads.

Limits