Add a DObject API for saving and loading DObjects

etorreborre commented 11 years ago

See #172 for the new persist API

blever commented 11 years ago

Scoobi has the ability to persist DLists to files on HDFS when the computation graph is executed, but there are no APIs that will persist DObjects to files on HDFS as part of this process.

An example of what we'd like to be able to do:

val big_table: DList[(String, Int)] = ...
val little_table: DObject[Seq[(String, Int)]] = ...
persist(toTextFile(big_table, "foo"), toTextFile(little_table, "bar"))

On the call to persist, Scoobi will persist both big_table and little_table as text files on HDFS. The differences are:

big_table, a DList, is persisted via a DataSink invoked in a reducer task, and the final write is performed by the DataSink's referenced OutputFormat;
little_table, a DObject, needs to be persisted from the client "task" and should ideally leverage existing DataSinks, e.g.
- use a DataSink and its referenced OutputFormat to write out a file on the client
- copy the local file to the target filesystem, e.g. HDFS
- (note: in-memory mode is able to use DataSink and OutputFormat objects in a similar way, so this should be possible)
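The two client-side steps above (write a local file, then copy it to the target filesystem) could be sketched roughly as follows. This is only an illustration under stated assumptions: persistLocally is a hypothetical helper and not part of Scoobi, java.nio stands in for both the DataSink/OutputFormat write and the copy to HDFS, and each element is simply written with toString.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object ClientSidePersist {
  // Hypothetical helper (not a Scoobi API): persists an in-memory collection
  // from the client in two steps, mirroring the list above.
  def persistLocally[T](values: Iterable[T], target: Path): Path = {
    // Step 1: write the values to a local temporary file, one element per line.
    // In Scoobi this is where a DataSink and its OutputFormat would be reused.
    val tmp = Files.createTempFile("dobject-", ".txt")
    val writer = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)
    try values.foreach { v => writer.write(v.toString); writer.newLine() }
    finally writer.close()

    // Step 2: copy the local file to the target location. Assumption: a plain
    // file copy stands in for the copy to the target filesystem (e.g. HDFS).
    Option(target.getParent).foreach(p => Files.createDirectories(p))
    Files.copy(tmp, target, StandardCopyOption.REPLACE_EXISTING)
    Files.delete(tmp)
    target
  }
}
```

With a real cluster, step 2 would presumably go through the Hadoop FileSystem API (for example copyFromLocalFile) rather than java.nio.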

The DList persist helpers (toTextFile and friends) should all be portable so that they can be applied to a DObject if it is of type Iterable[T] or something similar. If it is simply of type A, the approach may be to internally convert it to an Iterable[T].
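One way to picture the conversion described above is a small helper that passes collections through and wraps a single value in a one-element sequence. This is a loosely typed sketch only: asIterable is a hypothetical name, it is not a Scoobi function, and it decides at runtime rather than in the type system.

```scala
object DObjectValues {
  // Hypothetical helper (not a Scoobi API): view any value as an Iterable so
  // that the same persist helpers could be applied to it.
  def asIterable(value: Any): Iterable[Any] = value match {
    case it: Iterable[_] => it          // already a collection: use it as-is
    case single          => Seq(single) // a plain value of type A: wrap it
  }
}
```

A typed variant (for instance a type class with a low-priority instance for plain values) would keep the element type T, at the cost of more machinery.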

blever commented 11 years ago

As a follow-up, it should also be possible to leverage DataSources for loading into a DObject. Just as we should be able to reuse an OutputFormat for saving, we should be able to reuse an InputFormat for loading. This might also mean we can have very similar loading APIs for DObjects as we do for DLists. For example, fromTextFile and fromAvroFile would "just work".
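The loading direction can be pictured the same way: read a file on the client and hold the parsed records in memory, which is roughly what a DObject-flavoured fromTextFile would need to do. The sketch below is an assumption-laden illustration: loadLines is a hypothetical helper, not a Scoobi API, and scala.io.Source stands in for a DataSource and its InputFormat.

```scala
import java.nio.file.Path
import scala.io.Source

object ClientSideLoad {
  // Hypothetical helper (not a Scoobi API): read a text file on the client
  // and parse each line into a record held in memory.
  def loadLines[T](path: Path)(parse: String => T): Seq[T] = {
    val src = Source.fromFile(path.toFile, "UTF-8")
    // toList forces the whole file to be read before the source is closed.
    try src.getLines().map(parse).toList
    finally src.close()
  }
}
```

For example, a file of "name,count" lines could be loaded with a parser that splits each line on the comma and converts the second field to an Int.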

blever commented 11 years ago

Once this is implemented, we should circle back and update this thread: https://groups.google.com/forum/?fromgroups=#!topic/scoobi-users/pBmQUvltNrs

NICTA / scoobi, issue #175