NICTA / scoobi

A Scala productivity framework for Hadoop.
http://nicta.github.com/scoobi/

Add a DObject API for saving and loading DObjects #175

Closed: etorreborre closed this 11 years ago

etorreborre commented 11 years ago

See #172 for the new persist API

blever commented 11 years ago

Scoobi has the ability to persist DLists to files on HDFS when the computation graph is executed, but there are no APIs that will persist DObjects to files on HDFS as part of this process.

An example of what we'd like to be able to do:

val big_table: DList[(String, Int)] = ...
val little_table: DObject[Seq[(String, Int)]] = ...
persist(toTextFile(big_table, "foo"), toTextFile(little_table, "bar"))

On the call to persist, Scoobi will persist both big_table and little_table as text files on HDFS. The differences are:

The DList persist helpers (toTextFile and friends) should all be portable so that they can be applied to a DObject if it is of type Iterable[T] or something similar. If it is simply of type A, the approach may be to internally convert it to an Iterable[T].
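
One way to picture the "Iterable[T], else wrap a plain A" rule is a type class with a low-priority fallback. The sketch below is plain Scala with no Scoobi dependency; TextRecords, textLines and DObjectPersistSketch are hypothetical names invented for illustration and are not part of Scoobi's API. It only shows how a value could be turned into text records, not how a sink would be wired into the computation graph.

```scala
// Hypothetical sketch: none of these names exist in Scoobi.
object DObjectPersistSketch {

  // Describes how a value held by a DObject becomes lines of text for a sink.
  trait TextRecords[A] {
    def lines(a: A): Iterable[String]
  }

  trait LowPriorityTextRecords {
    // Fallback: a plain value of type A is treated as a one-element collection.
    implicit def single[A]: TextRecords[A] = new TextRecords[A] {
      def lines(a: A): Iterable[String] = Seq(a.toString)
    }
  }

  object TextRecords extends LowPriorityTextRecords {
    // Preferred instance: anything that is already an Iterable yields one line per element.
    implicit def iterable[C <: Iterable[_]]: TextRecords[C] = new TextRecords[C] {
      def lines(c: C): Iterable[String] = c.map(_.toString)
    }
  }

  // Resolves the appropriate instance for A and renders the value as lines.
  def textLines[A](value: A)(implicit ev: TextRecords[A]): Iterable[String] =
    ev.lines(value)
}
```

With this shape, a value like the Seq[(String, Int)] in little_table would produce one line per pair, while a bare Int would produce a single line. Whether the real helpers should dispatch this way is an open design question.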

blever commented 11 years ago

As a follow-up, it should also be possible to leverage DataSources for loading into a DObject. Just as we should be able to reuse an OutputFormat for persisting, we should be able to reuse an InputFormat for loading. This might also mean we can have very similar loading APIs for DObjects as we do for DLists. For example, fromTextFile and fromAvroFile would "just work".
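
A toy model of that idea, assuming a single source abstraction can back both kinds of loader. Everything here (Source, ToyDList, ToyDObject, textSource, loadDList, loadDObject) is a hypothetical stand-in written for this sketch. It uses in-memory data instead of HDFS and eager values instead of Scoobi's lazy graph nodes.

```scala
// Hypothetical sketch: these are stand-ins, not Scoobi types or functions.
object DObjectLoadSketch {

  // Stand-in for a data source: something that yields records when read.
  trait Source[T] {
    def read(): Iterator[T]
  }

  // Toy, eager stand-ins for the distributed types.
  final case class ToyDList[T](elems: Vector[T])
  final case class ToyDObject[A](value: A)

  // In-memory "text file" so the sketch runs without a cluster.
  def textSource(lines: Seq[String]): Source[String] = new Source[String] {
    def read(): Iterator[String] = lines.iterator
  }

  // The same source feeds either loader; only the resulting wrapper differs.
  def loadDList[T](src: Source[T]): ToyDList[T] =
    ToyDList(src.read().toVector)

  def loadDObject[T](src: Source[T]): ToyDObject[Seq[T]] =
    ToyDObject(src.read().toVector)
}
```

The point of the sketch is only that the loader functions differ in their result type, not in how the underlying source is read.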

blever commented 11 years ago

Once this is implemented, we should circle back and update this thread: https://groups.google.com/forum/?fromgroups=#!topic/scoobi-users/pBmQUvltNrs