Closed etorreborre closed 11 years ago
Scoobi has the ability to persist DLists to files on HDFS when the computation graph is executed, but there are no APIs that will persist DObjects to files on HDFS as part of this process.
An example of what we'd like to be able to do:
val big_table: DList[(String, Int)] = ...
val little_table: DObject[Seq[(String, Int)]] = ...
persist(toTextFile(big_table, "foo"), toTextFile(little_table, "bar"))
On the call to persist, Scoobi will persist both big_table and little_table as text files on HDFS. The differences are:
- big_table, a DList, is persisted via a DataSink invoked in a reducer task, and the final write is then performed by a referenced OutputFormat;
- little_table, a DObject, needs to be persisted from the client "task" and ideally leverages existing DataSinks, e.g. using a DataSink and its referenced OutputFormat to write out a file on the client (... DataSink and OutputFormat objects in a similar way so should be possible).
The DList persist helpers (toTextFile and friends) should all be portable so that they can be applied to a DObject if it is of type Iterable[T] or something similar. If it is simply of type A, the approach may be to internally convert it to an Iterable[T].
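To make the idea concrete, here is a minimal sketch of client-side persistence in plain Scala, with no Scoobi or Hadoop dependency. LineSink, ClientPersist, persistIterable and persistSingle are hypothetical names invented for illustration; they are not Scoobi's DataSink or persist API, and the tab-separated format is an assumption. The sketch only shows the two cases described above: an Iterable[T] written element by element, and a plain A wrapped into a one-element Iterable first.

```scala
import java.io.Writer

// Hypothetical stand-in for a sink: it knows how to render one record as a line.
// This is NOT Scoobi's DataSink; it only models reusing one record writer
// when the write happens on the client instead of in a reducer task.
trait LineSink[T] {
  def format(record: T): String

  // Client-side write: iterate locally and emit one line per record.
  def writeAll(records: Iterable[T], out: Writer): Unit = {
    records.foreach { r =>
      out.write(format(r))
      out.write("\n")
    }
    out.flush()
  }
}

object ClientPersist {
  // Tab-separated rendering of (String, Int) pairs, chosen for illustration.
  val pairSink: LineSink[(String, Int)] = new LineSink[(String, Int)] {
    def format(record: (String, Int)): String = record._1 + "\t" + record._2
  }

  // A value that is already an Iterable[T] is written element by element.
  def persistIterable[T](value: Iterable[T], sink: LineSink[T], out: Writer): Unit =
    sink.writeAll(value, out)

  // A plain value of type A is first wrapped in a one-element Iterable.
  def persistSingle[A](value: A, sink: LineSink[A], out: Writer): Unit =
    sink.writeAll(Seq(value), out)
}
```

In a real implementation the Writer would presumably be replaced by whatever record writer the existing sink's OutputFormat provides; that wiring is left out here because it depends on Scoobi internals.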
As a follow-up, it should also be possible to leverage DataSources for loading into a DObject. Just as we should be able to reuse an OutputFormat for writing, we should be able to reuse an InputFormat for reading. This might also mean we can have very similar loading APIs for DObjects as we do for DLists. For example, fromTextFile and fromAvroFile would "just work".
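The loading direction can be sketched the same way, again in plain Scala without Scoobi or Hadoop. LineSource, ClientLoad and readAll are hypothetical names, not Scoobi's DataSource or fromTextFile, and the tab-separated layout is an assumption. The point is only that a record parser could be run on the client to materialise a whole file as one in-memory value.

```scala
import java.io.{BufferedReader, Reader}
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a source: it knows how to parse one line into a record.
// This is NOT Scoobi's DataSource; it only models client-side loading.
trait LineSource[T] {
  def parse(line: String): T

  // Client-side read: consume every line and collect the parsed records.
  def readAll(in: Reader): Seq[T] = {
    val reader = new BufferedReader(in)
    val records = ArrayBuffer.empty[T]
    var line = reader.readLine()
    while (line != null) {
      records += parse(line)
      line = reader.readLine()
    }
    records.toSeq
  }
}

object ClientLoad {
  // Parses "key<TAB>count" lines into (String, Int) pairs (format is an assumption).
  val pairSource: LineSource[(String, Int)] = new LineSource[(String, Int)] {
    def parse(line: String): (String, Int) = {
      val fields = line.split("\t", 2)
      (fields(0), fields(1).toInt)
    }
  }
}
```

The resulting Seq[(String, Int)] is the kind of value a DObject[Seq[(String, Int)]] such as little_table would hold.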
Once this is implemented, we should circle back and update this thread: https://groups.google.com/forum/?fromgroups=#!topic/scoobi-users/pBmQUvltNrs
See #172 for the new persist API