TresAmigosSD / SMV

Spark Modularized View
Apache License 2.0

Provide API to write to HDFS #1039

Closed Kai-Chen closed 6 years ago

Kai-Chen commented 6 years ago

This needs to be usable from Python, so it may involve converting a Python stream to a Java stream.
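As a rough illustration of what such a conversion could look like on the Python side, one option is to read the stream in fixed-size chunks and hand each chunk to whatever writes on the Java side. This is only a sketch: `iter_chunks`, `pump`, and the `write_chunk` callback are hypothetical names, and an in-memory buffer stands in for the Java-side writer.

```python
import io


def iter_chunks(stream, chunk_size=8192):
    """Yield successive bytes chunks from a binary file-like object."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        yield chunk


def pump(stream, write_chunk, chunk_size=8192):
    """Pass every chunk to write_chunk and return the total byte count.

    write_chunk is a hypothetical stand-in for the Java-side writer.
    """
    total = 0
    for chunk in iter_chunks(stream, chunk_size):
        write_chunk(chunk)
        total += len(chunk)
    return total


# Demo: an in-memory source and sink instead of a real upload and HDFS.
source = io.BytesIO(b"x" * 20000)
sink = io.BytesIO()
copied = pump(source, sink.write)
```

Reading in chunks keeps memory use bounded regardless of file size, which matters if the uploads are large.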

jacobdr commented 6 years ago

It would be great if this made it easier to start writing ORC and Parquet files.

Jacob

AliTajeldin commented 6 years ago

ORC/Parquet are outside the scope of this ticket.

jacobdr commented 6 years ago

We’ll talk about this offline, but given how standard those formats are, I wanted to put flexibility on the mind of the ticket implementer. But I guess that is out of scope for this (vaguely written) ticket...

Jacob

Kai-Chen commented 6 years ago

@kmannislands After the PR is merged, see testHdfsConn.py for an example of use.

The entry point from Python is SmvApp.copyToHdfs.

The only expectation is that the file object from the upload is opened in binary read mode. Let me know if this is not the case, and we will work on that.
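For anyone who wants to check that expectation before handing a file object over, here is a minimal sketch. `is_binary_readable` is a hypothetical helper, not part of SMV, and the call to `SmvApp.copyToHdfs` itself is left out because its signature is not spelled out in this thread.

```python
import io
import os
import tempfile


def is_binary_readable(fileobj):
    """Best-effort check that fileobj is readable and yields bytes, not str."""
    if isinstance(fileobj, io.TextIOBase):
        return False  # text-mode objects return str from read()
    readable = getattr(fileobj, "readable", None)
    if callable(readable) and not readable():
        return False  # e.g. opened with "wb"
    # read(0) consumes nothing; it only reveals the type read() returns.
    return isinstance(fileobj.read(0), bytes)


# Demo with a throwaway temp file standing in for an uploaded file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
    path = tmp.name

with open(path, "rb") as f:
    ok_binary = is_binary_readable(f)  # binary read mode
with open(path, "r") as f:
    ok_text = is_binary_readable(f)  # text mode

os.remove(path)
```

A check like this could run right before the copy, so a text-mode upload fails early with a clear message instead of somewhere on the Java side.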

Cheers!

Kai-Chen commented 6 years ago

@jacobdr ... since my mind is mentioned :)

A stream of bytes is probably as flexible as you can get with any API, so you can certainly upload an ORC or a Parquet file.

But conversion is a different matter -- not the kind of matter that I'd mind, though :)

-- Sorry about the bad puns ... gotta get that out of my system ... it's the holidays.