amplab / velox-modelserver

http://amplab.github.io/velox-modelserver/
Apache License 2.0

Avoid shipping jar when using spark as a filesystem #49

Closed tomerk closed 9 years ago

tomerk commented 9 years ago

When treating Spark as a general file-system interface for reading from and writing to HDFS, S3, etc. (e.g. writing observations and reading user weights after a retrain), avoid shipping the JAR. This may not be naively possible in some cases (e.g. user-defined contexts).

Longer-term, this would be fixed by issue #48, and by writing to/from HDFS and other filesystems directly, depending on how closely we decide to tie Velox to specific Spark cluster configurations and file destinations.

tomerk commented 9 years ago

@shivaram mentioned that it's fine to use Spark to read from and write to a filesystem, but he recommended using a local Spark context that connects to the filesystem on the remote Spark cluster, rather than connecting a Spark context to the remote cluster itself. This should solve the issue, because we wouldn't need to ship any JARs.
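That local-context approach could be sketched roughly as follows; the paths and namenode address are placeholders for illustration, not values from this discussion:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Run Spark locally: only the filesystem (HDFS, S3, ...) is remote,
// so no application JAR has to be shipped to a cluster.
val conf = new SparkConf().setMaster("local[*]").setAppName("velox-io")
val sc = new SparkContext(conf)

// Hypothetical paths: read user weights produced by a retrain,
// then write observations back out for the next retrain to consume.
val weights = sc.textFile("hdfs://namenode:9000/velox/user-weights/")
weights.saveAsTextFile("s3n://some-bucket/velox/observations/")

sc.stop()
```

The trade-off is that reads and writes run on a single machine rather than in parallel across the cluster, which is fine for small artifacts like user weights.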

shivaram commented 9 years ago

BTW you can also just try to use the FileSystem API (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html) -- is that not enough for some of your use cases?

tomerk commented 9 years ago

We were trying to use that before (at least I think that's what Dan was using), but it took much more effort to get it configured and working correctly when connecting to a Spark EC2 cluster.


shivaram commented 9 years ago

To make it connect to / work with the Spark EC2 cluster you can do something like this:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Create a config object that loads the same core-site.xml / hdfs-site.xml as Spark
val config = new Configuration(true)
// Call config.addResource() if required
val fs = FileSystem.get(config)
```

tomerk commented 9 years ago

Yeah, we did that, but it was trying to connect to the private EC2 DNS names of the data nodes, which wasn't working.


etrain commented 9 years ago

Security groups and/or VPN setup? These are the conventional ways I've seen this handled.


tomerk commented 9 years ago

The security groups were sufficiently open. I could probably figure out some configuration that works, but right now we're focusing on getting Velox to work out of the box with little configuration, and going through a Spark context makes that a little simpler.


dcrankshaw commented 9 years ago

Yeah, I was using the FileSystem API. It worked fine when Velox was running on EC2 and could resolve AWS private IP addresses, but when Velox runs outside of EC2 (e.g. on your laptop) we were running into issues. I'm assuming there is a fix, but it's not the highest priority right now. The other advantage of using Spark is that Spark has already done the work of talking to multiple versions of HDFS; we would have to replicate that work in Velox or support only a single version of Hadoop.
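For reference, one configuration-level workaround for the private-address problem (not tried in this thread; the namenode address below is a placeholder) is to point the client at the namenode's public hostname and ask the HDFS client to contact datanodes by hostname rather than by the private IPs the namenode reports:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val config = new Configuration(true)
// Placeholder public address for the namenode
config.set("fs.defaultFS", "hdfs://namenode-public-hostname:9000")
// Resolve datanodes via hostname instead of private IP, so clients
// running outside EC2 can reach them
config.setBoolean("dfs.client.use.datanode.hostname", true)
val fs = FileSystem.get(config)
```

This only helps if the datanode hostnames are publicly resolvable and the relevant ports are open in the security groups.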

tomerk commented 9 years ago

Closed by issue #51