Closed tomerk closed 9 years ago
@shivaram mentioned that it's fine to use Spark to read from & write to a filesystem, but he recommended using a local Spark context that connects to the filesystem on the remote Spark cluster, rather than connecting a Spark context to the remote Spark cluster itself. This should solve the issue because we wouldn't need to ship any jars.
BTW you can also just try to use the FileSystem API (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html) -- Is that not enough for some of your use cases ?
We were trying to use that before (at least I think that's what Dan was using), but there was much more effort required to get it configured and working correctly when connecting to a spark ec2 cluster.
To make it connect to / work with the Spark EC2 cluster you can do something like this

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Create a config object that loads the same core-site.xml / hdfs-site.xml as Spark
val config = new Configuration(true)
// Call config.addResource() if additional config files are needed
val fs = FileSystem.get(config)
Yeah we did that, but it was trying to connect to the private EC2 DNS of the data nodes, which wasn't working
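For anyone hitting the private-DNS problem described above: one commonly suggested workaround (an assumption here, not something verified in this thread) is to tell the HDFS client to contact datanodes by hostname instead of the private IPs the namenode advertises, using Hadoop's `dfs.client.use.datanode.hostname` property. A client-side hdfs-site.xml fragment would look roughly like:

```xml
<configuration>
  <!-- Ask the HDFS client to dial datanodes by their (publicly resolvable)
       hostnames rather than the private EC2 IPs the namenode reports. -->
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
</configuration>
```

The same flag can be set programmatically with `config.setBoolean("dfs.client.use.datanode.hostname", true)` before calling `FileSystem.get(config)`; note it only helps if the datanode hostnames actually resolve from outside EC2.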
Security groups and/or VPN setup? These are the conventional ways I've seen this handled.
The security groups were sufficiently open. I could probably figure out a configuration that works, but right now we're focusing on getting Velox to work out of the box with little configuration, and going through a Spark context makes that a bit simpler.
Yeah, I was using the FileSystem API. It worked fine when Velox was running on EC2 and could resolve AWS private IP addresses, but when Velox runs outside of EC2 (e.g. on your laptop) we ran into issues. I'm assuming there is a fix, but it's not the highest priority right now. The other advantage of using Spark is that Spark has already done the work of talking to multiple versions of HDFS; we would have to replicate that work in Velox or support only a single version of Hadoop.
Closed by issue #51
When treating Spark as a general way to read from & write to file systems like HDFS, S3, etc. (e.g. writing observations & reading user weights after a retrain), avoid shipping the JAR! This may not be naively possible in some cases (e.g. user-defined contexts).
Longer-term, this would be fixed by issue #48, and by writing to/from HDFS & other filesystems directly, depending on how closely we decide to tie Velox to specific Spark cluster configurations & file destinations.