Closed: jim2 closed this issue 7 years ago
For Spark, we upload the bfmap file to HDFS. This allows the Spark executors to read the map from HDFS, which should be better than having them all connect to the PostgreSQL database at the same time.
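For completeness, the upload itself is just a plain HDFS copy; here is a minimal sketch (the paths and object name are placeholders, and `hdfs dfs -put` from the shell does the same thing):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object UploadBfmap {
  def main(args: Array[String]): Unit = {
    // Picks up fs.defaultFS from the Hadoop configuration on the classpath.
    val fs = FileSystem.get(new Configuration())

    // Placeholder paths: the local bfmap file produced from the PostgreSQL export,
    // copied to an HDFS location that the Spark executors can read.
    fs.copyFromLocalFile(new Path("/vagrant/map.bfmap"), new Path("/data/map.bfmap"))
    fs.close()
  }
}
```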
To create a bfmap file from the PostgreSQL database, use this code: https://gist.github.com/jongiddy/67c7ace4e7394e1e5f3bea978ddf74ec (this is set to run inside a Vagrant virtual machine, but changing the hardwired /vagrant paths will make it suitable for other environments).
To read the bfmap file from HDFS, we created a HadoopMapReader: https://gist.github.com/jongiddy/b68be517274a424df84d2bea4cdd6354
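The gist has the real code; the rough idea is simply to open the file through Hadoop's FileSystem API instead of the local filesystem, along these lines (the object and method names here are illustrative, not the gist's actual API):

```scala
import java.io.DataInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative sketch: open the bfmap file from HDFS and hand the stream to
// whatever deserialization the map format needs (the real HadoopMapReader does more).
object HadoopMapReaderSketch {
  def open(hdfsPath: String): DataInputStream = {
    val fs = FileSystem.get(new Configuration())
    fs.open(new Path(hdfsPath)) // FSDataInputStream extends DataInputStream
  }
}
```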
And our BroadcastMatcher then looks like this (although I have edited out some application-specific code): https://gist.github.com/jongiddy/286857e09f9881854a725634ca82b515
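For anyone who only wants the shape of it: the pattern broadcasts something small (here, the HDFS path) and builds the heavy matcher lazily, once per executor JVM. This sketch uses placeholder types (`Matcher`, `buildMatcher`) instead of the application-specific code in the gist:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Placeholder for the application-specific map-matching object built from the bfmap data.
trait Matcher extends Serializable

class BroadcastMatcherSketch(sc: SparkContext, hdfsPath: String) extends Serializable {
  // Only the (small) HDFS path is broadcast; the map itself is loaded lazily on each executor.
  private val pathBc: Broadcast[String] = sc.broadcast(hdfsPath)

  def matcher: Matcher = BroadcastMatcherSketch.instance(pathBc.value)
}

object BroadcastMatcherSketch {
  @volatile private var cached: Matcher = _

  // Double-checked locking so the bfmap is read from HDFS at most once per executor JVM.
  def instance(hdfsPath: String): Matcher = {
    if (cached == null) {
      synchronized {
        if (cached == null) {
          cached = buildMatcher(hdfsPath)
        }
      }
    }
    cached
  }

  // Placeholder: read the bfmap from HDFS (e.g. via the reader above) and construct the matcher.
  private def buildMatcher(hdfsPath: String): Matcher = ???
}
```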
@jongiddy Thanks for sharing that! BTW: ... double-checked locking, you could have let me know. ;)
@smattheis No worries! To be clear, we never saw a problem caused by the locking. I think I added that while debugging a thread-safety issue that actually occurred in a part of my code.
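For what it's worth, Scala's `lazy val` gives the same thread-safe, once-per-JVM initialization without hand-written locking, so the pattern could be reduced to something like this (just a sketch, with a placeholder loader):

```scala
object MatcherHolder {
  // Placeholder: in our case this would read the bfmap from HDFS and build the matcher.
  private def load(): AnyRef = ???

  // Initialized at most once per JVM; the compiler emits the synchronization for us.
  lazy val matcher: AnyRef = load()
}
```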
@jongiddy @smattheis thanks guys this is awesome
@jim2 I'm glad it helps. If you have any more questions, feel free to keep this issue open or open another one. Once the question/problem regarding Spark is resolved for you, please close the issue.
@jongiddy Alright. Anyway, your reference in the comment explains that it's a code smell. I never knew that.
I'm closing this as it seemed to be resolved.
Hi there - I have the test working with the Docker container and the Java server/Python script. My next step is to get it working with Spark. Do you have a working example of BroadcastMatcher.scala? I'm trying to get things running with your sample Scala code block, but I think I need to import a few classes, and I want to make sure I'm referencing the right config file, etc. It looks like the broadcast is pointing to the PostGIS db to pull the OSM data to each node in my Spark cluster?
Thanks for any help/advice. Great project!