mubaldino closed this issue 8 years ago.
I did find these relatively exhaustive examples: http://www.programcreek.com/java-api-examples/index.php?api=org.apache.solr.core.CoreContainer In these otherwise great examples I don't see a .register(core) call (the code still appears to be 5.x or 6.x).
SolrResourceLoader() works in my mapper code now, but the final result is still elusive. This pointer is not exact, but it at least gets deep into the Solr 4.x code that seems relevant: https://github.com/apache/lucene-solr/blob/dfce8dd7600b69c2d7cd67422c4a32d6702aebe8/solr/core/src/java/org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.java#L118
I will look at using Solr 5.x with solrtexttagger as a next step. The 4.x release is a bit old now.
close, unless you think there is anything to add. thanks again.
Yeah, I'll close -- it's very unlikely I'll ever do what you call for at the outset of the issue -- provide an example of how to use the STT in a Map-Reduce job. Feel free to contribute this yourself some day :-)
EmbeddedSolrServer isn't going away, and it has in fact improved in ease of use over the years.
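(For what it's worth, a minimal sketch of that newer API, assuming Solr 5.1 or later; the class name, solr home path and core name "mycore" below are placeholders, not anything from this thread:)

import java.nio.file.Paths;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;

public class EmbeddedSolrQuickStart {
    public static void main(String[] args) throws Exception {
        // Solr 5.1+ constructor: point it at a solr home directory and a default
        // core name; no explicit CoreContainer/load()/register() dance needed.
        try (EmbeddedSolrServer solr =
                 new EmbeddedSolrServer(Paths.get("/path/to/solr/home"), "mycore")) {
            System.out.println("ping status: " + solr.ping().getStatus());
        }
    }
}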
I wanted to comment on what I did find, because I think it was unique and relevant to running SolrTextTagger (or any use of EmbeddedSolrServer) in Hadoop. After a week of researching Hadoop nonsense with Solr, it seems that running Solr purely inside a job is unusual -- every situation I have seen uses Solr to index data via HDFS. This tagger context raises some non-obvious issues with Solr and Hadoop jobs. The key points are rolled into the recipe below.
Altogether now:
# Context: My Mapper (or Reducer) is some SolrTextTagger app where I want to tag each
# record of data in my HDFS archive input. Alternatively, this could be a recipe for
# any situation where I want to use a Solr index as a query tool or a lookup, e.g.,
# Given data input X, do geospatial query against X and emit some result.
# It might not be obvious from the use case, but it bears noting that the Solr index
# used in these situations is strictly READ-ONLY; this is for tagging and query operations only.
# BASH SCRIPTING:
# Ahead of time: Copy all dependent JARs to ./jars/; Deconflict and remove any Hadoop libraries.
str=`ls jars/*jar`
JARS=`echo $str | sed -e 's: :,:g;'`
# JARS=jar1,jar2,jar3,...
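# Note: -D and -libjars are Hadoop "generic options"; the job's driver class has to
# run through ToolRunner/GenericOptionsParser for them to take effect.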
hadoop jar myJob.jar \
-Dmapreduce.job.classloader=true \
-Dmapreduce.map.java.opts="-Dsolr.solr.home=./mycores.zip -Xmx512m" \
-Dmapreduce.map.memory.mb=512 \
-Dmapreduce.job.cache.archives=hdfs:///some/safeplace/mycores.zip \
-libjars $JARS
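To round out the recipe, here is a rough sketch of what the Java side of the Mapper could look like against Solr 4.x. It is only a sketch under assumptions: the core name ("mycore"), the tagger handler path ("/tag"), and the parameter/response field names follow SolrTextTagger 2.x defaults, and the solr home comes from the -Dsolr.solr.home property set in the script above.

import java.io.IOException;
import java.util.Collection;
import java.util.Collections;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.ContentStream;
import org.apache.solr.common.util.ContentStreamBase;
import org.apache.solr.core.CoreContainer;

public class TaggerMapper extends Mapper<LongWritable, Text, Text, Text> {

  private CoreContainer cores;
  private EmbeddedSolrServer solr;

  @Override
  protected void setup(Context context) {
    // solr.solr.home points at ./mycores.zip, the symlink Hadoop creates for the
    // unpacked archive shipped via mapreduce.job.cache.archives (see script above).
    String solrHome = System.getProperty("solr.solr.home");
    cores = new CoreContainer(solrHome);            // Solr 4.x style; no register() needed
    cores.load();
    solr = new EmbeddedSolrServer(cores, "mycore"); // "mycore" is a placeholder
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    final String text = value.toString();

    // Tagger params -- names follow the SolrTextTagger 2.x docs; adjust to taste.
    SolrQuery params = new SolrQuery();
    params.set("qt", "/tag");            // route the request to the tagger handler
    params.set("overlaps", "NO_SUB");
    params.set("tagsLimit", "100");
    params.set("fl", "id,name");
    params.set("matchText", "false");

    // The tagger reads the text to tag from the request body, so supply it as a
    // content stream (QueryRequest normally sends none).
    QueryRequest req = new QueryRequest(params) {
      @Override
      public Collection<ContentStream> getContentStreams() {
        return Collections.<ContentStream>singletonList(
            new ContentStreamBase.StringStream(text));
      }
    };

    try {
      QueryResponse rsp = req.process(solr);
      // "tags" holds the offsets and matched doc ids; rsp.getResults() has the docs.
      Object tags = rsp.getResponse().get("tags");
      context.write(new Text(key.toString()), new Text(String.valueOf(tags)));
    } catch (Exception e) {
      throw new IOException("tagging failed", e);
    }
  }

  @Override
  protected void cleanup(Context context) {
    if (solr != null) solr.shutdown();   // 4.x API; use close() in later versions
    if (cores != null) cores.shutdown();
  }
}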
Hi David, v2.0 is still pumping away here at MITRE.
This is a request for an example of how to use STT in a read-only mode in a Hadoop Mapper or Spark situation. EmbeddedSolrServer is crucial there, as one would want to avoid the network I/O an HTTP Solr server would incur. However, EmbeddedSolrServer has proven impossible to get working in a simple Hadoop Mapper.
I wonder if you have encountered any requests to support SolrTextTagger in BigData environments using this approach? I did see Sujit Pal's post on his SODA work -- however, that appears to use a bank of RESTful instances of SolrTextTagger.
... The power we would have if we could deploy SolrTextTagger + EmbeddedSolrServer -- I fired off 1,000 mappers yesterday, each handling about 10 docs/sec (well, tweets). 10,000 tweets/sec would be good. But... the Solr mechanics in this situation are impenetrable.
This forum vs. Apache Solr: from the gist of "EmbeddedSolrServer" in the Solr camp, I sense it's not well supported or cared for, so I don't feel posting this as an issue there is worthwhile. The driving force would be SolrTextTagger + EmbeddedSolrServer + BigData scaling. Hence I'm here.
Marc