mubaldino closed this issue 8 years ago.
I did find these relatively exhaustive examples: http://www.programcreek.com/java-api-examples/index.php?api=org.apache.solr.core.CoreContainer In these otherwise great examples I don't see a .register(core) call (the code still appears to be 5.x or 6.x).
SolrResourceLoader() works in my mapper code now, but the final result is still elusive. This pointer is not exact, but it at least gets deep into the Solr 4.x code that seems relevant: https://github.com/apache/lucene-solr/blob/dfce8dd7600b69c2d7cd67422c4a32d6702aebe8/solr/core/src/java/org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.java#L118
I will look at using Solr 5.x with solrtexttagger as a next step. The 4.x release is a bit old now.
close, unless you think there is anything to add. thanks again.
Yeah, I'll close -- it's very unlikely I'll ever do what you call for at the outset of the issue -- provide an example of how to use the STT in a Map-Reduce job. Feel free to contribute this yourself some day :-)
EmbeddedSolrServer isn't going away, and it has in fact improved in ease of use over the years.
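(For what it's worth, a minimal sketch of that newer API, assuming Solr 5.1 or later; the class name, solr home path and core name "mycore" below are placeholders, not anything from this thread:)

import java.nio.file.Paths;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;

public class EmbeddedSolrQuickStart {
    public static void main(String[] args) throws Exception {
        // Solr 5.1+ constructor: point it at a solr home directory and a default
        // core name; no explicit CoreContainer/load()/register() dance needed.
        try (EmbeddedSolrServer solr =
                 new EmbeddedSolrServer(Paths.get("/path/to/solr/home"), "mycore")) {
            System.out.println("ping status: " + solr.ping().getStatus());
        }
    }
}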
I wanted to comment on what I did find, because I think it was unique and relevant to running SolrTextTagger (or any use of EmbeddedSolrServer) in Hadoop. After a week of researching Hadoop nonsense with Solr, it seems that running Solr purely inside a job is unusual -- every situation I have seen uses Solr to index data via HDFS. This tagger context raises some non-obvious issues with Solr and Hadoop jobs. The key points are rolled into the recipe below.
Altogether now:
# Context: My Mapper (or Reducer) is some SolrTextTagger app where I want to tag each
# record of data in my HDFS archive input. Alternatively, this could be a recipe for
# any situation where I want to use a Solr index as a query tool or a lookup, e.g.,
# Given data input X, do geospatial query against X and emit some result.
# It might not be obvious from the use case, but it bears noting that the Solr index
# used in these situations is strictly READ-ONLY; this is for tagging and query operations only.
# BASH SCRIPTING:
# Ahead of time: Copy all dependent JARs to ./jars/; Deconflict and remove any Hadoop libraries.
str=`ls jars/*jar`
JARS=`echo $str | sed -e 's: :,:g;'`
# JARS=jar1,jar2,jar3,...
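# Note: -D and -libjars are Hadoop "generic options"; the job's driver class has to
# run through ToolRunner/GenericOptionsParser for them to take effect.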
hadoop jar myJob.jar \
-Dmapreduce.job.classloader=true \
-Dmapreduce.map.java.opts="-Dsolr.solr.home=./mycores.zip -Xmx512m" \
-Dmapreduce.map.memory.mb=512 \
-Dmapreduce.job.cache.archives=hdfs:///some/safeplace/mycores.zip \
-libjars $JARS
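To round out the recipe, here is a rough sketch of what the Java side of the Mapper could look like against Solr 4.x. It is only a sketch under assumptions: the core name ("mycore"), the tagger handler path ("/tag"), and the parameter/response field names follow SolrTextTagger 2.x defaults, and the solr home comes from the -Dsolr.solr.home property set in the script above.

import java.io.IOException;
import java.util.Collection;
import java.util.Collections;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.ContentStream;
import org.apache.solr.common.util.ContentStreamBase;
import org.apache.solr.core.CoreContainer;

public class TaggerMapper extends Mapper<LongWritable, Text, Text, Text> {

  private CoreContainer cores;
  private EmbeddedSolrServer solr;

  @Override
  protected void setup(Context context) {
    // solr.solr.home points at ./mycores.zip, the symlink Hadoop creates for the
    // unpacked archive shipped via mapreduce.job.cache.archives (see script above).
    String solrHome = System.getProperty("solr.solr.home");
    cores = new CoreContainer(solrHome);            // Solr 4.x style; no register() needed
    cores.load();
    solr = new EmbeddedSolrServer(cores, "mycore"); // "mycore" is a placeholder
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    final String text = value.toString();

    // Tagger params -- names follow the SolrTextTagger 2.x docs; adjust to taste.
    SolrQuery params = new SolrQuery();
    params.set("qt", "/tag");            // route the request to the tagger handler
    params.set("overlaps", "NO_SUB");
    params.set("tagsLimit", "100");
    params.set("fl", "id,name");
    params.set("matchText", "false");

    // The tagger reads the text to tag from the request body, so supply it as a
    // content stream (QueryRequest normally sends none).
    QueryRequest req = new QueryRequest(params) {
      @Override
      public Collection<ContentStream> getContentStreams() {
        return Collections.<ContentStream>singletonList(
            new ContentStreamBase.StringStream(text));
      }
    };

    try {
      QueryResponse rsp = req.process(solr);
      // "tags" holds the offsets and matched doc ids; rsp.getResults() has the docs.
      Object tags = rsp.getResponse().get("tags");
      context.write(new Text(key.toString()), new Text(String.valueOf(tags)));
    } catch (Exception e) {
      throw new IOException("tagging failed", e);
    }
  }

  @Override
  protected void cleanup(Context context) {
    if (solr != null) solr.shutdown();   // 4.x API; use close() in later versions
    if (cores != null) cores.shutdown();
  }
}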
Hi David, v2.0 is still pumping away here at MITRE.
This is a request for an example of how to use STT in a read-only mode in a Hadoop Mapper or Spark situation. EmbeddedSolrServer is crucial there, as one would want to avoid the network I/O an HTTP Solr server would incur. However, EmbeddedSolrServer has proven impossible to get working in a simple Hadoop Mapper.
I wonder if you have encountered any requests to support SolrTextTagger in BigData environments using this approach? I did see Sujit Pal's post on his SODA work -- however, that appears to use a bank of RESTful instances of SolrTextTagger.
... The power we would have if we could deploy SolrTextTagger + EmbeddedSolrServer -- I fired off 1,000 mappers yesterday, each handling about 10 docs/sec (well, tweets). 10,000 tweets/sec would be good. But... the Solr mechanics in this situation are impenetrable.
This forum vs. Apache Solr: from the gist of "EmbeddedSolrServer" in the Solr camp, I sense it's not well supported or cared for, so I don't feel posting this as an issue there is worthwhile. The driving force would be SolrTextTagger + EmbeddedSolrServer + BigData scaling. Hence I'm here.
Marc