ScaleUnlimited / cascading.solr

Cascading scheme for Solr

Add support to read Solr schema from S3 #8

Open erasmas opened 10 years ago

erasmas commented 10 years ago

In cascading.solr 2.5.0 the Solr schema must be read from the local filesystem. It would be convenient to also be able to read the Solr schema from S3 and HDFS; this would allow Cascalog/Cascading jobs to be executed on Amazon EMR.

I have the following use case: I'm trying to submit a job to EMR where the input data is read from S3 and the output is a Solr index written to an S3 bucket. But due to the cascading.solr limitation that the schema must be stored on the local FS, this is not possible at the moment. It's still doable, since I can bootstrap an EMR cluster, copy the schema onto the cluster, and finally execute my job, but I plan to work on this since it would greatly simplify the process. Let me know about any concerns you might have related to this issue. Thanks!

erasmas commented 10 years ago

Guys, looking into the code I just realized that the Solr core directory is copied from the local FS to a temp directory on HDFS. Was there some special reason for that, instead of reading it directly from HDFS?
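To make the question concrete, the staging step amounts to a recursive copy of the core directory into a temp location. The real code uses Hadoop's `FileSystem` API against HDFS; the sketch below stands in with plain `java.nio` local copies so it is self-contained, and the class and method names are my own, not the library's.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class StageSolrCore {

    // Sketch of the staging step: recursively copy a local Solr core
    // directory (conf/, schema.xml, etc.) under a temp "shared" root,
    // the way cascading.solr stages it onto HDFS. Hypothetical helper;
    // java.nio stands in for Hadoop's FileSystem/copyFromLocalFile.
    static Path stage(Path localCoreDir, Path tmpRoot) throws IOException {
        Path target = tmpRoot.resolve(localCoreDir.getFileName());
        try (Stream<Path> paths = Files.walk(localCoreDir)) {
            for (Path src : (Iterable<Path>) paths::iterator) {
                Path dst = target.resolve(localCoreDir.relativize(src).toString());
                if (Files.isDirectory(src)) {
                    Files.createDirectories(dst);
                } else {
                    Files.createDirectories(dst.getParent());
                    Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
        return target;
    }
}
```

Reading the core directly from HDFS (or S3) would replace this copy with opening the files in place through the corresponding `FileSystem` implementation.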

kkrugler commented 10 years ago

Because usually when you run a job, the schema is coupled to the workflow, so keeping them together (the workflow starts locally) is appropriate. And we need the schema locally to be able to instantiate the schema fields, and thus provide the sink fields that Cascading needs to validate the workflow.
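The "instantiate the schema fields" step boils down to pulling the `<field name="...">` entries out of schema.xml so they can be handed to Cascading as sink `Fields`. A minimal stdlib-only sketch (class and method names are mine, not the library's; the real scheme uses Solr's own schema classes rather than raw XML parsing):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SchemaFields {

    // Extract the <field name="..."> entries from a Solr schema.xml.
    // These names are what a scheme would wrap in a Cascading Fields
    // instance so the planner can validate the sink at workflow setup.
    static List<String> fieldNames(String schemaXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        schemaXml.getBytes(StandardCharsets.UTF_8)));
        NodeList fields = doc.getElementsByTagName("field");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < fields.getLength(); i++) {
            names.add(((Element) fields.item(i)).getAttribute("name"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String schema =
            "<schema name=\"example\" version=\"1.5\">"
            + "<field name=\"id\" type=\"string\" stored=\"true\"/>"
            + "<field name=\"title\" type=\"text_general\" stored=\"true\"/>"
            + "</schema>";
        System.out.println(fieldNames(schema)); // [id, title]
    }
}
```

The point is that this parse has to happen on the client before the job is submitted, which is why the schema needs to be readable from wherever the workflow is launched.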

But yes, having an option to read from HDFS would be useful for some situations.

For EMR, we have a custom bootstrap action that loads the Solr schema so you can run the workflow.