basho / yokozuna

Riak + Solr

Solr Cell Integration [JIRA: RIAK-1714] #17

Open rzezeski opened 12 years ago

rzezeski commented 12 years ago

Solr Cell is the integration of Apache Tika with Solr. It allows Solr to index rich document formats such as HTML, PDF, and Microsoft Office documents. It does this by providing a request handler called ExtractingRequestHandler, which takes a document as input, feeds it to Tika, and turns the resulting SAX events into a Solr document suitable for indexing. Solr provides several URL parameters to control this process, such as literal.<fieldname>=<value>, which adds a field-value pair to the created Solr document.
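To make the literal.<fieldname>=<value> parameters concrete, here is a minimal Python sketch that builds a request URL for the extracting handler. The base URL, the index name "docs", and the helper name solr_cell_url are assumptions for illustration; update/extract is the path the handler is conventionally mapped to in Solr's example configuration, and a given install may differ.

```python
from urllib.parse import urlencode


def solr_cell_url(base_url, index, literals, handler="update/extract"):
    """Build a URL for the extracting request handler.

    Each entry in `literals` becomes a literal.<fieldname>=<value>
    query parameter. The handler path is an assumed default.
    """
    # Sort so the resulting URL is deterministic.
    params = [(f"literal.{name}", value) for name, value in sorted(literals.items())]
    return f"{base_url.rstrip('/')}/{index}/{handler}?{urlencode(params)}"


# Hypothetical host, index, and field names.
url = solr_cell_url(
    "http://localhost:8983/solr",
    "docs",
    {"id": "doc1", "owner": "alice"},
)
print(url)
```

The document body itself (the PDF, for example) would be sent as the body of the POST to this URL; only the extra fields travel as query parameters.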

It's not immediately obvious how to integrate Solr Cell. It's different from the current index path in Yokozuna. A different HTTP resource is used and Solr doc creation happens on the Solr side, not in Yokozuna. But Yokozuna currently assumes that all data must be extracted, turned into a Solr doc, and then sent as an update message. Perhaps this is a sign that the extractor abstraction is too narrow for all use cases? The main question to answer is:

How should Yokozuna differentiate between data that should go through the extraction process versus data that should be passed to Solr Cell?

coderoshi commented 12 years ago

Here are a few options for choosing between extraction and Solr Cell:

  1. Register a mime type to Solr Cell, similar to how extractors are registered to mime types.
  2. Pass in a special header that says to use Solr Cell, e.g. X-Riak-Extractor:solrcell.
  3. Allow passing in the literal.<fieldname>=<value> parameters and, if they exist, use Solr Cell.

I tend to like the first option best from a usability standpoint, though I think that whether 1, 2, or an alternative is chosen, allowing the literal params will still be required.
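The three options can be sketched side by side in Python. This is only an illustration, not Yokozuna code: the function name choose_index_path, the registry dict, and the return values "solrcell" and "extract" are all assumptions, and a real design would probably pick one trigger rather than check all three.

```python
SOLR_CELL = "solrcell"


def choose_index_path(content_type, headers, params, registry):
    """Return "solrcell" or "extract" for an incoming object.

    registry maps a media type to an extractor name (hypothetical shape).
    """
    # Option 2: an explicit header such as X-Riak-Extractor:solrcell.
    lowered = {name.lower(): value for name, value in headers.items()}
    if lowered.get("x-riak-extractor", "").strip().lower() == SOLR_CELL:
        return "solrcell"

    # Option 3: the presence of any literal.<fieldname> parameter.
    if any(name.startswith("literal.") for name in params):
        return "solrcell"

    # Option 1: the media type is registered to Solr Cell.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if registry.get(media_type) == SOLR_CELL:
        return "solrcell"

    return "extract"


# Hypothetical registry: PDFs go to Solr Cell, JSON to a regular extractor.
registry = {"application/pdf": SOLR_CELL, "application/json": "json_extractor"}
print(choose_index_path("application/pdf", {}, {}, registry))
print(choose_index_path("application/json", {}, {}, registry))
```

With option 1 the stored content-type stays the real one (application/pdf); only the registry decides that it is handled by Solr Cell.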

rzezeski commented 12 years ago

The problem with a special mime-type is that it obfuscates the real mime-type of the data. If someone stores a PDF in Riak they want to store it with a content-type of 'application/pdf' (and its variations). Specifying the extractor might work, but that would require a change to KV. And what does it mean to have an extractor for Solr Cell? Solr Cell is the extractor, so why even go through the extraction process at all?

At the moment I'm thinking Solr Cell is a fairly different path in Yokozuna. As a strawman, yz_kv would somehow know the incoming object needs to be indexed by Solr Cell. It would need to create the special fields and pass them via literal fields. Then it would make a call to the Solr Cell resource. Once again, I think the key is determining when a request should go to Solr Cell in the way that is most obvious and friendly for users. I think something with the content-type is probably the right start. I'm just curious about cases where a user writes a PDF but doesn't want it processed by Solr Cell. Perhaps that just means changing the mapping to noop for that content-type. But the "mapping" is content-type => extractor, and an extractor only extracts fields. It's not a high-enough abstraction to deal with Solr Cell. Perhaps part of extraction is returning how Yokozuna should index it. But perhaps that goes a bit beyond the idea of extraction. Not sure, just thinking out loud.
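One way to picture "extraction returns how Yokozuna should index it" is a mapping from content-type to a function that returns an indexing plan instead of just fields. The Python sketch below is purely hypothetical: IndexPlan, the planner functions, and the field names riak_bucket and riak_key are placeholders, not Yokozuna's actual names.

```python
from dataclasses import dataclass, field


@dataclass
class IndexPlan:
    path: str                       # "update", "solrcell", or "noop"
    fields: dict = field(default_factory=dict)


def plan_text(bucket, key, value):
    # Regular path: fields are extracted here and sent as an update message.
    return IndexPlan("update", {"riak_bucket": bucket, "riak_key": key,
                                "text": value.decode("utf-8")})


def plan_solr_cell(bucket, key, value):
    # Solr Cell path: Solr extracts the content, so only the special
    # fields are prepared, as literal.<fieldname> parameters.
    return IndexPlan("solrcell", {"literal.riak_bucket": bucket,
                                  "literal.riak_key": key})


def plan_noop(bucket, key, value):
    # Opt-out: the object is stored but not indexed.
    return IndexPlan("noop")


PLANNERS = {"text/plain": plan_text, "application/pdf": plan_solr_cell}


def plan_for(content_type, bucket, key, value, planners=PLANNERS):
    media_type = content_type.split(";", 1)[0].strip().lower()
    return planners.get(media_type, plan_noop)(bucket, key, value)


print(plan_for("application/pdf", "reports", "q3", b"%PDF-..."))
```

A user who writes PDFs but does not want Solr Cell would, under this sketch, pass a mapping such as {**PLANNERS, "application/pdf": plan_noop}.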

coderoshi commented 12 years ago

I didn't say a "special mime-type". I said register a mime-type to use Solr Cell. From an interface PoV, the end user wouldn't know or care what form of extraction was set up for them. The operator sets up "application/pdf" to use Solr Cell, and the user just uploads a PDF.

I agree that Solr Cell is functionally a different path. But I'm not sure that, beyond registering a mime-type to use Solr Cell (and possibly some other configs at the moment of registration), the end user needs to know the details of how it's being indexed.

siculars commented 12 years ago

I like @coderoshi's option 2.

Pass in a special header that says to use Solr Cell, e.g. X-Riak-Extractor:solrcell.

Use a special Riak header as a flag.

DSomogyi commented 9 years ago

Comment for Jira.