Open rzezeski opened 12 years ago
Here are a couple options to branch extraction versus Solr Cell:
X-Riak-Extractor:solrcell
literal.<fieldname>=<value>
parameters, and if they exist, use solrcell
I tend to like the first option best from a usability standpoint, though I think that whether 1, 2, or an alternative is chosen, allowing the literal params will still be required.
The problem with a special mime-type is that it obfuscates the real mime-type of the data. If someone stores a PDF in Riak they want to store it with content-type of 'application/pdf' (and its variations). Specifying the extractor might work but that would require a change to KV and what does it mean to have an extractor for Solr Cell? Solr Cell is the extractor, so why even go through the extraction process at all?
At the moment I'm thinking Solr Cell is a fairly different path in
Yokozuna. As a strawman, yz_kv would somehow know the incoming
object needs to be indexed by Solr Cell. It would need to create the
special fields and pass them via literal fields. Then it would make a
call to the Solr Cell resource. Once again, I think the key is
determining when a request should go to Solr Cell that is the most
obvious and friendly for users. I think something with the
content-type is probably the right start. I'm just curious about
cases where a user writes a PDF but doesn't want it processed by Solr
Cell. Perhaps that just means changing the mapping to noop for that
content-type. But the "mapping" is content-type => extractor. But
extractor only extracts fields. It's not a high-enough abstraction to
deal with Solr Cell. Perhaps part of extraction is returning how
Yokozuna should index it. But perhaps that goes a bit beyond the idea
of extraction. Not sure, just thinking out loud.
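As a rough sketch of the strawman above, the "create the special fields and pass them via literal fields" step could amount to building a request URL for the Solr Cell resource. Everything named here is an assumption for illustration: the helper `solr_cell_url`, the base URL, the index name, the field names, and the `/update/extract` path (the conventional mapping for ExtractingRequestHandler, which a given Solr config may change).

```python
from urllib.parse import urlencode


def solr_cell_url(base_url, index, literals):
    """Build a Solr Cell request URL that carries literal.<fieldname>=<value> params.

    `literals` maps field names to values that should be added to the
    Solr document created on the Solr side.
    """
    # Sort so the resulting URL is deterministic.
    params = [("literal." + name, value) for name, value in sorted(literals.items())]
    return f"{base_url}/{index}/update/extract?{urlencode(params)}"


# Hypothetical special fields identifying the Riak object.
url = solr_cell_url(
    "http://localhost:8983/solr",
    "my_index",
    {"_yz_rb": "docs", "_yz_rk": "report.pdf"},
)
print(url)
```

The object's raw bytes would then be sent as the body of a request to that URL, with the object's real content-type, so the Solr doc is created by Solr rather than by Yokozuna.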
I didn't say a "special mime-type". I said register a mime-type to use Solr Cell. From an interface PoV, the end user wouldn't know or care what form of extraction was set up for them. The operator sets up "application/pdf" to use Solr Cell, and the user just uploads a PDF.
I agree that Solr Cell is functionally a different path. But I'm not sure that, beyond registering a mime-type to use solr cell (and possibly some other configs at the moment of registration), the end user needs to know the details of how it's being indexed.
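To make the registration idea concrete, here is a minimal sketch of a content-type to index-path mapping. It is not Yokozuna's actual API; the class `IndexPathRegistry` and the path names `extract` and `solr_cell` are hypothetical. The operator registers a mime-type once, and the lookup at write time sees only the object's real content-type.

```python
EXTRACT = "extract"      # hypothetical name for the current extractor path
SOLR_CELL = "solr_cell"  # hypothetical name for the Solr Cell path


class IndexPathRegistry:
    """Maps a content-type to the index path that should handle it."""

    def __init__(self, default=EXTRACT):
        self._paths = {}
        self._default = default

    def register(self, content_type, path):
        self._paths[self._normalize(content_type)] = path

    def lookup(self, content_type):
        # Unregistered content-types fall back to the default path.
        return self._paths.get(self._normalize(content_type), self._default)

    @staticmethod
    def _normalize(content_type):
        # Drop parameters such as "; charset=utf-8" and ignore case.
        return content_type.split(";", 1)[0].strip().lower()


# Operator-side setup, done once.
registry = IndexPathRegistry()
registry.register("application/pdf", SOLR_CELL)

# Write-time decision, driven only by the stored content-type.
print(registry.lookup("application/pdf"))   # solr_cell
print(registry.lookup("application/json"))  # extract
```

A user who wants a PDF stored but not processed by Solr Cell would need the operator to register a different path for that content-type, which is the trade-off raised earlier in the thread.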
I like @coderoshi's option 2.
pass in a special header that says to use solr cell, eg. X-Riak-Extractor:solrcell
use a special riak header as a flag.
Comment for Jira.
Solr Cell is the integration of Apache Tika with Solr. It allows Solr to index rich document formats such as HTML, PDF, and Microsoft Office documents. It does this by providing a request handler resource called ExtractingRequestHandler. It takes a document as input, feeds it to Tika, and uses SAX to produce events that are turned into a Solr document suitable for indexing. Solr provides several URL parameters to control this process such as
literal.<fieldname>=<value>
which allows a field-value to be added to the created Solr document.
It's not immediately obvious how to integrate Solr Cell. It's different from the current index path in Yokozuna. A different HTTP resource is used and Solr doc creation happens on the Solr side, not in Yokozuna. But Yokozuna currently assumes that all data must be extracted, turned into a Solr doc, and then sent as an update message. Perhaps this is a sign that the extractor abstraction is too narrow for all use cases? The main question to answer is: