basho / yokozuna

Riak + Solr

Multilingual text indexation [JIRA: RIAK-2439] #620

Open Guibod opened 8 years ago

Guibod commented 8 years ago

Hi guys,

Is there a proper way to define an index with language-specific stemming and tokenization on a single field in a single index?

I'm struggling to find a proper solution that is Riak compatible, but nothing seems clear to me.

Here is a copy of my Stack Overflow question:

I want to store multilingual text in Riak (English, French, and Spanish for illustration, but there are many more languages), and I want to use Riak Search to help with grouping, stemming, and tokenizing the text values.

In my schema.xml I have:

<field name="text" type="string" indexed="true" stored="true" multiValued="false"/>

And:

<fieldType name="text_en" class="solr.TextField" />
<fieldType name="text_es" class="solr.TextField" />
<fieldType name="text_fr" class="solr.TextField" />

Each fieldType enables language-specific optimisation. There is no DynamicFieldType in Solr, as stated in this other Stack Overflow question: http://stackoverflow.com/questions/23747373/solr-dynamic-field-types

As suggested above I have three solutions:

Separate field

This would force me to store each language's text in a different field of my Riak document. That does not scale to 20 or more languages.

    <field name="text_en" type="text_en" indexed="true" stored="true" multiValued="false"/>
    <field name="text_es" type="text_es" indexed="true" stored="true" multiValued="false"/>
    <field name="text_fr" type="text_fr" indexed="true" stored="true" multiValued="false"/>
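
To keep a single source text per document while still using per-language fields, the field name can be chosen at write time. This is only a minimal sketch: `to_indexable` and its `supported` tuple are hypothetical names, and it assumes the JSON keys of the stored document map directly to the Solr field names declared above.

```python
def to_indexable(text, lang, supported=("en", "es", "fr")):
    """Return a dict that stores `text` under a language-specific key.

    Assumption: a key such as "text_fr" is indexed into the Solr field
    of the same name, which is analyzed with the matching fieldType.
    """
    if lang not in supported:
        raise ValueError("no analyzer configured for language: %r" % lang)
    return {"text_" + lang: text}


# One document carries exactly one language-specific field.
doc = to_indexable("Bonjour le monde", "fr")
```

Adding a language then means adding one field declaration to the schema and one entry to `supported`, rather than changing the shape of every stored document.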

Separate indexes

That's pretty simple: I can configure each Solr index for a given language and keep only one field. It's an interesting solution since it gives me language sharding, which is pretty convenient for maintenance.

BUT that implies that I can no longer search across multiple languages, since I can't find a multi-index search feature in my Python library or in the documentation.

Custom code

I don't fully understand this one; most probably it means writing my own Java class to handle my case. That's clearly NOT my preference.

Is there another way around this problem?

zeeshanlakhani commented 8 years ago

I'd suggest taking the separate field approach @Guibod, but making sure to distribute that search index across various types/buckets. Did you try that and run into a bottleneck after 20 langs? In indexing or in querying? Something like http://pavelbogomolenko.github.io/multi-language-handling-in-solr.html could be a starting point, and we can discuss how best to tune your configuration to your needs/expectations.

Guibod commented 8 years ago

Thanks @zeeshanlakhani, my only issue with sharding per language is that, for the time being, I don't know how to search across multiple indexes. I'm pretty new to Solr, and rely a lot on the Python library at the moment. I can pretty easily store data in separate buckets/bucket types/indexes, and each of them can be fine-tuned with a proper analyzer. But I don't know how to query across multiple indexes.

See: http://basho.github.io/riak-python-client/query.html#querying-an-index

# Python API explicitly requires ONE index
results = bucket.search("counter:[10 TO *]", index='website',
                        sort="counter desc", rows=5)

Should I use map/reduce on the search results? If so, how can I do that? Or should I extend the current API with some Solr magic trick such as a multiple-index query?

zeeshanlakhani commented 8 years ago

@Guibod you can't do multi-index search w/ riak search, but I was wondering what your bottlenecks would look like using one search_index/core, but creating a bucket-type per lang (associating each bucket-type w/ the one search_index).
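
With one search_index that holds a separate field per language, a multilingual search is a single query that ORs the same term across those fields. A rough sketch, assuming fields named `text_<lang>` as in the schema snippet earlier in the thread; `build_multilang_query` is a hypothetical helper, not part of the Riak Python client.

```python
def build_multilang_query(term, langs):
    """Build a Solr query matching `term` in every per-language field.

    Assumption: the index declares one field per language, named
    text_<lang> (text_en, text_es, text_fr, ...).
    """
    # Escape backslashes and double quotes so the term stays a valid
    # quoted phrase in Solr query syntax.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return " OR ".join('text_%s:"%s"' % (lang, escaped) for lang in langs)


query = build_multilang_query("maison", ["en", "es", "fr"])
```

The resulting string can be passed as the first argument to `bucket.search(...)` together with the single `index` name, so sorting and paging stay on the Solr side.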

Guibod commented 8 years ago

The main issue is that I would be stuck with monolingual search. I want to set up proper indexing per language (using string_en, string_fr), and then allow multilingual search.

Most of the time I will aggregate data for data visualisation, and I can map/reduce results from Riak into a proper aggregation by my own means. But in some cases I'll need to show ordered content in detail. That would be really painful to code in my API: I'd gladly rely on cross-index search rather than searching individual indexes and sorting the results myself.
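
For reference, the client-side merge I would otherwise have to write looks roughly like this. It is only a sketch: `search_all` and the `search_fn` callable are hypothetical, and `search_fn` is assumed to return a list of dicts already sorted in descending order on the sort field (for example a thin wrapper around `bucket.search`). It merges on an explicit field because relevance scores from separate indexes are not directly comparable.

```python
import heapq
from itertools import islice


def search_all(search_fn, indexes, query, sort_field, rows):
    """Query every index and return the top `rows` documents overall.

    Assumption: search_fn(index, query, sort, rows) returns a list of
    dicts sorted in descending order of `sort_field`.
    """
    per_index = [
        search_fn(index, query, "%s desc" % sort_field, rows)
        for index in indexes
    ]
    # Each input list is already sorted descending, so a k-way merge
    # yields a globally sorted stream without re-sorting everything.
    merged = heapq.merge(*per_index, key=lambda d: d[sort_field], reverse=True)
    return list(islice(merged, rows))
```

Each index has to be asked for `rows` results, so the cost grows with the number of languages, and deep paging would need even more rows fetched per index.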