Open realdigger opened 4 years ago
+1 Himself faced this today With a default config is not looking for Russian words
It would be better if this charset_table would detect if your using UTF8 or not. If using UTF-8 it should build a charset safe for UTF-8.
@Yariksat and @realdigger
Can you guys try the following in your configs. See if this gets support working.
charset_table = U+FF10..U+FF19->0..9, U+FF21..U+FF3A->a..z, \
U+FF41..U+FF5A->a..z, 0..9, A..Z->a..z, _, a..z, \
\
U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\
U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101,\
U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109,\
U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F,\
U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, U+0116->U+0117,\
U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D, U+011D,\
U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, U+0134->U+0135,\
U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, U+013C,\
U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, U+0143->U+0144,\
U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, U+014B,\
U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, U+0152->U+0153,\
U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159, U+0159,\
U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, U+0160->U+0161,\
U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, U+0167,\
U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, U+016E->U+016F,\
U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175, U+0175,\
U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, U+017B->U+017C,\
U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, U+0430..U+044F,\
U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, U+0621..U+063A, U+01B9,\
U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, U+0671..U+06D3, U+06F0..U+06FF,\
U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, U+0966..U+096F, U+097B..U+097F,\
U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, U+0A05..U+0A39, U+0A59..U+0A5E,\
U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, U+0AE6..U+0AEF, U+0B05..U+0B39,\
U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, U+0BE6..U+0BF2, U+0C05..U+0C39,\
U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60,\
U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, U+1900..U+1938, U+1946..U+194F, U+A800..U+A805,\
U+A807..U+A822, U+0386->U+03B1, U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5,\
U+0389->U+03B7, U+03AE->U+03B7, U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9,\
U+03AF->U+03B9, U+03CA->U+03B9, U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5,\
U+03AB->U+03C5, U+03B0->U+03C5, U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9,\
U+03CE->U+03C9, U+03C2->U+03C3, U+0391..U+03A1->U+03B1..U+03C1,\
U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, U+03C3..U+03C9,\
U+0E01..U+0E3A, U+0E3F..U+0E46,\
U+0E47..U+0E4F, U+0E50..U+0E5B,\
U+A000..U+A48F,\
U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, U+3040..U+309F, U+30A0..U+30FF,\
U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, U+3130..U+318F, U+A000..U+A48F,\
U+A490..U+A4CF, \
\
U+410..U+42F->U+430..U+44F, U+430..U+44F, \
\
U+621..U+63a, U+640..U+64a, U+66e..U+66f, \
U+671..U+6d3, U+6d5, U+6e5..U+6e6, U+6ee..U+6ef, \
U+6fa..U+6fc, U+6ff
ngram_len = 1
ngram_chars = U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, \
U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, \
U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, \
U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, \
U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, \
U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, \
U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, \
U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, \
U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, \
U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, \
U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, \
U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, \
U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, \
U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, \
U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, \
U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, \
U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6
@Yariksat and @realdigger
Can you guys try the following in your configs. See if this gets support working.
charset_table = U+FF10..U+FF19->0..9, U+FF21..U+FF3A->a..z, \ U+FF41..U+FF5A->a..z, 0..9, A..Z->a..z, _, a..z, \ \ U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\ U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101,\ U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109,\ U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F,\ U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, U+0116->U+0117,\ U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D, U+011D,\ U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, U+0134->U+0135,\ U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, U+013C,\ U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, U+0143->U+0144,\ U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, U+014B,\ U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, U+0152->U+0153,\ U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159, U+0159,\ U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, U+0160->U+0161,\ U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, U+0167,\ U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, U+016E->U+016F,\ U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175, U+0175,\ U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, U+017B->U+017C,\ U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, U+0430..U+044F,\ U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, U+0621..U+063A, U+01B9,\ U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, U+0671..U+06D3, U+06F0..U+06FF,\ U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, U+0966..U+096F, U+097B..U+097F,\ U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, U+0A05..U+0A39, U+0A59..U+0A5E,\ U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, U+0AE6..U+0AEF, U+0B05..U+0B39,\ U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, U+0BE6..U+0BF2, U+0C05..U+0C39,\ U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60,\ U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, U+1900..U+1938, U+1946..U+194F, U+A800..U+A805,\ U+A807..U+A822, U+0386->U+03B1, U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5,\ U+0389->U+03B7, U+03AE->U+03B7, U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9,\ U+03AF->U+03B9, U+03CA->U+03B9, U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5,\ U+03AB->U+03C5, U+03B0->U+03C5, U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9,\ U+03CE->U+03C9, U+03C2->U+03C3, U+0391..U+03A1->U+03B1..U+03C1,\ U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, U+03C3..U+03C9,\ U+0E01..U+0E3A, U+0E3F..U+0E46,\ U+0E47..U+0E4F, U+0E50..U+0E5B,\ U+A000..U+A48F,\ U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, U+3040..U+309F, U+30A0..U+30FF,\ U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, U+3130..U+318F, U+A000..U+A48F,\ U+A490..U+A4CF, \ \ U+410..U+42F->U+430..U+44F, U+430..U+44F, \ \ U+621..U+63a, U+640..U+64a, U+66e..U+66f, \ U+671..U+6d3, U+6d5, U+6e5..U+6e6, U+6ee..U+6ef, \ U+6fa..U+6fc, U+6ff ngram_len = 1 ngram_chars = U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, \ U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, \ U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, \ U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, \ U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, \ U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, \ U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, \ U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, \ U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, \ U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, \ U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, \ U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, \ U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, \ U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, \ U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, \ U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, \ U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6
Thank. It works, I will observe
It could worth to have a look at https://github.com/phpbb/phpbb/pull/5815. Just note that the phpBB implementation is using the older SphinxAPI instead of SphinxQL.
Just to clarify, they have a different licensing model which means we can't borrow any code from them without it being noncompliant with the GPL and 3 Clause BSD. However the configuration of the Sphinx is not something that would be considered licensable. The difference between the API and QL usage doesn't matter here as this is more how Sphinx is setup to process the data.
Looking at what they set, the charset looks very familiar with some different options elsewhere that we are not setting. If we are finding those options are needed or fix other issues, they should be added. I think the requested test for the updated charset should get things rolling properly.
charset_table = 0..9, A..Z->a..z, _, a..z
Would be better to remove this line or make it optional, because it limit index to latin charset only. Default value for charset_table are latin and cyrillic characters.
@realdigger Here is good config for sphinxserach indexes creation: https://github.com/anilibria/docs/blob/master/install/sphinx.md https://adw0rd.com/2009/6/15/sphinxsearch/
So after looking into this and thinking. I think the best solution is to
The config is supposed to be hints to get things working, not be a OOB solution.
http://sphinxsearch.com/docs/current/conf-charset-table.html http://sphinxsearch.com/wiki/doku.php?id=charset_tables https://manticoresearch.com/blog/manticore-search-3-years-after-forking-from-sphinx/ https://manticoresearch.com/blog/default-charset-tables-and-stopwords-files/
charset_table = 0..9, A..Z->a..z, _, a..z
Would be better to remove this line or make it optional, because it limit index to latin charset only. Default value for charset_table are latin and cyrillic characters.