SimpleMachines / sphinx-for-smf

Sphinx Search Engine for SMF

charset_table #16

Open realdigger opened 4 years ago

realdigger commented 4 years ago

charset_table = 0..9, A..Z->a..z, _, a..z

It would be better to remove this line or make it optional, because it limits the index to the Latin charset only. The default value of charset_table covers both Latin and Cyrillic characters.
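
To illustrate why a Latin-only table hides Russian words, here is a minimal Python sketch of how a charset_table value folds characters. It is a simplified model, not Sphinx's tokenizer: the function names are hypothetical, only the `a`, `a..z`, `A->a` and `A..Z->a..z` entry forms are handled, and characters missing from the table are treated as word separators.

```python
def parse_char(token):
    """Turn 'U+410' or a literal character such as 'a' into a code point."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    return ord(token)


def parse_range(spec):
    """Parse 'a..z' or a single character into an inclusive (start, end) pair."""
    if ".." in spec:
        start, end = spec.split("..")
        return parse_char(start), parse_char(end)
    code = parse_char(spec)
    return code, code


def parse_charset_table(table):
    """Build a {source code point: folded code point} map (simplified)."""
    mapping = {}
    for entry in table.split(","):
        entry = entry.strip()
        if not entry:
            continue
        src, _, dst = entry.partition("->")
        src_start, src_end = parse_range(src)
        # An entry without '->' maps every character to itself.
        dst_start, _ = parse_range(dst) if dst else (src_start, src_end)
        for offset in range(src_end - src_start + 1):
            mapping[src_start + offset] = dst_start + offset
    return mapping


def tokenize(text, table):
    """Fold characters through the table; unlisted characters become separators."""
    mapping = parse_charset_table(table)
    folded = "".join(
        chr(mapping[ord(c)]) if ord(c) in mapping else " " for c in text
    )
    return folded.split()


LATIN_ONLY = "0..9, A..Z->a..z, _, a..z"
WITH_CYRILLIC = LATIN_ONLY + ", U+410..U+42F->U+430..U+44F, U+430..U+44F"

print(tokenize("Hello Привет", LATIN_ONLY))     # ['hello']
print(tokenize("Hello Привет", WITH_CYRILLIC))  # ['hello', 'привет']
```

With the Latin-only table the Russian word never becomes a token, so it cannot be found.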

Yariksat commented 4 years ago

+1. I ran into this myself today: with the default config, searching for Russian words finds nothing.

jdarwood007 commented 4 years ago

It would be better if the config generator detected whether you're using UTF-8 or not. If you are, it should build a charset_table that is safe for UTF-8.
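
One way to read this suggestion, as a hedged Python sketch: the generator picks the charset_table line from the forum's character set. `build_charset_table` and the `is_utf8` flag are hypothetical names, the actual sphinx-for-smf code is PHP and may work differently, and the Cyrillic range is only one example of what a UTF-8 table could add.

```python
# Hypothetical helper, not part of sphinx-for-smf.
LATIN = "0..9, A..Z->a..z, _, a..z"
# Example extension for UTF-8 forums: fold uppercase Cyrillic to lowercase.
CYRILLIC = "U+410..U+42F->U+430..U+44F, U+430..U+44F"


def build_charset_table(is_utf8):
    """Return the charset_table line to write into the generated config."""
    parts = [LATIN]
    if is_utf8:
        # Assumption: non-Latin ranges are only added when the forum uses UTF-8.
        parts.append(CYRILLIC)
    return "charset_table = " + ", ".join(parts)


print(build_charset_table(False))
print(build_charset_table(True))
```

Further scripts could be appended to `parts` in the same way.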

jdarwood007 commented 4 years ago

@Yariksat and @realdigger

Can you both try the following in your configs and see if it gets this working?

charset_table = U+FF10..U+FF19->0..9, U+FF21..U+FF3A->a..z, \
        U+FF41..U+FF5A->a..z, 0..9, A..Z->a..z, _, a..z, \
\
U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\
U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101,\
U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109,\
U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F,\
U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, U+0116->U+0117,\
U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D, U+011D,\
U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, U+0134->U+0135,\
U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, U+013C,\
U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, U+0143->U+0144,\
U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, U+014B,\
U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, U+0152->U+0153,\
U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159, U+0159,\
U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, U+0160->U+0161,\
U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, U+0167,\
U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, U+016E->U+016F,\
U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175, U+0175,\
U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, U+017B->U+017C,\
U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, U+0430..U+044F,\
U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, U+0621..U+063A, U+01B9,\
U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, U+0671..U+06D3, U+06F0..U+06FF,\
U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, U+0966..U+096F, U+097B..U+097F,\
U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, U+0A05..U+0A39, U+0A59..U+0A5E,\
U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, U+0AE6..U+0AEF, U+0B05..U+0B39,\
U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, U+0BE6..U+0BF2, U+0C05..U+0C39,\
U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60,\
U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, U+1900..U+1938, U+1946..U+194F, U+A800..U+A805,\
U+A807..U+A822, U+0386->U+03B1, U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5,\
U+0389->U+03B7, U+03AE->U+03B7, U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9,\
U+03AF->U+03B9, U+03CA->U+03B9, U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5,\
U+03AB->U+03C5, U+03B0->U+03C5, U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9,\
U+03CE->U+03C9, U+03C2->U+03C3, U+0391..U+03A1->U+03B1..U+03C1,\
U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, U+03C3..U+03C9,\
U+0E01..U+0E3A, U+0E3F..U+0E46,\
U+0E47..U+0E4F, U+0E50..U+0E5B,\
U+A000..U+A48F,\
U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, U+3040..U+309F, U+30A0..U+30FF,\
U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, U+3130..U+318F, U+A000..U+A48F,\
U+A490..U+A4CF, \
\
U+410..U+42F->U+430..U+44F, U+430..U+44F, \
\
U+621..U+63a, U+640..U+64a, U+66e..U+66f, \
U+671..U+6d3, U+6d5, U+6e5..U+6e6, U+6ee..U+6ef, \
U+6fa..U+6fc, U+6ff

ngram_len = 1
ngram_chars = U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, \
    U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, \
    U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, \
    U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, \
    U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, \
    U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, \
    U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, \
    U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, \
    U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, \
    U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, \
    U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, \
    U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, \
    U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, \
    U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, \
    U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, \
    U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, \
    U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6
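
For readers unfamiliar with the ngram settings: with ngram_len = 1, a run of characters listed in ngram_chars is indexed as if each character were a separate word. The Python sketch below models only that splitting step. The function names are hypothetical, the sample uses just two of the ranges above, and charset_table folding is deliberately left out.

```python
def parse_ranges(value):
    """Parse 'U+4E00..U+9FBB, U+FA0E' into inclusive (start, end) code point pairs."""
    ranges = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        start, _, end = entry.partition("..")
        first = int(start.strip()[2:], 16)              # drop the 'U+' prefix
        last = int(end.strip()[2:], 16) if end else first
        ranges.append((first, last))
    return ranges


def split_unigrams(text, ngram_chars):
    """Put every ngram character into its own token; leave other text alone."""
    ranges = parse_ranges(ngram_chars)
    pieces = []
    for ch in text:
        is_ngram = any(lo <= ord(ch) <= hi for lo, hi in ranges)
        pieces.append(" " + ch + " " if is_ngram else ch)
    return "".join(pieces).split()


SAMPLE_NGRAM_CHARS = "U+4E00..U+9FBB, U+3400..U+4DB5"
print(split_unigrams("SMF 搜索引擎", SAMPLE_NGRAM_CHARS))
# ['SMF', '搜', '索', '引', '擎']
```
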

Yariksat commented 4 years ago

Thanks. It works; I'll keep an eye on it.

iasdeoupxe commented 4 years ago

It could be worth having a look at https://github.com/phpbb/phpbb/pull/5815. Just note that the phpBB implementation uses the older SphinxAPI instead of SphinxQL.

jdarwood007 commented 4 years ago

Just to clarify, they have a different licensing model, which means we can't borrow any code from them without being noncompliant with the GPL and the 3-Clause BSD license. However, the Sphinx configuration is not something that would be considered licensable. The difference between API and QL usage doesn't matter here, as this is more about how Sphinx is set up to process the data.

Looking at what they set, the charset looks very similar, with some different options elsewhere that we are not setting. If we find that those options are needed or fix other issues, they should be added. I think the requested test of the updated charset should get things rolling properly.

iCr commented 3 years ago

@realdigger Here is a good config for creating Sphinx search indexes: https://github.com/anilibria/docs/blob/master/install/sphinx.md https://adw0rd.com/2009/6/15/sphinxsearch/

jdarwood007 commented 1 year ago

So after looking into this and thinking. I think the best solution is to

The config is supposed to provide hints to get things working, not to be an out-of-the-box (OOB) solution.

http://sphinxsearch.com/docs/current/conf-charset-table.html
http://sphinxsearch.com/wiki/doku.php?id=charset_tables
https://manticoresearch.com/blog/manticore-search-3-years-after-forking-from-sphinx/
https://manticoresearch.com/blog/default-charset-tables-and-stopwords-files/