Open GoogleCodeExporter opened 9 years ago
Original comment by alex.zif...@gmail.com
on 31 Aug 2009 at 7:25
The cause of the seemingly wrong behaviour of the _tcsnicmp function is that it
depends on the current locale, which is set to the "C" locale by default and
needs to be changed with setlocale before the numerous C run-time functions
that depend on it behave as expected.
It seems that Miranda doesn't do this, and if it uses those functions it will
sooner or later suffer from being non-localizable in a sense. It also seems
impractical to use locale-dependent functions to handle Unicode strings that
should be treated in a prescribed way (like the STRINGPREP spec for IDN). It's
possible that a contact list will contain users from different countries whose
jids have letters from different languages. In that case, even using
CompareString (which also depends on locale) or setlocale won't solve the
problem. My proposal is to move to GNU libidn or some other suitable library in
a future release.
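To illustrate the default described above, here is a minimal, portable sketch
in standard C++ (not Miranda code). The helper name current_c_locale is
hypothetical. It only queries the C runtime locale, which the C standard sets
to "C" at program start-up:

```cpp
#include <clocale>
#include <string>

// Hypothetical helper (not part of Miranda or the CRT): report the name of
// the C runtime locale currently in effect. Passing nullptr as the second
// argument queries the locale without changing it.
std::string current_c_locale() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}
```

In a fresh process this returns "C". After a successful call to
setlocale(LC_ALL, "") it returns the name of the user's environment locale
instead, and that is the state the locale-dependent comparison functions
discussed here rely on.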
Original comment by mikekaga...@gmail.com
on 1 Sep 2009 at 6:29
You are incorrect. Miranda core does change current locale to C++ one, so it
should handle Unicode chars correctly, although if you do mix compilers (and
this is what you do) this is not going to work. But the official distribution
has everything compiled with the same compiler, so there is no problem.
And BTW Windows lstrcmpi is locale independent.... So no need for GNU stuff.
Original comment by borkra
on 1 Sep 2009 at 9:12
I disagree.
1. [quote]Miranda core does change current locale to C++[/quote] - what does it
mean? There's no locale called "C++" in either the C standard or the C++
standard. The C locale is the one used in the C (and C++) alphabet definition,
and as such it knows about ASCII characters only (0..127). C++ has its own
representation of the locale concept (naturally, a class with broader
responsibilities), but the locale names (and meanings) are the same: the
default "C++ locale" is "C". So it's vital either to use only functions that
depend on one chosen representation (C or C++, as they are independent) or to
set both to the same value.
2. [quote]if you do mix compilers (and this is what you do)[/quote] - I use MS
Visual Studio 2008 (and the default config converted from the .dsw provided in
the source).
3. [quote]And BTW Windows lstrcmpi is locale independent[/quote] -
http://msdn.microsoft.com/en-us/library/ms647489(VS.85).aspx (note I'm quoting
Microsoft): "The function calls CompareString, _using_ _the_ _current_ _thread_
_locale_" (underscore is mine). So it's locale-dependent. But again, I didn't
argue against lstrcmpi or CompareString in the first place; in fact, I fixed
the code with CompareString, which is the workhorse of lstrcmpi. What I said is
that the ORIGINAL Miranda code uses different functions to do the same thing,
in this case case-insensitive string comparison.
4. [quote]So no need for GNU stuff[/quote] - Well, I'm not sure if your post is
against GNU, but the library I propose is under the LGPL, which is compatible
with your license, and it's well documented. The idea is that I may add people
from all over the globe to my contact list, all possibly having localized jids
as the standard permits (one may have hieroglyphs, another Arabic letters, yet
another umlauts, etc.). When Miranda case-insensitively compares the jid of an
incoming message to a jid in its contact list using any function that depends
on a locale, it will inevitably get false negatives and thus add extra copies
of contacts. Naturally, such events are exceptionally rare now, as there are
not many localized jids in the world yet. But as the protocol (I hope) becomes
more popular, they will become more frequent, so it's better to prepare and be
standards-compliant.
I'm not saying this is the first priority, but some future version should do
IDN processing the way the standard prescribes. I just got curious about how
this SHOULD be done, and yesterday I ran across this library. I haven't yet
implemented libidn support in Miranda, but when (if) I do, I'll post the result
here.
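On point 1 above, a small sketch in standard C++ may help show that the C
locale and the C++ global locale are two separate pieces of state that merely
share the same default name. The helper names c_locale_name and
cpp_locale_name are made up for this illustration:

```cpp
#include <clocale>
#include <locale>
#include <string>

// Name of the C runtime locale, i.e. the state that setlocale() controls.
std::string c_locale_name() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}

// Name of the C++ global locale, i.e. the state that std::locale::global()
// controls. A default-constructed std::locale is a copy of the global one.
std::string cpp_locale_name() {
    return std::locale().name();
}
```

At program start-up both report "C". As far as I know, calling setlocale()
later changes only the first of the two, while std::locale::global() with a
named locale also updates the C locale, so code that mixes C and C++ text
functions has to keep them in step itself.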
Original comment by mikekaga...@gmail.com
on 1 Sep 2009 at 10:06
Mike, that was a very interesting passage (especially about people from all
over the globe), but Miranda already uses Unicode mapping to compare Unicode
strings.
1. First of all, Miranda resets the locale, as described in the MSDN:
setlocale(LC_ALL,"C"); // in effect by default
printf("\n%d",_wcsicmp(L"ä", L"Ä")); // compare fails
setlocale(LC_ALL,"");
printf("\n%d",_wcsicmp(L"ä", L"Ä")); // compare succeeds
You can find the call to setlocale(LC_ALL,"") yourself, it's easy.
So after that, wcsicmp will use the built-in table for Unicode chars with codes
0..255, and LCMapString for all other symbols. That is no different from what
your code does.
2. The situation with mixed-case jids has been tested successfully on many
servers, mainly on various ejabberd clones. It definitely works there.
3. If you can provide a network log of such a problem, where Miranda creates a
duplicate contact, your help will be appreciated.
4. You'd better not use the converted DSWs, because they may create a lot of
problems. Use the native VS2008 projects & solutions; you will find them in the
/bin9 folder.
Original comment by george.hazan
on 1 Sep 2009 at 10:52
"3. [quote]And BTW Windows lstrcmpi is locale independent[/quote] -
http://msdn.microsoft.com/en-us/library/ms647489(VS.85).aspx (note I'm quoting
Microsoft): "The function calls CompareString, _using_ _the_ _current_ _thread_
_locale_" (underscore is mine)."
mikekaganski, you did not understand what Microsoft is talking about.
CompareString is locale independent as far as it can be. But while for your
European languages the relationship between upper and lower case letters is 1
to 1, this is not true for Asian languages, where the same lower case letter
can have different upper case letters depending on the language. Hieroglyphic
writing has a completely different set of rules :). So in such cases knowledge
of the locale is essential to make a proper conversion.
I've been using lstrcmpi for a few years in my plugins and know very well that
it works for the complete Unicode range regardless of the locale you are in.
Original comment by borkra
on 2 Sep 2009 at 12:21
2george.hazan
Thank you for finding my writing interesting :)
Yes, I found the code you mentioned (setlocale(LC_ALL,"")) at the very
beginning of Miranda's WinMain function. And this is the only place in Miranda
where it can be found.
Yes, I agree that calling it will solve my case (as I wrote in Comment 2). But
I'm still afraid that Miranda's Jabber dll suffers from having its locale set
to "C". To check whether that's true, I opened the Miranda project (this time I
did as you told me, from the native solution, but from the bin8 directory),
opened the file miranda\protocols\JabberG\jabber.cpp, and added the following
lines at the very beginning of DllMain:
const TCHAR tststr[] = _T("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"); // Russian alphabet lowercase
const TCHAR TSTSTR[] = _T("АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"); // Russian alphabet uppercase
int mk_result = _tcsnicmp(tststr, TSTSTR, -1);
MessageBox(0, (mk_result == 0) ? _T("Equal") : _T("Inequal"), _T("Comparison"),
    MB_OK | MB_ICONINFORMATION | MB_TOPMOST);
And when I compile this dll (Release Unicode configuration, nothing else
changed) and run Miranda with it, I get a message saying "Inequal"! I use
Russian Windows XP with Regional options set to Russian.
I suppose it's because the rtl is stateful and uses some global variables (e.g.
__locale_changed) that are global only within a project, not across the
different projects of a solution. So when the .dll loads, it has its own
locale, different from that of the .exe (I don't know it for sure, it's just
guessing). Anyway, this test clearly shows that _tcsnicmp doesn't work the way
it's expected to.
I will certainly try to get the network log of the case you asked about (but I
don't understand in what way the network log may be useful here).
2borkra
Yes, I understand what you are talking about (and what Microsoft is, too). I
just want to make my point clear:
Yes, if you change the code so that it uses lstrcmpi instead of tcsnicmp, it
will fix MOST of the problems (and all my problems; and I did just this, but
using the underlying CompareString!). But it will fix only those cases where
the jid contains only characters that have a one-to-one upper-to-lowercase
relation (this is exactly what you said, I agree with you). When this is not
the case (as you justly said about Asian languages), the approach will fail and
the comparison will become locale-dependent (and if it concerned a message, it
would be OK, but it is all about the jid, which must be locale-independent and
must be treated as the standard says). I don't think this problem is
high-priority; I just say that there IS a problem that MAY become prominent in
the future.
Thank you for your attention to this problem.
Original comment by mikekaga...@gmail.com
on 2 Sep 2009 at 3:34
1. No, you did not understand what I am saying.
For these Asian languages it's impossible to make a locale-independent compare
no matter what you do, as the upper/lower relationship is one to many. And the
Microsoft API does treat the comparison the way the standard says.
2. If you do not recompile anything with a different compiler and use Miranda
as it is provided, do you still have a problem?
Original comment by borkra
on 2 Sep 2009 at 7:26
Mike,
VS2008 stores all non-ASCII strings as UTF8 by default, that's why they are
unequal :) Just look at these strings in the debugger. Another idea is that you
compiled your dll with the static runtime, and that's why the setlocale() call
inside the code doesn't affect Jabber.
Original comment by george.hazan
on 2 Sep 2009 at 9:03
2borkra
Boris, I think you are wrong, but I do not have enough knowledge in this field
at the moment. I have just an idea, not proof.
I cannot use Miranda in my environment without recompiling Jabber, as it just
won't authorise me (as described in Issue 187). So I cannot answer your
question at the moment.
2george.hazan
I'm sorry to say that, but it's impossible. TCHAR compiles to wchar_t if
UNICODE is defined (and it is in the Release Unicode configuration), and
wchar_t in VC 2008 (as on Windows in general) is a 2-byte type. The text stored
in a wchar_t[] constant is encoded as UTF-16. UTF-8 is a multibyte encoding,
i.e. a character may be represented by a varying number of bytes: 1 byte (for
ASCII), 2 bytes (as for any Russian letter), or up to 4 bytes. Text that is
UTF-8-encoded is stored in char[].
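For reference, the byte counts discussed here can be sketched in a few lines
of standard C++. This assumes UTF-8 as limited by RFC 3629 (at most 4 bytes
per code point); the function name utf8_length is hypothetical:

```cpp
#include <cstdint>

// Number of bytes UTF-8 (as limited by RFC 3629) needs for one code point.
// Returns 0 for values that are not valid Unicode scalar values.
int utf8_length(std::uint32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogate values
    if (cp <= 0x7F)     return 1;  // ASCII
    if (cp <= 0x7FF)    return 2;  // includes Cyrillic (U+0400..U+04FF)
    if (cp <= 0xFFFF)   return 3;  // rest of the Basic Multilingual Plane
    if (cp <= 0x10FFFF) return 4;  // supplementary planes
    return 0;                      // beyond the Unicode range
}
```

For example, the Russian letter U+0430 takes 2 bytes in UTF-8, while in a
wchar_t[] constant on Windows the same letter is a single 16-bit unit holding
the value 0x0430.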
Another thing is that I have not changed anything in the project settings, as
I mentioned. The Runtime library is still set to "Multi-threaded DLL (/MD)",
and the generated dll is 670208 B in size, even less than the one that ships in
the official package (697963 B). If it were linked statically, it would be
significantly larger.
Are you saying that if you compile it yourself you get different results? :)
Original comment by mikekaga...@gmail.com
on 2 Sep 2009 at 9:59
Oh, I forgot to say that if I add setlocale() before the strings in DllMain it
compares correctly, so the code
const TCHAR tststr[] = _T("абвгдеёжзийклмнопрстуфхцчшщъыьэюя");
const TCHAR TSTSTR[] = _T("АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ");
int mk_result = _tcsnicmp(tststr, TSTSTR, -1);
MessageBoxA(0, (mk_result == 0) ? "Eq" : "NEq", setlocale(LC_ALL, NULL),
MB_OK | MB_ICONINFORMATION);
setlocale(LC_ALL, "");
mk_result = _tcsnicmp(tststr, TSTSTR, -1);
MessageBoxA(0, (mk_result == 0) ? "Eq" : "NEq", setlocale(LC_ALL, NULL),
MB_OK | MB_ICONINFORMATION);
brings up a first message box saying "NEq" with the caption "C", and a second
one saying "Eq" with the caption "Russian_Russia.1251".
Original comment by mikekaga...@gmail.com
on 2 Sep 2009 at 10:06
Of course it's possible, that's what VS2008 does by default. You just have a
set of utf-8 encoded wchars, and they are evidently different. Illustration is
here: http://img90.imageshack.us/img90/9262/clipboard.png
If the symbols are in valid 1251 encoding, then all is OK without any
additional setlocale() calls:
http://img17.imageshack.us/img17/2185/clipboardcao.png
So you're doing smth wrong...
Original comment by george.hazan
on 2 Sep 2009 at 10:36
> Boris, I think you are wrong, but I do not have enough knowledge in this
> field at the moment. I have just an idea, not proof.
Oh, I am sure I am right. Unicode was designed to make all chars in the world
representable, not to make locale-independent lexical conversions.
George, if Jabber is recompiled with VC2008 and Miranda is left compiled with
VC9, setlocale in miranda32 will do nothing, as different RTLs are used.
Original comment by borkra
on 2 Sep 2009 at 4:25
Oh, I see the discussion is becoming a little too theoretical.
2George. It's not utf8, it's the Windows native encoding, which is 16-bit
Unicode as it was standardized when Windows NT was created. Back then it was
strictly 16-bit. (Note that Unicode isn't 16-bit only now!) The encoding is
almost identical to UTF-16 (except for its "private area" of 2048 16-bit
values). It may look similar to utf-8, but it's not the same. If needed, we can
discuss it in private (mikekaganski@gmail.com).
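The 2048 reserved 16-bit values mentioned above presumably correspond to the
range 0xD800..0xDFFF, which UTF-16 uses as surrogates to reach code points
beyond U+FFFF. A rough sketch of that encoding step in standard C++, with the
hypothetical names Utf16Units and encode_utf16:

```cpp
#include <cstdint>

// Result of encoding one code point: up to two 16-bit units.
struct Utf16Units {
    std::uint16_t unit[2];
    int count;  // 0 = not encodable, 1 = single unit, 2 = surrogate pair
};

Utf16Units encode_utf16(std::uint32_t cp) {
    Utf16Units out = {{0, 0}, 0};
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;  // outside Unicode, or a lone surrogate value
    }
    if (cp <= 0xFFFF) {
        out.unit[0] = static_cast<std::uint16_t>(cp);
        out.count = 1;
        return out;
    }
    cp -= 0x10000;  // 20 bits remain; split them 10/10
    out.unit[0] = static_cast<std::uint16_t>(0xD800 + (cp >> 10));    // high
    out.unit[1] = static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF));  // low
    out.count = 2;
    return out;
}
```

Every character in the Russian test strings from this thread fits in one
unit, so for them the 16-bit values are simply the Unicode code points.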
2Boris. Well, you are partially right. Unicode was designed to contain and
encode symbols, and locales exist to help compare them etc. But! Any given
locale defines a strict case-insensitive comparison. That is, if you use one
locale, any text containing any Unicode symbol can be unambiguously compared to
any other text. The problem discussed here is that
_if_we_use_different_locales_, we cannot get unambiguous results on different
machines. You say it's impossible to make such a comparison unambiguous
universally. You are wrong. It's enough to invent a universal "locale" and
demand its use for SOME SPECIFIC texts (as the jids surely are). And there IS
such a "locale" (you may think of a STRINGPREP profile as a kind of locale). It
is specified in RFC 3920 and MUST be used to handle jids, as stated in the
standard. And it's not the same as what the Windows CompareString API uses
(though MS could create some Windows locale ID to tell CompareString to compare
according to this RFC, but they don't care about XMPP). MS has some API to
handle Stringprep and IDN, but it's distinct from the API you use in Miranda. A
simple search for "idn" on microsoft.com brings up some examples. I have no
knowledge of how to use that API, but its very existence means a lot.
Meanwhile, it's enough to use some technique that avoids the "C" locale in
internationalized string processing (either by calling setlocale() in each
module, or by using functions that don't rely on C locales, or some other
approach).
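The idea of a universal "locale" can be sketched as a fixed mapping table that
gives the same answer on every machine, whatever setlocale() says. The code
below is only an illustration under strong assumptions: it covers just ASCII
and the basic Cyrillic block, it is NOT the stringprep mapping from the RFCs,
and the names fold_char and equal_folded are invented here. A real
implementation would use the full tables, for instance through a library such
as libidn.

```cpp
#include <cstddef>
#include <string>

// Fixed, locale-independent lower-casing for a tiny subset of Unicode.
// Only ASCII and basic Cyrillic are handled; everything else is unchanged.
char32_t fold_char(char32_t c) {
    if (c >= U'A' && c <= U'Z')     return c + 0x20;  // A..Z -> a..z
    if (c >= 0x0410 && c <= 0x042F) return c + 0x20;  // U+0410..042F -> U+0430..044F
    if (c >= 0x0400 && c <= 0x040F) return c + 0x50;  // U+0400..040F -> U+0450..045F
    return c;
}

// Compare two strings of code points after folding each character.
bool equal_folded(const std::u32string& a, const std::u32string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (fold_char(a[i]) != fold_char(b[i])) return false;
    }
    return true;
}
```

Because the table is part of the program rather than of the runtime's locale
state, two modules built with different runtimes would still agree on the
result, which is the property wanted for jids.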
It's strange how the discussion is going. It seems like the only thing you guys
do is try to convince me that Miranda is OK and I'm wrong. Hey, I don't want to
say anything bad about Miranda! You make great software; I tested many programs
to choose one to use in my company, and I chose Miranda as the best one.
But when I say there's a problem (and don't just show the problem, but also
spend time debugging it and trying to fix it, and thus try to help you), you
say "you're doing smth wrong". When I use my right to recompile the code to fit
my needs, you say it breaks the software (if so, it's the software's fault! If
it relies on some presumptions such as the same compiler, then this should be
stated explicitly, or avoided as much as possible). When I say the function
compares texts incorrectly and show an example, you say "compiler uses wrong
encoding" or start arguing about the use of ANOTHER function (the one I
originally proposed to use!). What's wrong? I personally have already fixed all
the problems in my private build of Miranda that I use for business (by the
way, I recompiled the Miranda main module, too, and this alone hasn't fixed the
locale problem). I'm trying to make Miranda better, as no one person is able to
make a perfect program of Miranda's scale by himself. I don't mean to offend, I
just want to help.
Sorry for the offtopic; if it makes sense, we could discuss this problem in
private.
Original comment by mikekaga...@gmail.com
on 2 Sep 2009 at 10:20
This is what is shown in my environment. It's clearly visible that the codes of
the characters are Unicode codes, not utf-8 encoding.
Original comment by mikekaga...@gmail.com
on 2 Sep 2009 at 11:39
Attachments:
And you are wrong, stop recompiling Jabber and your problem will go away.
Original comment by borkra
on 3 Sep 2009 at 3:13
Ok, let's conclude:
1. jabber.dll compares strings correctly if it's compiled with the same version
of the rtl that Miranda uses and they both use the dynamic rtl, so it's
mandatory to recompile them together if a recompilation is needed (and this may
apply to other plugins/protocols that ship with Miranda).
2. If you write your own plugin for Miranda, you must either manually make sure
you use the proper locale, or require compiling with Miranda to make use of
Miranda's locale settings.
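For the first option in point 2, a plugin could check its own runtime's locale
early in its start-up code instead of relying on the host process. The sketch
below is written in portable standard C++ and is an assumption about how this
might be done, not code from Miranda; ensure_user_locale is a hypothetical
name.

```cpp
#include <clocale>
#include <string>

// If this module's runtime is still in the default "C" locale, switch it to
// the user's environment locale. Returns the name of the locale in effect
// after the call.
std::string ensure_user_locale() {
    const char* current = std::setlocale(LC_ALL, nullptr);
    if (current && std::string(current) == "C") {
        // "" selects the locale configured in the environment. On failure
        // setlocale returns nullptr and leaves the locale unchanged.
        std::setlocale(LC_ALL, "");
    }
    const char* now = std::setlocale(LC_ALL, nullptr);
    return now ? now : "";
}
```

This only addresses which runtime state the plugin sees. It does not make the
comparison itself locale-independent, which was the separate concern raised
earlier about jids.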
Original comment by mikekaga...@gmail.com
on 7 Sep 2009 at 12:10
Original issue reported on code.google.com by
mikekaga...@gmail.com
on 31 Aug 2009 at 1:55
Attachments: