Duplicate contacts of a jid containing non-English letters

GoogleCodeExporter commented 9 years ago

Miranda Version                  : 0.8.5
Unicode Build                    : Yes
Jabber Plugin Version #          : 0.8.4.0

What steps will reproduce the problem?
1. Add a contact whose username contains mixed-case non-English
(specifically Russian) letters to the contact list.

What is the expected result?
We expect the contact to behave normally.

What happens instead?
The contact is added, but after a while (after the contact authorisation
or some other use of the contact, even after the state notification) a
ghost contact is created that is the same as the original but has only
lower-case letters. One of them is shown as offline. You cannot delete
either one without the other becoming unauthorised, and if you later
authorise the remaining one, the extra copy appears again (no matter which
one was deleted).

This is an old bug (http://bugs-archive.miranda-im.org/view.php?id=683) that
prevents using Miranda in corporate environments where usernames and
passwords are permitted to be non-English.
The cause and resolution are also known (see the page above or the attached
file).

Original issue reported on code.google.com by mikekaga...@gmail.com on 31 Aug 2009 at 1:55

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by alex.zif...@gmail.com on 31 Aug 2009 at 7:25

Changed state: Assigned
Added labels: Component-Protocol-Jabber

GoogleCodeExporter commented 9 years ago

The cause of the seemingly wrong behaviour of the _tcsnicmp function is that it
depends on the current locale, which is set to the "C" locale by default and
needs to be changed by setlocale before the numerous C run-time functions that
depend on it behave as expected.
It seems Miranda doesn't do this, and if it uses those functions it will sooner
or later suffer from being non-localizable in a sense. It also seems impractical
to use locale-dependent functions to handle Unicode strings that should be
treated in a prescribed way (like the STRINGPREP spec for IDN). A contact list
may well contain users from different countries whose jids have letters from
different languages. In that case, even using CompareString (which also depends
on locale) or setlocale won't solve the problem. My proposal is to move to GNU
libidn or some other suitable library in a future release.
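
To illustrate the point about the default locale, here is a minimal portable
sketch in standard C++ (no Windows API). The helper names are made up for this
example and are not Miranda code, and std::towlower is only a stand-in for what
the CRT's _tcsnicmp does internally. The C standard guarantees that a program
starts in the "C" locale, and in that locale only the basic character set is
guaranteed to be case-mapped:

```cpp
#include <clocale>
#include <cstddef>
#include <cwctype>
#include <string>

// Name of the locale currently in effect for all categories.
// Passing a null pointer only queries the setting; nothing is changed.
std::string active_locale_name() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}

// Case-insensitive equality that depends on the current LC_CTYPE locale.
// ASSUMPTION: hypothetical helper for illustration only.
bool ci_equal_current_locale(const std::wstring& a, const std::wstring& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // towlower consults the active locale; under "C" only ASCII
        // letters are guaranteed to be mapped.
        if (std::towlower(static_cast<std::wint_t>(a[i])) !=
            std::towlower(static_cast<std::wint_t>(b[i])))
            return false;
    }
    return true;
}
```

Before any setlocale call, active_locale_name() reports "C", so whether two
non-ASCII strings compare as equal is left to the runtime and to whichever
locale gets selected later.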

Original comment by mikekaga...@gmail.com on 1 Sep 2009 at 6:29

GoogleCodeExporter commented 9 years ago

You are incorrect: Miranda core does change current locale to C++ one, so it
should handle Unicode chars correctly, although if you do mix compilers (and
this is what you do) this is not going to work. But the formal distribution has
everything compiled with the same compiler, so there is no problem.

And BTW Windows lstrcmpi is locale independent.... So no need for GNU stuff.

Original comment by borkra on 1 Sep 2009 at 9:12

GoogleCodeExporter commented 9 years ago

I disagree.
1. [quote]Miranda core does change current locale to C++[/quote] - what does it
mean? There's no locale called "C++" in either the C standard or the C++
standard. The C locale is the one used in the C (and C++) alphabet definition,
and as such it knows about ASCII characters only (0..127). C++ has its own
representation of the locale concept, naturally a class with broader
responsibilities, but the locale names (and meanings) are the same - the default
"C++ locale" is "C". And naturally, it's vital to either use only those
functions that depend on one chosen representation (C or C++, as they are
independent) or set them both to the same value.
2. [quote]if you do mix compilers (and this is what you do)[/quote] - I use MS
Visual Studio 2008 (and the default config converted from the .dsw provided in
the source).
3. [quote]And BTW Windows lstrcmpi is locale independent[/quote] -
http://msdn.microsoft.com/en-us/library/ms647489(VS.85).aspx (note I'm quoting
Microsoft): "The function calls CompareString, _using_ _the_ _current_ _thread_
_locale_" (underscore is mine). So it's locale-dependent (but again, I haven't
argued against lstrcmpi or CompareString in the beginning; in fact, I fixed the
code with CompareString, which is the workhorse of lstrcmpi). What I have said
is that the ORIGINAL code of Miranda uses different functions to do the same
thing - in this case, case-insensitive string comparison.
4. [quote]So no need for GNU stuff[/quote] - Well, I'm not sure if your post is
against GNU, but the library I propose is under the LGPL, which is compatible
with your license, and it's well documented. The idea here is that I may add to
my contact list people from all over the globe, all possibly having localized
jids as the standard permits (one may have hieroglyphs, another Arabic letters,
yet another umlauts etc.), and when Miranda compares the jid of an incoming
message case-insensitively to a jid in its contact list using any function that
depends on a locale, it will inevitably get false-negative answers, thus adding
extra copies of contacts. Naturally enough, these events are exceptionally rare
now, as there are not too many localized jids in the world. But as the protocol
becomes (I hope) more popular, these things will become more frequent, so it's
better to prepare and be standards-compliant.
I'm not saying this is the first priority, but some future version should do
IDN processing in the way the Standard prescribes. I just got curious about how
this SHOULD be done, and yesterday I ran across this library - I haven't yet
implemented libidn support in Miranda, but when (if) I do, I'll post the result
here.
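
To make "locale-independent comparison" concrete, here is a deliberately
simplified sketch in portable C++. It is NOT an implementation of Stringprep:
the real Nodeprep profile from RFC 3920 uses the full RFC 3454 mapping tables
plus NFKC normalization and a list of prohibited characters, and the node,
domain and resource parts of a jid each have their own profile. The function
names and the tiny folding table (ASCII, Latin-1, basic Cyrillic) are
assumptions made for this example only. The point is that the result depends
only on the code points, never on setlocale or the thread locale:

```cpp
#include <string>

// Simplified, locale-independent lower-casing of one code point.
// ASSUMPTION: covers only ASCII, Latin-1 and basic Cyrillic; everything
// else is passed through unchanged. Real Nodeprep needs the RFC 3454 tables.
char32_t fold_code_point(char32_t c) {
    if (c >= U'A' && c <= U'Z') return c + 0x20;          // A..Z
    if (c >= 0x00C0 && c <= 0x00DE && c != 0x00D7)        // Latin-1 capitals,
        return c + 0x20;                                  // skipping U+00D7
    if (c >= 0x0410 && c <= 0x042F) return c + 0x20;      // U+0410..U+042F
    if (c >= 0x0400 && c <= 0x040F) return c + 0x50;      // U+0400..U+040F
    return c;
}

// Fold every code point of a string (hypothetical helper).
std::u32string fold_jid(const std::u32string& jid) {
    std::u32string out;
    out.reserve(jid.size());
    for (char32_t c : jid) out.push_back(fold_code_point(c));
    return out;
}

// Two jids are treated as the same contact if their folded forms match.
bool jid_equal(const std::u32string& a, const std::u32string& b) {
    return fold_jid(a) == fold_jid(b);
}
```

With something like this, a mixed-case Russian node name and its lower-case
form compare as equal on every machine, whatever the regional settings are. A
library such as libidn would supply the complete tables instead of the
hand-written ranges above.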

Original comment by mikekaga...@gmail.com on 1 Sep 2009 at 10:06

GoogleCodeExporter commented 9 years ago

Mike, that was a very interesting passage (especially about people from all
over the globe), but Miranda already uses Unicode mapping to compare Unicode
strings.

First of all, Miranda resets the locale, as described in the MSDN:
  setlocale(LC_ALL,"C");   // in effect by default
  printf("\n%d",_wcsicmp(L"ä", L"Ä"));   // compare fails
  setlocale(LC_ALL,"");
  printf("\n%d",_wcsicmp(L"ä", L"Ä"));   // compare succeeds
You can find the call of setlocale(LC_ALL,"") yourself, it's easy.

So after that wcsicmp will use the built-in table for Unicode chars with codes
0..255, and LCMapString for all other symbols. It makes no difference compared
with your code.

2. The situation with mixed-case jids has been successfully tested on many
servers, mainly on various ejabberd clones. It definitely works there.

3. If you provide a network log of such a problem, where Miranda creates a
duplicate contact, your help will be appreciated.

4. You'd better not use the converted DSWs, because that may create a lot of
problems. Use the native VS2008 projects & solutions; you will find them in the
/bin9 folder.

Original comment by george.hazan on 1 Sep 2009 at 10:52

GoogleCodeExporter commented 9 years ago

"3. [quote]And BTW Windows lstrcmpi is locale independent[/quote] - 
http://msdn.microsoft.com/en-us/library/ms647489(VS.85).aspx (note I'm quoting 
Microsoft): "The function calls CompareString, _using_ _the_ _current_ _thread_ 
_locale_" (underscore is mine)."

mikekaganski, you did not understand what Microsoft is talking about. 

CompareString is locale independent for as much as it can be. But if for your
European languages the relationship between upper and lower case letters is 1
to 1, this is not true for Asian languages, where for the same lower case
letter there could be different upper case letters depending on your language.
Hieroglyphic writing has a completely different set of rules :). So in such
cases knowledge of the locale is essential to make the proper conversion.

I've been using lstrcmpi for a few years in my plugins and know very well that
it does work well for the complete Unicode range regardless of the locale you
are in.

Original comment by borkra on 2 Sep 2009 at 12:21

GoogleCodeExporter commented 9 years ago

2george.hazan
Thank you for finding my writing interesting :)
Yes, I found the code you mentioned (setlocale(LC_ALL,"")) at the very
beginning of Miranda's WinMain function. And this is the only place in Miranda
where it can be found.
Yes, I agree that calling it will solve my case (as I wrote in Comment 2). But
I'm still afraid that Miranda's Jabber dll suffers from having its locale set
to "C". To check whether that's true I opened the Miranda project (this time I
did as you told me, from the native solution, but from the bin8 directory),
opened the file miranda\protocols\JabberG\jabber.cpp, and added the following
lines at the very beginning of DllMain:
    const TCHAR tststr[] = _T("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"); // Russian alphabet lowercase
    const TCHAR TSTSTR[] = _T("АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"); // Russian alphabet uppercase
    int mk_result = _tcsnicmp(tststr, TSTSTR, -1);
    MessageBox(0, (mk_result == 0) ? _T("Equal") : _T("Inequal"),
               _T("Comparison"), MB_OK | MB_ICONINFORMATION | MB_TOPMOST);
And when I compile this dll (Release Unicode configuration, nothing else
changed) and run Miranda with it, I get the message saying "Inequal"! I use
Russian Windows XP with Regional options set to Russian.
I suppose it's because the rtl is stateful, and it uses some global variables
(e.g. __locale_changed) that are global only within a project, not across the
different projects of a solution. So when the .dll is loaded, it has its own
locale, different from that of the .exe (I don't know it for sure, it's just
guessing). Anyway, this test clearly shows that _tcsnicmp doesn't work the way
it's expected to.
I will certainly try to get the network log of the case you asked for (but I
don't understand in what way the network log may be useful here).

2borkra
Yes, I understand what you are talking about (and what Microsoft is, too). I
just want to make my point clear:
Yes, if you change the code so that it uses lstrcmpi instead of _tcsnicmp, it
will fix MOST of the problems (and all my problems; and I did just this, but
using the underlying CompareString!). But it will fix only those cases where
the jid contains only characters that have a one-to-one upper-to-lowercase
relation (this is exactly what you said, I agree with you). But when this is
not the case (as you justly said about Asian languages), the approach will fail
and the comparison will become locale-dependent (and if it concerned a message,
it would be OK, but it is all about the jid, which must be locale-independent
and must be treated as the standard says). I don't think this problem is
high-priority; I just say that there IS a problem that MAY become prominent in
the future.

Thank you for your attention to this problem.

Original comment by mikekaga...@gmail.com on 2 Sep 2009 at 3:34

GoogleCodeExporter commented 9 years ago

1. No, you did not understand what I am saying.

For these Asian languages it's impossible to make a locale-independent compare
no matter what you do, as the upper/lower relationship is one to many. And the
Microsoft API does treat the comparison the way the standard says.

2. If you do not recompile anything with a different compiler and use Miranda
as it is provided, do you still have the problem?

Original comment by borkra on 2 Sep 2009 at 7:26

GoogleCodeExporter commented 9 years ago

Mike,

VS2008 stores all non-ASCII strings as UTF8 by default, that's why they are
unequal :) just look at these strings in the debugger. Another idea is that you
compiled your dll with the static runtime, that's why the setlocale() call
inside the code doesn't affect Jabber.

Original comment by george.hazan on 2 Sep 2009 at 9:03

GoogleCodeExporter commented 9 years ago

2borkra
Boris, I think you are wrong, but I do not have enough knowledge in this field
at the moment. I have just an idea, not proof.
I cannot use Miranda in my environment without recompiling Jabber, as it just
won't authorise me (as described in Issue 187). So I cannot answer your
question at the moment.

2george.hazan
I'm sorry to say that, but it's impossible. TCHAR compiles to wchar_t if
UNICODE is defined (and it is in the Release Unicode configuration), and
wchar_t in VC 2008 (as in most other implementations) is a 2-byte type. The
text that is stored in a wchar_t[] constant is encoded as UTF-16. UTF-8 is a
multibyte encoding, i.e. a char may be represented by a varying number of
bytes: 1 byte (in the case of ASCII), 2 bytes (as is true for any Russian
letter) or up to 4 bytes. Text that is UTF-8-encoded is stored in char[].
Another thing is that I have not changed anything in the project settings, as I
mentioned. And the Runtime library setting still says "Multi-threaded DLL
(/MD)", and the dll that is generated has a size of 670208 B, even less than
the one that ships in the official package (697963 B). If it were compiled
statically it would be significantly larger.
Do you say that if you compile it yourself you will get different results? :)
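
For reference, the size difference between the two encodings can be checked
with a few lines of portable C++. This is only a sketch: utf8_encode and
utf16_units are names invented for this example, and the encoder follows the
1-to-4-byte scheme of RFC 3629. A Russian letter such as U+044F takes two bytes
in UTF-8 but a single 16-bit unit in UTF-16, which is what a wchar_t[] constant
holds on Windows:

```cpp
#include <cassert>
#include <string>

// Encode one Unicode scalar value as UTF-8 (1 to 4 bytes, RFC 3629).
std::string utf8_encode(char32_t cp) {
    // Surrogates and values above U+10FFFF are not scalar values.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Number of 16-bit code units the same scalar value needs in UTF-16.
int utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}
```

So if the debugger shows one 16-bit value per Russian letter in the 0x0400
range, the array holds UTF-16 code units, not UTF-8 bytes.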

Original comment by mikekaga...@gmail.com on 2 Sep 2009 at 9:59

GoogleCodeExporter commented 9 years ago

Oh, I forgot to say that if I add setlocale() before the strings in DllMain it 
compares correctly, so the code

    const TCHAR tststr[] = _T("абвгдеёжзийклмнопрстуфхцчшщъыьэюя");
    const TCHAR TSTSTR[] = _T("АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ");
    int mk_result = _tcsnicmp(tststr, TSTSTR, -1);
    MessageBoxA(0, (mk_result == 0) ? "Eq" : "NEq", setlocale(LC_ALL, NULL),
                MB_OK | MB_ICONINFORMATION);
    setlocale(LC_ALL, "");
    mk_result = _tcsnicmp(tststr, TSTSTR, -1);
    MessageBoxA(0, (mk_result == 0) ? "Eq" : "NEq", setlocale(LC_ALL, NULL),
                MB_OK | MB_ICONINFORMATION);

brings the first message box saying "NEq" and having the caption "C", and the
second saying "Eq" and having the caption "Russian_Russia.1251".

Original comment by mikekaga...@gmail.com on 2 Sep 2009 at 10:06

GoogleCodeExporter commented 9 years ago

Of course it's possible, that's what VS2008 does by default. You just have a
set of utf-8 encoded wchars, and they are evidently different. Illustration is
here:
http://img90.imageshack.us/img90/9262/clipboard.png

If the symbols are in the valid 1251 encoding, then all is Ok without any
additional setlocale() calls:
http://img17.imageshack.us/img17/2185/clipboardcao.png

So you're doing smth wrong...

Original comment by george.hazan on 2 Sep 2009 at 10:36

GoogleCodeExporter commented 9 years ago

> Boris, I think you are wrong, but I do not have enough knowledge in this
> field at the moment. I have just an idea, not proof.

Oh, I am sure I am right. Unicode was designed to make all chars in the world
representable, not to make locale independent lexical conversions.

George, if Jabber is recompiled with VC2008 and Miranda is left compiled with
VC9, setlocale in the miranda32 will do nothing as different RTLs are used.

Original comment by borkra on 2 Sep 2009 at 4:25

GoogleCodeExporter commented 9 years ago

Oh, I see the discussion is becoming a little too theoretical.

2George. It's not utf8, it's the Windows native encoding, that is, 16-bit
Unicode as it was standardized at the time Windows NT was created. Then it was
strictly 16-bit. (Note that now Unicode isn't 16-bit only!) The encoding is
almost identical to UTF-16 (except for its "private area" of 2048 16-bit
values). It may look similar to utf-8, but it's not the same. If needed, we can
discuss it in private (mikekaganski@gmail.com).

2Boris. Well, you are partially right. Unicode was designed to contain symbols
and encode symbols, and locales exist to help compare etc. But! Any locale
defines a strict case-insensitive comparison. You know, if you use a locale,
any text containing any Unicode symbol can be unambiguously compared to any
other text. The problem discussed here is that _if_we_use_different_locales_,
we cannot get unambiguous results on different machines. You say it's
impossible to make such a comparison unambiguous universally. You are wrong.
It's enough to invent a universal "locale" and demand using it in the case of
SOME SPECIFIC texts (as the jids surely are). And there IS such a "locale" (you
may think of a STRINGPREP profile as a kind of locale). It is specified in RFC
3920 and MUST be used to handle jids as stated in the standard. And it's not
the same as what the Windows CompareString API uses (though it could be
possible for MS to create some Windows locale ID to tell CompareString to make
this comparison according to this RFC, but they don't care about XMPP). MS has
some API to handle Stringprep and IDN, but it's distinct from the API you use
in Miranda. A simple search for "idn" on microsoft.com brings up some examples.
I don't know how to use that API, but its very existence means a lot.
Meanwhile, it's enough to use some technique to avoid the "C" locale in
internationalized string processing (either by using setlocale() in each
module, or by using functions that don't rely on C locales, or some other
approach).

It's strange how the discussion goes. It seems like the only thing you guys do
is try to convince me that Miranda is OK and I'm wrong. Hey, I don't want to
say something bad about Miranda! You make great software; I tested many
programs to choose one to use in my company and I chose Miranda as the best
one.
But when I say there's a problem (and not just show the problem, but also spend
time to debug it and try to fix it, and thus try to help you) you say "you're
doing smth wrong". When I use my right to recompile the code to fit my needs,
you say it breaks the software (if so, it's the software's fault! if it relies
on some presumptions such as the same compiler, then that should be stated
explicitly, or be avoided as much as possible). When I say the function
compares texts incorrectly and show the example, you say "compiler uses wrong
encoding" or start arguing about the use of ANOTHER function (the one I
originally proposed to use!). What's wrong? I personally have already fixed all
the problems in my private build of Miranda that I use for business (by the
way, I recompiled the Miranda main module, too, and this alone hasn't fixed the
locale problem). I try to make Miranda better, as no one person is able to make
a perfect program of Miranda's scale by himself. Please don't take offence, I
just want to help.

Sorry for offtopic, if it makes sense we could discuss this problem in private.

Original comment by mikekaga...@gmail.com on 2 Sep 2009 at 10:20

GoogleCodeExporter commented 9 years ago

This is what is shown in my environment. It's clearly visible that the codes of
the characters are Unicode codes, not utf-8 encoding.

Original comment by mikekaga...@gmail.com on 2 Sep 2009 at 11:39

Attachments:

debug.PNG

GoogleCodeExporter commented 9 years ago

And you are wrong, stop recompiling Jabber and your problem will go away.

Original comment by borkra on 3 Sep 2009 at 3:13

GoogleCodeExporter commented 9 years ago

Ok, let's conclude:
1. jabber.dll compares strings correctly if it's compiled with the same version
of the rtl that Miranda uses, and they both use the dynamic rtl, so it's
mandatory to recompile them together if a recompilation is needed (and this may
apply to other plugins/protocols that ship with Miranda).
2. If you write your own plugin for Miranda, you must either manually make sure
you use the proper locale, or require compiling with Miranda to make use of
Miranda's locale settings.
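
For point 2, one way a plugin could "manually make sure" of its locale is a
small guard object. The sketch below is an assumption made for illustration,
not code from Miranda: ScopedLocale is a made-up name, it uses only the
standard setlocale call, and it does not deal with the fact that setlocale
changes state shared by the whole runtime instance, so other threads see the
change too:

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: switches a locale category for the lifetime of the
// object and restores the previous setting when it goes out of scope.
class ScopedLocale {
public:
    ScopedLocale(int category, const char* name) : category_(category) {
        const char* old = std::setlocale(category, nullptr);  // query only
        if (old) saved_ = old;  // copy it, the returned buffer may be reused
        changed_ = std::setlocale(category, name) != nullptr;
    }
    ~ScopedLocale() {
        if (changed_) std::setlocale(category_, saved_.c_str());
    }
    ScopedLocale(const ScopedLocale&) = delete;
    ScopedLocale& operator=(const ScopedLocale&) = delete;

    // False if the runtime rejected the requested locale name.
    bool changed() const { return changed_; }

private:
    int category_;
    std::string saved_;
    bool changed_ = false;
};
```

A plugin would create the guard (or simply call setlocale(LC_ALL, "") once)
inside its own module, for example ScopedLocale guard(LC_CTYPE, ""); around the
comparison code, because a module linked against a different rtl does not see
the locale set by the main executable.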

Original comment by mikekaga...@gmail.com on 7 Sep 2009 at 12:10

ValentijnNK / miranda

Duplicate contacts of a jid containing non-English letters #188