Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

codecvt::in() incorrectly handles partial multibyte inputs #13837

Open Quuxplusone opened 12 years ago

Quuxplusone commented 12 years ago
Bugzilla Link PR13759
Status NEW
Importance P normal
Reported by Philip Puryear (philippuryear@gmail.com)
Reported on 2012-09-04 00:14:37 -0700
Last modified on 2015-09-27 13:01:31 -0700
Version unspecified
Hardware All All
CC kariya_mitsuru@hotmail.com, llvm-bugs@lists.llvm.org, mclow.lists@gmail.com
Fixed by commit(s)
Attachments test.cc (846 bytes, text/x-c++src)
Blocks
Blocked by
See also
Created attachment 9151
testcase

When supplied with a partial multibyte input (i.e. a multibyte prefix),
codecvt::in() returns codecvt_base::error instead of codecvt_base::partial.

Steps to reproduce:
1. Compile the attached testcase with e.g.
       $ clang -std=c++11 -stdlib=libc++ -o test test.cc
   and run it.

Expected results: The testcase should print "partial".
Actual results: The testcase prints "error".

Additional notes:
I've reproduced this bug on both my Mac and Linux systems using the above
testcase, which calls codecvt::in() on the first byte of the three-byte UTF-8
string "ℝ".

The C++11 standard (22.4.1.4.2) seems to imply that codecvt_base::partial
should be returned, since "additional source elements are needed before another
destination element can be produced". However, in the above testcase, libstdc++
4.7 on my Linux machine returns codecvt_base::ok, so I'm not sure what the
correct behavior is.

The code responsible (locale.cpp:1386) seems to have some inverted logic. In
particular,
> n = mbsnrtowcs(...);
> if (n == size_t(-1))
>     ...
> if (n == 0)
>     return error;
seems wrong, since mbsnrtowcs() returns -1 on error, not 0 (which simply
indicates that the length of the wide output string is 0).

I'm *fairly* sure this is a bug, however I'm kind of poking around in the dark
here, so please correct me if this is the intended behavior.
Quuxplusone commented 12 years ago

Attached test.cc (846 bytes, text/x-c++src): testcase

Quuxplusone commented 9 years ago
committed revision 233012 to fix this problem.

I'm going to leave this bug open, though, until I figure what the correct
behavior is.
The fix merely deals with the error correctly.
Quuxplusone commented 8 years ago

As pointed out above, I think that it should return partial, not ok.

cf. http://melpon.org/wandbox/permlink/8Jo9s9qODgywSmtP

This behavior causes PR24929.

Additionally, if "from" ends with an incomplete multibyte sequence followed by a null byte, "from_next" should point the first byte of an incomplete multibyte sequence, not a null byte.

C++98 (or later) standard 22.2.1.5.2 codecvt virtual functions [lib.locale.codecvt.virtuals] p.2 says, "It always leaves the from_next and to_next pointers pointing one beyond the last element successfully converted."

cf. http://melpon.org/wandbox/permlink/7xsTdIqqmE1X3Adg

Quuxplusone commented 8 years ago
I think that do_in should behave as below.

====================================================================================
codecvt<wchar_t, char, mbstate_t>::result
codecvt<wchar_t, char, mbstate_t>::do_in(state_type& st,
    const extern_type* frm, const extern_type* frm_end, const extern_type*& frm_nxt,
    intern_type* to, intern_type* to_end, intern_type*& to_nxt) const {
    frm_nxt = frm;
    to_nxt = to;
    if (frm == frm_end) {
        mbstate_t save_st = st;
        return __mbrtowc_l(NULL, NULL, 0, &save_st, __l) == 0 ? ok : partial;
    }
    do {
        if (to_nxt == to_end)
            return partial;
        size_t n = __mbrtowc_l(to_nxt, frm_nxt, frm_end - frm_nxt, &st, __l);
        switch (n) {
        case size_t(-1):
            return error;
        case size_t(-2):
            frm_nxt = frm_end;
            return partial;
        case 0:
            ++frm_nxt;
            break;
        default:
            frm_nxt += n;
            break;
        }
        ++to_nxt;
    } while (frm_nxt != frm_end);
    return ok;
}
====================================================================================