DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform
Other
2.64k stars 560 forks source link

non-ascii strings in DR API (was "ASSERT((unsigned short)(*wstr) <= UCHAR_MAX) in our_vsnprintf") #943

Open derekbruening opened 9 years ago

derekbruening commented 9 years ago

From zhao...@google.com on October 09, 2012 16:57:53

~/Workspace/DynamoRIO/builds/build_x86_dbg.svn/bin32/drrun.exe -debug -ops "-msgbox_mask 0xf" -- ./unit_tests.exe --gtest_filter=IEImporterTest.IEImporter

int TNAME(our_vsnprintf)(TCHAR s, size_t max, const TCHAR fmt, va_list ap)

        /* insert the actual str representation */
        if (wstr != NULL) {
            while (*wstr) {
                if (max > 0 && (size_t)(s - start) >= max)
                    goto max_reached;
                if ((*c == _T('s') || *c == _T('S')) && decimal == 0)
                    break;  /* check string precision */
                decimal--;
                /* we only support ascii */
                ASSERT((unsigned short)(*wstr) \<= UCHAR_MAX);
                *s = (TCHAR) *wstr;
                s++;
                wstr++;
            }
        } else {

0:000> dt wstr Local var @ 0x17f5e8bc Type unsigned short* 0x0627c30e -> 0x4e2d

Original issue: http://code.google.com/p/dynamorio/issues/detail?id=943


Breakdown of sub-tasks:

derekbruening commented 9 years ago

From bruen...@google.com on October 09, 2012 14:00:20

what is being printed? (paste the callstack)

derekbruening commented 9 years ago

From zhao...@google.com on October 09, 2012 14:19:22

0:000> dt buf Local var @ 0x2066e980 Type char[] [260] "\??\C:\src\tmp\usr\scoped_dir13248_13524\Favorites\a\J???"

0:000> dw 0x06212360 06212360 005c 003f 003f 005c 0043 003a 005c 0073 06212370 0072 0063 005c 0074 006d 0070 005c 0075 06212380 0073 0072 005c 0073 0063 006f 0070 0065 06212390 0064 005f 0064 0069 0072 0031 0033 0032 062123a0 0034 0038 005f 0031 0033 0035 0032 0034 062123b0 005c 0046 0061 0076 006f 0072 0069 0074 062123c0 0065 0073 005c 0061 005c 4e2d 6587 002e

D:\src\cygwin\home\zhaoqin\Workspace>ls C:\src\tmp\usr\scoped_dir13248_13524\Favorites\a ??.url

So it tries to open an file whose name is not ascii:

From emacs: c:/src/tmp/usr/scoped_dir13248_13524/Favorites/a: total used in directory 1 available 801100500 drwxrwxrwx 1 zhaoqin Domain Users 0 10-09 17:02 . drwxrwxrwx 1 zhaoqin Domain Users 0 10-09 17:04 .. -rw-rw-rw- 1 zhaoqin Domain Users 160 10-09 17:02 ea91~1.url

00000000: 2020 633a 2f73 7263 2f74 6d70 2f75 7372 c:/src/tmp/usr 00000010: 2f73 636f 7065 645f 6469 7231 3332 3438 /scoped_dir13248 00000020: 5f31 3335 3234 2f46 6176 6f72 6974 6573 _13524/Favorites 00000030: 2f61 3a0d 0a20 2074 6f74 616c 2075 7365 /a:.. total use 00000040: 6420 696e 2064 6972 6563 746f 7279 2031 d in directory 1 00000050: 2061 7661 696c 6162 6c65 2038 3031 3130 available 80110 00000060: 3035 3030 0d0a 2020 6472 7778 7277 7872 0500.. drwxrwxr 00000070: 7778 2020 3120 7a68 616f 7169 6e20 446f wx 1 zhaoqin Do 00000080: 6d61 696e 2055 7365 7273 2020 2030 2031 main Users 0 1 00000090: 302d 3039 2031 373a 3032 202e 0d0a 2020 0-09 17:02 ...
000000a0: 6472 7778 7277 7872 7778 2020 3120 7a68 drwxrwxrwx 1 zh 000000b0: 616f 7169 6e20 446f 6d61 696e 2055 7365 aoqin Domain Use 000000c0: 7273 2020 2030 2031 302d 3039 2031 373a rs 0 10-09 17: 000000d0: 3034 202e 2e0d 0a20 202d 7277 2d72 772d 04 .... -rw-rw- 000000e0: 7277 2d20 2031 207a 6861 6f71 696e 2044 rw- 1 zhaoqin D 000000f0: 6f6d 6169 6e20 5573 6572 7320 3136 3020 omain Users 160 00000100: 3130 2d30 3920 3137 3a30 3220 6561 3931 10-09 17:02 ea91 00000110: 7e31 2e75 726c 0d0a ~1.url..

derekbruening commented 9 years ago

From bruen...@google.com on October 09, 2012 14:46:09

but why does DR care about this file? what's the DR callstack? just some debug diagnostic on file tracking?

derekbruening commented 9 years ago

From zhao...@google.com on October 09, 2012 14:48:43

It is for logging purpose: in presys_OpenFile /* not always null-terminated */ _snprintf(buf, MIN(obj->ObjectName->Length/sizeof(obj->ObjectName->Buffer[0]), BUFFER_SIZE_ELEMENTS(buf)), "%S", name); NULL_TERMINATE_BUFFER(buf); LOG(THREAD, LOG_SYSCALLS|LOG_VMAREAS, 2, "syscall: NtOpenFile %s\n", buf);

derekbruening commented 9 years ago

From bruen...@google.com on October 15, 2012 08:49:22

we should discuss what our API plan is for non-ascii strings as we do have some in various paths. e.g.:

probably we should stick to char* everywhere and document that it's UTF-8. then we should use MBCS conversion from Windows kernel wide strings. we should check whether we can just use ntdll's RtlUnicodeToMultiByteN (for UTF-8) at earliest injection time. prior to my change to use our own ascii-only %S we were using ntdll's, and it doesn't do what we wanted anyway, so there's no regression.

derekbruening commented 9 years ago

From rnk@google.com on October 15, 2012 10:19:52

I agree with the UTF-8 plan, but I'm wondering if the encoding wouldn't be so hard to do without ntdll, if that proves difficult. Are wchar_t * strings always UTF-16, or can they be something else? I can't find a straight answer.

If it's always UTF-16, it shouldn't be too hard to write something that just does codepoint conversions and ignores subtle issues like case, whitespace, numbers, and normalization.

derekbruening commented 9 years ago

From bruen...@google.com on October 15, 2012 16:30:27

xref https://code.google.com/p/drmemory/issues/detail?id=1040

derekbruening commented 9 years ago

From bruen...@google.com on October 15, 2012 20:07:36

seems like RtlUnicodeToMultiByteSize() runs, but RtlUnicodeToMultiByteN() crashes, as does RtlInitCodePageTable() (so can't try RtlUnicodeToCustomCPN): but the fact the Size() seems to work is weird so this may be worth further investigation than my quick test

derekbruening commented 9 years ago

From zhao...@google.com on October 22, 2012 13:48:54

Owner: rnk@google.com

derekbruening commented 9 years ago

From bruen...@google.com on November 08, 2012 22:16:49

I started looking at this as drmem issue #1040 is on the list for the next release and it requires this. having our own utf8 <-> utf16 converter is pretty simple (already done). beyond that we need to use utf16 in the drinject and drconfig libraries and in the drdeploy tools. there will be some double conversion on windows, from utf16 from the OS to utf8 for DR and then inside DR to utf16 again to pass to the OS.

Summary: non-ascii strings in DR API (was "ASSERT((unsigned short)(_wstr) <= UCHAR_MAX) in our_vsnprintf")
_Owner:* bruen...@google.com

derekbruening commented 9 years ago

From bruen...@google.com on December 19, 2012 14:34:41

in r1702 :

UTF-8 support in the core

Still remaining for this issue is to switch to the wide versions of Windows library routines in drinjectlib, drconfiglib, drdeploy, and drsyms.

derekbruening commented 9 years ago

From bruen...@google.com on January 23, 2013 07:40:50

\ TODO tolower() assumes ascii

there's an implementation in:

Since all of our tolower uses are really for case-insensitive matching, we should be able to do locale-independent case folding via http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

derekbruening commented 9 years ago

From bruen...@google.com on February 11, 2013 08:13:25

\ TODO drsyms

we'll have to tweak which interface we use, b/c dbghelp doesn't have SymLoadModule64W:

% dumpbin /exports c:/windows/system32/dbghelp.dll | grep SymLoadModule 135 86 000185B8 SymLoadModule 136 87 000185B8 SymLoadModule64 137 88 00018480 SymLoadModuleEx 138 89 00018568 SymLoadModuleExW

but SymLoadModuleEx looks like a superset.

derekbruening commented 9 years ago

From bruen...@google.com on February 11, 2013 11:11:41

SymLoadModuleExW is not in VS2005 dbghelp.lib

derekbruening commented 9 years ago

From rnk@google.com on February 21, 2013 08:39:54

Action item: switch drsyms from Sym_A to Sym_W now that we have dbghelp_imports.lib

derekbruening commented 9 years ago

From bruen...@google.com on March 14, 2013 08:27:24

r1928 did the SymLoadModuleExW swith, but issue #1085 remains.

for this case, summarizing the action items left:

Labels: Hotlist-Release

derekbruening commented 9 years ago

From bruen...@google.com on November 04, 2013 13:21:51

\ TODO drdeploy, drconfiglib, and drinjectlib

drdeploy.c needs GetEnvironmentVariableA, _access(), main(), etc. all converted to unicode. should do in same fashion as drmem frontend: keep cross-platform via macros I suppose.

dr_config.c calls GetEnvironmentVariableA, SetEnvironmentVariableA, CreateDirectoryA. convert_to_tchar() needs to use MultiByteToWideChar() + need to ensure all sources are UTF-8.

injector.c:

ifdef UNICODE

error dr_inject.h and drdeploy.c not set up for unicde

endif

there will be some double conversion on windows, from utf16 from the OS to utf8 for DR and then inside DR to utf16 again to pass to the OS.

should share w/ drmem frontend: create new lib? => issue #1079

derekbruening commented 9 years ago

From bruen...@google.com on August 28, 2014 14:30:02

xref issue #1530 : drfrontendlib now needs some work

derekbruening commented 9 years ago

\ TODO DrMem has tolower() calls

We should replace tolower() calls in text_matches_pattern() and strcasestr() with the same soln we use for DR: ideally our own locale-independent case folding.

derekbruening commented 8 years ago

In commit 46cbef71027af7e3a279321560a7b0f876e25833

i#943 non-ascii: non-ascii username and non-ascii app path support

Fixes the drinjectlib API to handle non-ascii application path and
arguments.

Fixes drdeploy's tool file and other (non-config) file manipulation to
handle non-ascii paths.

Fixes drdeploy.c to use _tmain and handle non-ascii parameters.

Fixes drconfiglib's GetEnvironmentVariableA, SetEnvironmentVariableA,
CreateDirectoryA, convert_to_tchar, and fopen to handle non-ascii strings.

This means that non-ascii usernames and non-ascii app paths now work.
However, a non-ascii DynamoRIO path will not yet work.

For the remaining work (we also have the tolower and hashtable case issues too):

*\ TODO TSTR_FMT needs to be removed and all uses replaced, or use our_snprintf

dr_config.c is using Windows _snprintf with %S/%ls to convert to/from utf16, which is broken and does not work.

We either need to have it use io.c (requires refactoring to let io.c pull in just types from globals.h w/o other stuff; maybe make a static lib for it), or do explicit conversion everywhere.

*\ TODO injection is calling LoadLibraryA with a utf8 path

Thread injection will be loading a utf8 dll path read from the config file and passing it to LoadLibraryA.

Can we afford to call our conversion at inject time? If we switch to storing utf16 in the config file we'll have to change a lot of code that's passing char*.

derekbruening commented 8 years ago

The TSTR_FMT uses in dr_config.c were fixed in 639b429

A bug in full_path of the exe was fixed in af45400

Thus a non-ascii username, non-ascii app path, and non-ascii client options (like DrM logdir) all work now.

The biggest thing that does not work is a non-ascii DR path. Other minor things remaining are the tolower calls in DR, drcontainers, and DrM.

derekbruening commented 8 years ago

drcov2lcov does not handle non-ascii as it uses FindFirstFileA.