Null-terminated strings whose maximum length crosses page boundaries

GoogleCodeExporter commented 8 years ago

Hey guys, 

Regarding objects that cross page boundaries (and one or more pages are not 
memory resident)...

I have this structure defined:

'_IMAGE_EXPORT_DIRECTORY': [ 0x28, {
        'Base': [ 0x10, ['unsigned int']],
        'NumberOfFunctions': [ 0x14, ['unsigned int']], 
        'NumberOfNames': [ 0x18, ['unsigned int']], 
        'AddressOfFunctions': [ 0x1C, ['unsigned int']], 
        'AddressOfNames': [ 0x20, ['unsigned int']], 
        'AddressOfNameOrdinals': [ 0x24, ['unsigned int']], 
    } ],

To parse the EAT of some binaries, I do:

exp_dir = obj.Object('_IMAGE_EXPORT_DIRECTORY', vm = self.obj_vm, offset = 
self.DllBase + exp_addr)

If exp_dir is not None, I begin using the structure members, but ran into this 
error:

Invalid Address 0x00BE4000, instantiating AddressOfNames

It turns out, NumberOfFunctions exists at 0x00BE3FFC while NumberOfNames exists 
at 0x00BE4000, and that second address is not memory resident. 

So the question is – should obj.Object return None if the object crosses page 
boundaries and only some of the pages are available? 

Gleeda suggested a safe/unsafe option which we could use to manipulate how 
obj.Object behaves in these cases. 

AW has seen things like this before, usually when working with LDR_MODULES, and 
just asked that I enter this here for tracking and discussion. 

Thoughts?

Original issue reported on code.google.com by michael.hale@gmail.com on 7 Jan 2011 at 2:07

GoogleCodeExporter commented 8 years ago

Hmmmm, I think an object should return as much information as possible, so if 
NumberOfFunctions is there, then _IMAGE_EXPORT_DIRECTORY should allow access to 
it.  

An interesting question is, if the first half of a structure is gone, should 
you be able to instantiate an object at an invalid address, and read the tail 
half of it's members?

At the moment, if an object is created using obj.Object(), it'll return a 
NoneObject, but if the object is instantiated directly, it'll raise an 
InvalidOffsetError.  If the CType created by obj.Object() is valid, you won't 
get a NoneObject, but when you then ask for a member, it'll be instantiated 
directly at an invalid address and raise the exception.

So we've got several options:

1) Leave it as is
2) Catch InvalidOffsetErrors in the CType member lookup and convert them to an 
AttributeError
3) Catch InvalidOffsetErrors in the CType member lookup and return NoneObjects
4) Force the CType member lookup to use obj.Object when returning members
5) Have obj.Object return a NoneObject if any of the addresses within 
range(offset, offset+structsize) are invalid
6) Something else

So some obversations on these:

1) Not very helpful
2) Changing one error for another
3) What NoneObjects were designed for, but sometimes difficult to spot.  Also 
changes invalid offset handling from exception handling to explicit checking 
for NoneObjects.
4) Possibly a good idea, but needs discussing with scudette, because it may 
turn our pre-compiled vtype language into an interpretted language
5) May cause confusion if it isn't recursive and recursive checking will likely 
invalidate most objects.  Will likely slow down object creation for large 
objects.
6) I dunno what it'd be

So, 3 & 5 look the most viable options, 4 would be a big shift, and 6 has the 
possibility of being a brilliant solution!  5:)  It would probably be worth 
CCing in some of the others to get their opinion...

Original comment by mike.auty@gmail.com on 7 Jan 2011 at 2:35

GoogleCodeExporter commented 8 years ago

maybe this doesn't have anything to do with it, but the intel address space at 
least supports read and zread which have different behaviors when an unmapped 
page is hit:

http://code.google.com/p/volatility/source/browse/branches/Volatility-1.4_rc1/vo
latility/plugins/addrspaces/intel.py?r=584#303

Original comment by atc...@gmail.com on 9 Jan 2011 at 1:53

GoogleCodeExporter commented 8 years ago

Original comment by mike.auty@gmail.com on 21 Jan 2011 at 8:26

Added labels: Milestone-Release1.5

GoogleCodeExporter commented 8 years ago

Sorry for the bugspam, but better to get this right now than later once it's 
more in use.

Original comment by mike.auty@gmail.com on 4 Feb 2011 at 9:35

Added labels: Milestone-1.5.x
Removed labels: Milestone-Release1.5

GoogleCodeExporter commented 8 years ago

Just a note that we'll also have to handle the case of strings, or similar, 
which span pages.  Most likely we'll need to figure out the size of the struct, 
stick the relevant pages together and then operate on that.  Might take a 
little work...

Original comment by mike.auty@gmail.com on 21 Jul 2011 at 11:07

GoogleCodeExporter commented 8 years ago

Original comment by mike.auty@gmail.com on 26 Jul 2011 at 6:45

GoogleCodeExporter commented 8 years ago

Original comment by jamie.l...@gmail.com on 26 Jul 2011 at 11:40

GoogleCodeExporter commented 8 years ago

Hey guys, I'd like to expand on what Ikelos said about handling strings. In 
particular, a structure member that points to a unicode string:

'_SERVICE_RECORD' : [ None, {
    'ServiceList' : [ 0x0, ['_SERVICE_LIST_ENTRY']],
    'ServiceName' : [ 0x8, ['pointer', ['UnicodeString', dict(length = 256 * 2)]]],
    'DisplayName' : [ 0xc, ['pointer', ['UnicodeString', dict(length = 256 * 2)]]],

I subclassed basic.String to create the UnicodeString type. It looks like the 
following (differs from basic.String only in the split statement that uses 
\x00\x00 instead of \x00

class UnicodeString(basic.String):

    def __str__(self):
        data = self.v()
        result = data.split("\x00\x00")[0]
        if not result:
            return ""
        return result

According to MSDN 
(http://msdn.microsoft.com/en-us/library/ms682450%28VS.85%29.aspx) the 
ServiceName and DislayName can be max 256 characters (so 512 bytes).

When testing a plugin, some of the ServieName and DisplayName fields were 
coming up blank, for example:

$ python vol.py -f tdl3.vmem svcscan

....

Offset: 0x82aec0
Order: 258
Process ID: -
Service Name: <----------------------------- this should not be blank!!
Display Name: Network Provisioning Service
Service Type: SERVICE_WIN32_SHARE_PROCESS
Service State: SERVICE_STOPPED
Binary Path: -

I wondered why it was blank and went to check it out in volshell. The base of 
the _SERVICE_RECORD structure is 0x82aec0:

In [2]: dd(0x82aec0)
0082aec0  0082ae30 00000000 0082af34 000b45b8
0082aed0  00000102 00000000 76724573 00000000
0082aee0  00000000 00000000 00000020 00000001

And so 0082af34 is the ServiceName field. 

In [4]: db(0x0082af34)
0x0082af34  78 00 6d 00 6c 00 70 00 72 00 6f 00 76 00 00 00   x.m.l.p.r.o.v...
0x0082af44  00 00 00 00 17 32 12 00 00 10 00 00 78 01 82 00   .....2......x...
0x0082af54  78 01 82 00 00 00 00 00 00 00 00 00 00 00 00 00   x...............

So the name is xmlprov. The problem is we can't read the whole 256 * 2 = 512 
bytes:

In [5]: db(0x0082af34, length=512)
Memory unreadable at 0082af34

It turns out, we can't read past 0x0082afff (page cuts off at 0x0082b000). 

Just to make sure the memory is allocated, I checked with vadinfo. 

VAD node @814f29a0 Start 00820000 End 0084ffff Tag VadS
Flags: PrivateMemory
Commit Charge: 36 Protection: 4

In this case I suppose I'd like the UnicodeString to contain as much of the 
string as possible before crossing the page boundary. Since "xmlprov" is much 
smaller than 256 chars anyway, I'd really be getting the full string...but of 
course that won't be the case always (especially if we're talking about 
structures crossing page boundaries like in the original issue here). 

Anyway, just wanted to document this, and CC some others.

Original comment by michael.hale@gmail.com on 1 Oct 2011 at 8:02

GoogleCodeExporter commented 8 years ago

This is still occurring even after the page translation patch.  It doesn't 
happen often, but here we can see it happening with a registry key:

$ ./vol.py -f memory.img --profile=Win7SP0x86 printkey -K 
'ControlSet001\Control\WMI\Autologger\EventLog-System'
Volatile Systems Volatility Framework 2.1_alpha
Legend: (S) = Stable   (V) = Volatile

----------------------------
Registry: \REGISTRY\MACHINE\SYSTEM
Key name: EventLog-System (S) 
Last updated: 2009-07-14 04:37:09 

Subkeys:
  (S) {01979c6a-42fa-414c-b8aa-eee2c8202018}
  (S) {04268430-d489-424d-b914-0cff741d6684}
  (S) {06edcfeb-0fd0-4e53-acca-a6f8bbf81bcb}
  (S) {0c478c5b-0351-41b1-8c58-4a6737da32e3}
Traceback (most recent call last):
  File "./vol.py", line 135, in <module>
    main()
  File "./vol.py", line 126, in main
    command.execute()
  File "volatility/commands.py", line 101, in execute
    func(outfd, data)
  File "volatility/plugins/registry/printkey.py", line 101, in render_text
    for s in rawreg.subkeys(key):
  File "volatility/win32/rawreg.py", line 120, in subkeys
    if i.Signature.v() == NK_SIG:
  File "volatility/obj.py", line 486, in v
    (val,) = struct.unpack(self.format_string, data)
struct.error: unpack requires a string argument of length 2

Ikelos looked at this before the patch, this is what he had to say about it 
then:

"This appears to be a variant on issue 57, and not something we can fix
in the next five days.  Basically, the two character data block falls
across the very end of a page (so Signature.obj_offset is 0x#####fff).
The next byte is on a different page, and I guess that this can't even
read another byte after its location.

Really, we should be determining the length of each value, ensuring it's
all within the page, and if not then we need to reconstruct the
consecutive pages, and work on those.

If one of the pages isn't resident then we probably ought to throw a
page fault exception, but that's up for debate (basically, do we think
plugin devs will spend the time to always check for page faults)..."

It is still exhibiting this behavior after the patch, however:

$ svn info
[snip]
Revision: 1153
[snip]

Original comment by jamie.l...@gmail.com on 15 Dec 2011 at 7:52

GoogleCodeExporter commented 8 years ago

also see it in Revision 1159

Original comment by jamie.l...@gmail.com on 15 Dec 2011 at 8:59

GoogleCodeExporter commented 8 years ago

Ok, so here's a patch that vastly improves the transparency of NoneObjects, so 
that they can be used as pretty much any object and the result should always be 
the NoneObject.  Specifically, we can now concatenate with strings and the 
error will follow through.

I've also tried to ensure that intel/legacyintel/crash/hibernate all return 
NoneObject if there's a problem reading.  Please give the patch a go with any 
of the images exhibiting exceptions related to this and let me know if there's 
any further problems...

Original comment by mike.auty@gmail.com on 17 Dec 2011 at 2:04

Attachments:

volatility-issue-57.patch

GoogleCodeExporter commented 8 years ago

This issue was closed by revision r1164.

Original comment by mike.auty@gmail.com on 19 Dec 2011 at 11:02

Changed state: Fixed

GoogleCodeExporter commented 8 years ago

Reopening this issue. Ikelos and I have decided that we'll just have to deal 
with structures crossing page boundaries for now (until we have a way of 
marking individual members as valid or invalid). However as for strings that 
cross page boundaries, Ikelos has come up with the attached patch which I think 
is good but need to test more.

Original comment by michael.hale@gmail.com on 2 Feb 2012 at 2:00

Changed state: Started

Attachments:

cstring.patch

GoogleCodeExporter commented 8 years ago

Hey guys, 

This is a patch that introduces two new string types to basic.py (placement 
here so they're available to all profiles, including linux, osx, etc). The 
point is to treat C strings (null terminated) in a manner more consistent with 
how OS's actually deal with them. 

Currently, our String class requires a length argument. If the string starts 
100 bytes from the end of a page, the next page is unavailable, and the 
string's length argument is 200 bytes, then the read operation on 200 bytes 
will fail and we'll get nothing (obj.NoneObject). This is bad because although 
the maximum length is 200 bytes, the string may only contain 5 characters with 
a terminating null. Instead of returning those 5 chars, we return 
obj.NoneObject. 

So the new string classes will not override or replace the existing String, 
rather they'll just provide an alternate choice for plugins that may need to 
read strings that cross page boundaries. In comment #8 above I showed an 
example where a service name was showing up none instead of "xmlprov". By using 
CString instead of String now that is fixed and I see "xmlprov" as the OS would 
see.

Original comment by michael.hale@gmail.com on 21 Feb 2012 at 3:12

Attachments:

volatility-cstring.patch

GoogleCodeExporter commented 8 years ago

Just a small comment. If CString is supposed to be null terminated, this is 
faster:

result = self.obj_vm.zread(self.obj_offset, i).split("\x00", 1)[0]

It will take up to the first NULL (or the length of the string if it has no 
NULLs. I think zread will null pad invalid pages already, which will cause this 
to only read up to the invalid page.

I think this is a good patch, but should be applied to String and 
UnicodeString. Its generally impossible to predict in advance if you are likely 
to run into this situation just so you can use a special variant of String for 
this purpose.

Its almost always the case that reading into invalid pages is surprising, so it 
is better to just truncate the read (which this will do). Optionally a log 
message might help if you are really concerned.

Original comment by scude...@gmail.com on 21 Feb 2012 at 3:25

GoogleCodeExporter commented 8 years ago

I agree with scudette, zread is definitely the better option, and could be 
applied to the standard String class.  Before that can happen, however, we need 
to ensure that BaseAddressSpace defines zread, and that it's then populated for 
all existing address spaces.

Also, we should ensure there's a "\x00" at the end of the read-in string, so 
that we guarantee the split will work...

Original comment by mike.auty@gmail.com on 21 Feb 2012 at 9:49

GoogleCodeExporter commented 8 years ago

You do not need to add a null to the end. The split will work even without it.

Original comment by scude...@gmail.com on 21 Feb 2012 at 10:01

GoogleCodeExporter commented 8 years ago

Fair enough

Original comment by mike.auty@gmail.com on 21 Feb 2012 at 10:05

GoogleCodeExporter commented 8 years ago

Hey guys, 

Here's patch draft #2. Things that have changed: 

* No more separate classes for CString and UnicodeCString. The changes have 
been applied to String so that anything already using String types will 
automatically benefit. 

* Changed the name of UnicodeCString to just UnicodeString. Now the 
_UNICODE_STRING type instantiates UnicodeString instead of String. 

* The old check in String.__str__ to make sure a string was null terminated has 
been removed, since the String.v() now ensures null termination. 

* As you will see in the code, the one remaining issue is when UnicodeString is 
used for LPWSTRs and not _UNICODE_STRINGs. The difference is that 
_UNICODE_STRING.Length specifies the exact number of bytes (not including null 
termination) pointed to by the Buffer. Thus before we do the split("\x00\x00", 
1)[0] the string is already the right length. In LPWSTRs where we just have a 
maximum string length, and the least significant byte is zero (like it will be 
in English strings), then that least significant byte is chopped off when we do 
split("\x00\x00", 1)[0]. That causes the .decode().encode() statement in 
__str__ to not properly recognize the last character, and ultimately return a 
string with the last char missing. 

So I had to do this, which is obviously quite ugly:

## Hack to deal with English LPWSTR 
        if not result.endswith("\x00"):
            result += "\x00"

Anyone have an idea how to handle this better? Any other problems with the 
patch?

Original comment by michael.hale@gmail.com on 23 Feb 2012 at 3:35

Added labels: Milestone-2.1.x
Removed labels: Milestone-1.5.x

Attachments:

volatility-cstring2.patch

GoogleCodeExporter commented 8 years ago

So this patch may work for the common cases, but it will still require the 
AddressSpace API adding the zread function.  At the moment many address spaces 
implement it, but they are not forced to, and it's entirely possible to get an 
address space that doesn't (which will then exception out if these are ever 
called against it).  We need to add a suitable default function to 
BaseAddressSpace, and then check that all of our address space implementations 
supply a concrete version of it.

As to the tail ending thing, if LPWSTRs are always in byte pairs, then don't do 
a find on the string, but split it up in a list of byte-pairs, and check if any 
of them match "\x00\x00".  That will then ensure we terminate on what we're 
supposed to, rather than mistaking "\xFF\x00\x00\xFF\x00\x00" as terminating 
early.

The UnicodeString class seems a little overkill for what amounts to checking 
for "\x00\x00" rather than "\x00".  Would it be possible to refactor the String 
class to take a terminator, and then simply override that?

I'll have a think about it more when I'm not as sleepy, please poke me on the 
weekend if I haven't said anything more by then...

Original comment by mike.auty@gmail.com on 23 Feb 2012 at 3:53

GoogleCodeExporter commented 8 years ago

Hi Michael,
   A couple of comments:

   data = self.v()
   if not data:
       return ""

   return data

Can be simplified to:

data = self.v() or ""
return data

Second issue is that this handling of unicode is incorrect on a number of 
levels. First this is using the backslashreplace error handler (in multiple 
places) which is lossy - in other words you _will not_ be able to get back to 
unicode from the output of this function. See for example

http://bytes.com/topic/python/answers/702688-reverse-encode-backslashreplace

IMHO you need to encode using the utf8 codec rather than ascii.

The second issue is that you are splitting on a double null in the UTF16 
encoded data. The string is a unicode string, so you need to decode it first, 
then split on a single null:

self.obj_vm.zread(self.obj_offset, self.length).decode("utf16-le", 
"ignore").split("\x00", 1)[0]

This will solve your hack regarding the english text. You will also return a 
unicode string from .v() which is what should happen - we should always be 
returning the correct types from .v() (e.g. we return an int from Interger.v() 
etc).

Original comment by scude...@gmail.com on 23 Feb 2012 at 4:01

GoogleCodeExporter commented 8 years ago

Ok, so here's a patch that adds zread as a required function to 
BaseAddressSpace, and includes basic implementations for any of the address 
spaces that didn't include one.

I couldn't come up with a suitable base implementation for zread that would 
generally apply, so it's currently just a virtual method that needs populating 
in BaseAddressSpace...

Original comment by mike.auty@gmail.com on 24 Feb 2012 at 8:36

Attachments:

volatility-zread.patch

GoogleCodeExporter commented 8 years ago

Seriously? Monkeypatching? Why dont we just delegate zread to read for the base 
class. Note that we have been using zread forever in the prodexedump etc 
commands. Also zread really only makes sense in the virtual address spaces 
because in the physical ones it doesnt really make sense to read outside the 
mapped range (it would mean outside the file).

I think its enough to simply delegate zread to read in the base class, for 
safety but in practice it will never be used.

Original comment by scude...@gmail.com on 24 Feb 2012 at 8:41

GoogleCodeExporter commented 8 years ago

Hey guys, 

Thanks for pursuing this issue with me ;-) 

OK so a few things...

Thanks for the data = self.v() or "" tip, I will incorporate that into the new 
patch. This line is also very useful:

self.obj_vm.zread(self.obj_offset, self.length).decode("utf16-le", 
"ignore").split("\x00", 1)[0]

I had to use "utf16" instead because it said "utf16-le" isn't a recognized 
decoder on my machine. The reason I used ascii/backslashreplace is because 
that's what windows._UNICODE_STRING was doing (see 
http://code.google.com/p/volatility/source/browse/trunk/volatility/plugins/overl
ays/windows/windows.py#162). We've known for a while that we need to fix some 
things for better unicode support, but I don't want to try and solve all 
problems at once here (although that'd be awesome ;-))

So the latest patch has some improvements - may not be perfect yet but should 
get us a little closer. The main difference is that UnicodeString.v() returns a 
Unicode string and UnicodeString.__str__() returns an ascii backslash string. 
This directly affects _UNICODE_STRING.v() and _UNICODE_STRING.__str__() in the 
same way. Note: the filescan/mutantscan plugins that just print FileName.v() 
will now output strings like u'filename' to the terminal. If we want the 
ascii/backslash, it should be changed to str(FileName) or str(FileName.v())

At least in this state, we've fixed the crossing page boundaries issue and can 
then focus on what needs to be done for better unicode encode/decode support. 

Ikelos, thanks for picking up the zread part of this issue. Without applying 
your volatility-zread.patch I would get this:

$ python vol.py -f ~/idt_sneaky.vmem printkey
Volatile Systems Volatility Framework 2.1_alpha
Legend: (S) = Stable   (V) = Volatile

----------------------------
Registry: \Device\HarddiskVolume1\Documents and Settings\LocalService\Local 
Settings\Application Data\Microsoft\Windows\UsrClass.dat
Traceback (most recent call last):
  File "vol.py", line 135, in <module>
    main()
  File "vol.py", line 126, in main
    command.execute()
  File "/Users/Michael/volatility_trunk/volatility/commands.py", line 101, in execute
    func(outfd, data)
  File "/Users/Michael/volatility_trunk/volatility/plugins/registry/printkey.py", line 97, in render_text
    outfd.write("Key name: {0} {1:3s}\n".format(key.Name, self.voltext(key)))
  File "/Users/Michael/volatility_trunk/volatility/plugins/overlays/basic.py", line 63, in __format__
    return format(self.__str__(), formatspec)
  File "/Users/Michael/volatility_trunk/volatility/plugins/overlays/basic.py", line 59, in __str__
    data = self.v() or ""
  File "/Users/Michael/volatility_trunk/volatility/plugins/overlays/basic.py", line 51, in v
    result = self.obj_vm.zread(self.obj_offset, self.length).split("\x00", 1)[0]
AttributeError: 'HiveAddressSpace' object has no attribute 'zread'

If all goes well, perhaps we could get the zread patch applied first (whether 
the existing patch or a new one that delegates zread to read) and the string 
patch second...which I think would be good to close this issue. Then we'll have 
to pursue the long awaited unicode encode/decode support problem.

Original comment by michael.hale@gmail.com on 24 Feb 2012 at 3:03

Attachments:

volatility-cstring3.patch

GoogleCodeExporter commented 8 years ago

Hiya, yep, no probs.

I can commit the zread patch pretty soon.  I'm happy to expand out the "zread = 
read" lines into the longer winded "def zread(self, addr, length):  return 
self.read(addr, length)", but they're effectively identical.  I've no idea why 
scudette thinks that's monkey patching, it's the defining the zread function to 
be identical to read, not changing the definition of existing code at run-time. 
 However, I'll go with whichever one stops people complaining at me...

I'm not going to delegate zread to read on the base class because it'll make it 
far too easy to write a new address space that doesn't fulfill the semantics 
it's supposed to.  If we wanted a free zread in the base class, I'd write a 
read-byte-by-byte that's guaranteed to return the correct value, but it'd be so 
inefficient that everyone would need to override it anyway.  Better to require 
AS writers to think about the problem and then delegate to read if it turns out 
that's ok.

Finally, if the UnicodeString is pretty much identical to String save for the 
decoding, can we not have that as a parameter, and then have the UnicodeString 
define a different decoding?  Is it possible to decode an ascii string and have 
it returned as ascii rather than unicode?

Original comment by mike.auty@gmail.com on 24 Feb 2012 at 9:39

GoogleCodeExporter commented 8 years ago

Nice work on getting the zread patch in Ikelos. So yeah I think its possible to 
have a String(encoding="utf16') parameter or something. I'd need probably a 
while longer to work up the patch and test it out - unless someone else wants 
to ;-)

Original comment by michael.hale@gmail.com on 25 Feb 2012 at 4:33

GoogleCodeExporter commented 8 years ago

So, were you thinking about something like this?  5:)

Original comment by mike.auty@gmail.com on 25 Feb 2012 at 5:06

Attachments:

volatility-string-encoding.patch

GoogleCodeExporter commented 8 years ago

What would we use for a C like String (i.e. a char array)? Would we be
then forced to define some kind of special CString class?  I thought
the original intention was to define a String as a char array and a
UnicodeString for actual text representation.  Also another idea is to
take the default encoding from the profile, then the windows Strings
will be utf16 while the linux ones will be utf8.

Michael.

Original comment by scude...@gmail.com on 25 Feb 2012 at 5:13

GoogleCodeExporter commented 8 years ago

This patch will corrupt C strings with high bit chars (i.e. not ascii due to 
the encode. We would need to introduce a CString to handle those cases where we 
really want a char array. Also this will fail if self.v() returns non ascii 
data (i.e. for other encoding like utf8):

+        return str(self.v())

Original comment by scude...@gmail.com on 25 Feb 2012 at 5:15

GoogleCodeExporter commented 8 years ago

Surely if we really want a char array, we'd instantiate it as an Array of char 
(or just read bytes from the AS)?

_UNICODE_STRING is currently defined in the windows profile, so you'd need to 
define something different for linux anyway.

Yeah, for some reason I thought the str would just automagically convert it, 
but I guess not, that will need some work, but I'm off out soon.  We should 
probably make a decision on how we expect output of a unicode string to appear 
after str()...

Original comment by mike.auty@gmail.com on 25 Feb 2012 at 5:34

GoogleCodeExporter commented 8 years ago

Yeah but the semantic meaning of C str is an array of char and we
expect to get from .v() "foo" rather than ["f", "o", "o"].
_UNICODE_STRING is a windows structure which contains a unicode string
and a length. The UnicodeString class is actually shared between all
operating systems, and represents a text string encoded in someway in
memory. IMHO String() and UnicodeString() should be different, maybe
name the first ByteString() or just Bytes().

Something like this should be used everywhere you need to output a
unicode object:

def SmartStr(string):
   """Enforce a string into a byte string."""
   if isinstance(string, unicode):
     return string.encode("utf8")
   else:
     return str(string)

then
def __str__(self):
    return SmartStr(self.v())

So in the unicode case we just get utf8 encoded output. Maybe we can
make the output encoding a flag or something to ensure this works well
on windows too. Its also possible to sniff the output encoding by
checking sys.stdout.encoding - if the program is connected to a
terminal of some sort it should be set correctly, otherwise we need to
default to something sensible.

Michael.

Original comment by scude...@gmail.com on 25 Feb 2012 at 5:50

GoogleCodeExporter commented 8 years ago

Original comment by mike.auty@gmail.com on 10 Mar 2012 at 7:30

Changed title: Null-terminated strings whose maximum length crosses page boundaries

GoogleCodeExporter commented 8 years ago

Original comment by mike.auty@gmail.com on 10 Mar 2012 at 11:37

Added labels: Task-Unicode

GoogleCodeExporter commented 8 years ago

Ok, well between r1500 for the scudette branch and r1563 for trunk, I'm going 
to mark this issue as fixed.

Original comment by mike.auty@gmail.com on 21 Mar 2012 at 11:42

Changed state: Fixed

Leor3961 / volatility

Null-terminated strings whose maximum length crosses page boundaries #57