angr / simuvex

[DEPRECATED] A symbolic execution engine for the VEX IR
BSD 2-Clause "Simplified" License
79 stars 57 forks source link

Add ctype locale funcs #89

Closed ekilmer closed 7 years ago

ekilmer commented 7 years ago

I ran into some locale functions in a program I was analyzing, so I wrote some implementations.

The lookup tables were taken from GLIBC 2.25:

I tried adding some tests to check that our implementation gives the same results as the test system, so I wrote some C files that use the functions and looped through their output to compare with ours. One caveat is that toupper and tolower return uint32_t and our implementation would just return a 1 byte per index value. I thought this would save on performance (maybe?) and simplify the implementation. The only place this is a concern is the result of index -1 here and here. This can be changed relatively easily if need be. In the test C programs, I explicitly take care of this and output a 1-byte result.

One thing that I haven't figured out is why the test_inspect.test_inspect is failing after my changes...

ltfish commented 7 years ago

Thank you! This is some good stuff!

We are all super busy with a deadline, so hopefully someone will come back and handle the PR this weekend.

One caveat is that toupper and tolower return uint32_t and our implementation would just return a 1 byte per index value. I thought this would save on performance (maybe?) and simplify the implementation

I didn't get it. Which toupper and tolower are you referring to?

One thing that I haven't figured out is why the test_inspect.test_inspect is failing after my changes...

Maybe it's time for us to debug it?

ekilmer commented 7 years ago

I didn't get it. Which toupper and tolower are you referring to?

When looking at the C source code in C-ctype.c, the definition for tolower is:

 238 const uint32_t _nl_C_LC_CTYPE_tolower[384] attribute_hidden =
 239 {
 240   /* 0x80 */ 0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87,
 241   /* 0x88 */ 0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f,
 242   /* 0x90 */ 0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97,
 243   /* 0x98 */ 0x98, 0x99, 0x9a, 0x9b, 0x9c, 0x9d, 0x9e, 0x9f,
 244   /* 0xa0 */ 0xa0, 0xa1, 0xa2, 0xa3, 0xa4, 0xa5, 0xa6, 0xa7,
 245   /* 0xa8 */ 0xa8, 0xa9, 0xaa, 0xab, 0xac, 0xad, 0xae, 0xaf,
 246   /* 0xb0 */ 0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xb5, 0xb6, 0xb7,
 247   /* 0xb8 */ 0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbe, 0xbf,
 248   /* 0xc0 */ 0xc0, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc7,
 249   /* 0xc8 */ 0xc8, 0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf,
 250   /* 0xd0 */ 0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6, 0xd7,
 251   /* 0xd8 */ 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde, 0xdf,
 252   /* 0xe0 */ 0xe0, 0xe1, 0xe2, 0xe3, 0xe4, 0xe5, 0xe6, 0xe7,
 253   /* 0xe8 */ 0xe8, 0xe9, 0xea, 0xeb, 0xec, 0xed, 0xee, 0xef,
 254   /* 0xf0 */ 0xf0, 0xf1, 0xf2, 0xf3, 0xf4, 0xf5, 0xf6, 0xf7,
 255   /* 0xf8 */ 0xf8, 0xf9, 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xffffffff,
 256   /* 0x00 */ 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
 257   /* 0x08 */ 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
 258   /* 0x10 */ 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
 259   /* 0x18 */ 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f,
 260   /* 0x20 */ 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27,
...
 283   /* 0xd8 */ 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde, 0xdf,
 284   /* 0xe0 */ 0xe0, 0xe1, 0xe2, 0xe3, 0xe4, 0xe5, 0xe6, 0xe7,
 285   /* 0xe8 */ 0xe8, 0xe9, 0xea, 0xeb, 0xec, 0xed, 0xee, 0xef,
 286   /* 0xf0 */ 0xf0, 0xf1, 0xf2, 0xf3, 0xf4, 0xf5, 0xf6, 0xf7,
 287   /* 0xf8 */ 0xf8, 0xf9, 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff
 288 };

where all of these values could be stored as 1 byte, since they're supposed to represent characters. But, on line 255, at the end, they use 0xffffffff which is explicitly 4 bytes. (Same thing goes for toupper)

I'm not too familiar with string/charset locales and the like, but thinking more about it now, it may be a uint32_t because different locales could use multi-byte characters? The easy answer would be to just conform to the size used in the C source code. That might be the best idea to save any potential headaches down the road... Do you think this would affect performance or is accuracy the utmost importance? I can change it pretty easily.

Maybe it's time for us to debug it?

I tried running the test file python test_inspect.py on my machine, and it passes... I'm not too familiar on how I would go about setting up an environment like is found on Travis in order to debug it that way (aside from copying all the commands manually into a VM). If you have any resources on that, I'd be happy to give it a shot.

Thank you!

ltfish commented 7 years ago

Do you think this would affect performance or is accuracy the utmost importance?

Accuracy is the priority. And to be honest, I don't see how it's going to impact performance.

ekilmer commented 7 years ago

It looks like the Simuvex tests passed now. I think it had something to do with declaring an angr Project in global scope in my test file... I just moved it into each test function now. There's probably a better way to use the same Project for all test function.

Take your time to review it, and good luck with your deadlines!

ekilmer commented 7 years ago

@ltfish I think the Travis test might need to be restarted? Looks like it failed cloning one of the repos.

Also, is the way I set up the binaries for testing acceptable? Or would it be better to move them into the angr/binaries repo? I thought it would make sense to have the binaries organized this way because they are small and very specific to each libc function.

Do you have any thoughts on future organization or methods for testing libc functions? Thank you.

rhelmot commented 7 years ago

we have many binaries in the binaries repo that are only used for a single testcase! Don't worry about crowding it up for testing individual functions.

If you're worried about crowding up the angr test suite, you could stick the tests in simuvex, but you have to not use angr for loading then.

ekilmer commented 7 years ago

Understood. I'll work on refactoring this in a bit to move the tests into angr/binaries repo. Thanks!

ekilmer commented 7 years ago

Submitted these referenced PRs also: angr/angr#369 and angr/binaries#11