Vector35 / binaryninja-api

Public API, examples, documentation and issues for Binary Ninja
https://binary.ninja/
MIT License
839 stars 194 forks source link

String detection misses short strings #5547

Open joelreymont opened 3 weeks ago

joelreymont commented 3 weeks ago

Version and Platform (required):

Internal binary major dine favor.

DjiAircraftInfo_GetAircraftTypeByCameraVersion...

Note that BN picks up some of the strings but misses the others

00047dd0      else if (strcmp(s1: &reply_payload:2, s2: &data_127668) == 0)
00047e50          *type = DJI_AIRCRAFT_TYPE_M30
00047e44      else if (strcmp(s1: &reply_payload:2, s2: "M30T") == 0)
00047e7c          *type = DJI_AIRCRAFT_TYPE_M30T
00047e70      else if (strcmp(s1: &reply_payload:2, s2: &data_127678) == 0)
00047ea8          *type = DJI_AIRCRAFT_TYPE_M3E
00047e9c      else if (strcmp(s1: &reply_payload:2, s2: &data_127680) == 0)
00047ed4          *type = DJI_AIRCRAFT_TYPE_M3T
00047ec8      else if (strcmp(s1: &reply_payload:2, s2: &data_127688) == 0)
00047f00          *type = DJI_AIRCRAFT_TYPE_M3D
00047ef4      else if (strcmp(s1: &reply_payload:2, s2: "M20E") == 0)
00047f2c          *type = DJI_AIRCRAFT_TYPE_M3D
00047f20      else if (strcmp(s1: &reply_payload:2, s2: "M3TD") == 0)
00047f58          *type = DJI_AIRCRAFT_TYPE_M3TD
00047f4c      else if (strcmp(s1: &reply_payload:2, s2: "M20T") != 0)

What the data looks like

00127668  data_127668:
00127668                          4d 33 30 00 00 00 00 00                                                          M30.....

00127670  char const data_127670[0x5] = "M30T", 0

00127675                                                                 00 00 00                                               ...
00127678  data_127678:
00127678                                                                          4d 33 45 00 00 00 00 00                          M3E.....
00127680  data_127680:
00127680  4d 33 54 00 00 00 00 00                                                                          M3T.....
00127688  data_127688:
00127688                          4d 33 44 00 00 00 00 00                                                          M3D.....

00127690  char const data_127690[0x5] = "M20E", 0

00127695                                                                 00 00 00                                               ...

00127698  char const data_127698[0x5] = "M3TD", 0

0012769d                                                                                         00 00 00                               ...

001276a0  char const data_1276a0[0x5] = "M20T", 0
joelreymont commented 3 weeks ago

Going through the data and marking the missing bits as C strings (A) does the trick.

xusheng6 commented 3 weeks ago

This is related to how our string detection works -- it scans for strings and create a list of strings. During this process, short strings (length <4) are ignored to avoid too many false positives.

This should be improved and integrate information from multiple sources. For example, if we see a code xref to a candidate string, even if it is short, we should still create a string there. A more common case is code like printf("%d", ...), and right now we miss the string "%d" because it is short

xusheng6 commented 3 weeks ago

Related to https://github.com/Vector35/binaryninja-api/issues/3428. Note they are not duplicates -- this issue talks about detecting these strings automatically, while #3428 talks about maintaining them inside of the list of strings

ccarpenter04 commented 3 weeks ago

This is related to how our string detection works -- it scans for strings and create a list of strings. During this process, short strings (length <4) are ignored to avoid too many false positives.

This should be improved and integrate information from multiple sources. For example, if we see a code xref to a candidate string, even if it is short, we should still create a string there. A more common case is code like printf("%d", ...), and right now we miss the string "%d" because it is short

So I've never had a complaint with the string detection logic... but now since you brought it up I suddenly find myself noticing cases where they could use some improvement and it would make my life easier if the usage context were to be evaluated around any "potential strings" that were initially skipped over for cases just like the printf example mentioned. I set my minimum string length to 3 typically so there are fewer cases that are skipped over (and a bit more garbage) but I would probably move it back to 4 if there were some context based cases being taken into consideration.