NLnetLabs / simdzone

Fast and standards compliant DNS zone parser
BSD 3-Clause "New" or "Revised" License
63 stars 11 forks source link

Review comments #161

Closed k0ekk0ek closed 5 months ago

k0ekk0ek commented 6 months ago

@wcawijngaards is reviewing main. A summary of his comments so-far:

  1. src/westmere/bits.h:37 prefix_xor references avx2 in the comment, but that is haswell.
  2. In check_sshfp_rdata, c>=n, should that not error on c==n ?
  3. compat/getopt.h:2, there is placeholder text in 'some useful comment'.
  4. include/zone/attributes.h:31 the text is duplicated at line 20, up to line 40, for has_attribute.
  5. include/zone.h:208 in some places there are 'FIXME' comments.
  6. For a compile on mingw crosscompiler, change src/generic/endian.h #if _MSC_VER to #if _WIN32 . (fixed by #168).
  7. In src/zone.c change Windows.h to windows.h. It can then also link in crosscompile.
  8. zone-bench links with -lshlwapi added, -lssp, but without compat/getopt.c that is not needed for the croscompile.
  9. src/generic/wks.h:254 That returns -1 for a failure, but in src/generic/types.h:432 it only checks for true and false. When it returns -1, the variable port is undefined.
  10. src/fallback/name.h:51 if (t != te || w >= we). If that is w>we, then it would allow length 255, and that is a length that should be allowed?
  11. src/fallback/scanner.h:36 parser->file->lines.tail[0] += *(start + 1) == '\n'; That reads one past the start pointer, but the start pointer can equal 'end', and then it is one past the allowed end. Also at src/fallback/scanner.h:64. If it then continues to parse at the next block, where is_escaped is true, then at line 85 it again compares that byte with newline and increments the line count. That would make the line count incremented one time too many. This happens if start is 'end-1' at line 36 and 64.
  12. src/fallback/scanner.h:49 parser->file->lines.tail[0] -= *end == '\n'; . This dereferences the end pointer, buf that is one past the end of the buffer, and not allowed? Also at line 73. (see comment about padded operation below)
  13. src/fallback/text.h:22 d[2] = (uint8_t)text[3] - '0'; the routines that call unescape, and also there, it does not check for the availability of the next two bytes in the buffer. After the read beyond end of buffer, it flags it as a parse error. (see comment about padded operation below)
  14. src/generic/apl.h, dereferences text without check of length. (see comment about padded operation below) (apl requires contiguous tokens, so it has a length of at least one)
  15. in src/generic/cert.h:94 The second '&' is not needed.
  16. Also the input&= is in src/generic/dnssec.h:135.
  17. src/generic/eui:43 eui_base16_dec_loop_generic_32_inner(input+12, rdata->octets+4, 1)) This calls at position 12, but s[5] is then input[17] and that is one past the end of the buffer. Also wouldn't input that is like 'xx+xx+xx-xx-xx-xx' and with dash at 0, cause the comparison to do 0^0^0, is 0 and not fail to reject the invalid syntax.
  18. src/generic/parser.h:274 fread fills complete buffer while variable was just increased to previous + ZONE_WINDOW_SIZE, probably an error.
  19. build of tests gives a warning: tests/include.c:79 warning: the use of tempnam' is dangerous, better use 'mkstemp'. Unbound uses snprintf(buf, sizeof(buf), "/tmp/unbound.unittest.%u.%d", (unsigned)getpid(), i);, which compiles without warnings. Code checkers will fall over this, again and again. It is therefore preferred not to use these kinds of functions.
  20. No error is printed when the $INCLUDE directive references a non-existing file(?)
lemire commented 6 months ago

The buffer overflows should be trivially detected by compiling with address sanitizers, which you can do in CI. We do it for simdjson and other projects of mine.

k0ekk0ek commented 6 months ago

A comment by a colleague to look into.

Something that is maybe hard to find, but now that I'm looking at the hex and b64 encoders, is that the "0" string may be used to denote a a zero length rdata field. For the NSEC3 salt that's - for the zero length salt field. This is possibly something from dnssec-bis-updates

k0ekk0ek commented 6 months ago

Support for enabling sanitizers was already available in the CMakeLists.txt. I've enabled it in #166.

wcawijngaards commented 6 months ago
  1. src/generic/gpos.h:180,183 uses 9u > .. but earlier the check is for 10u > .. for similar loop code in parse_longitude and parse_latitude. There are no '9' characters in the altitude?
  2. src/generic/ip6.h:38 inet_pton4 The comment says the return value is 0 or 1, but it returns -1 or a length. Also at line 88, inet_pton6 comment is wrong.
  3. src/generic/ip6.h:71 inet_pton4, return 0 probably should be return -1.
  4. src/generic/ip6.h:136 inet_pton6, return 0 probably should be return -1.
  5. src/generic/ip6.h:141 inet_pton6, return 0 probably should be return -1.
  6. src/generic/nsap.h:31 If the input is aa, then both the comparisons with '0' and 'X' are false, and 0^0 is 0, so it allows that sequence. But I guess it should reject it, because then it skips it with data+=2 later.
  7. src/generic/nsec.h:16 The size of one window is 256/8 = 32, so 32 + 2 instead of 256 + 2, that should suffice as the size of the bitmap for one window. That also changes the calculation of the static assert on line 30.
  8. src/generic/number.h:89 after the memcpy it also assigns the number to the rdata->octets output one line later, it copies number twice.
  9. src/generic/parser.h:268: the data passed to realloc is not a pointer that was previously returned by malloc, but parser->file->buffer.data + parser->file.buffer.index, perhaps use realloc(parser->file->buffer.data, ..).
  10. src/generic/parser.h:663: maybe_take_quoted has length without -1, but other quoted token length has delimiters.head-fields.head-1. It sets data++ allright.
  11. src/generic/svcb.h:517 memcmp has length 0, but needs 3 to compare "key".
  12. src/generic/svcb.h:608 this disallows key 0, which is not allowed anyway, but calls it a duplicate. It also does not detect keys always because at line 601 that would need to break for <= to find equal items, also at 702. It is <= at line 962.
  13. src/generic/time.h:58 There is no limit on the year to a maximum, the overflow is handled with serial arithmetic. Also it should be possible to read a 32bit timestamp from an integer number of seconds.
  14. src/generic/type.h:80 The classes array is one larger, like <= comparison. Also in src/westmere/type.h:80.
  15. src/generic/types.h:141 sometimes TYPE0 is used for debug set ups, so it should be valid to encounter, in a type bitmap, or as unknown RR type.
  16. src/generic/types.h:199: If there is type 30 and 300, it has windows 0 and 1. Then !window != !last_window when it reads type 300, since the current window is 1 and the last is 0, the inversion is 0 != 1 and it rejects this? It may also be useful to reject duplicates, window == last_window. It may also be useful to reject blocks == 0, the empty NSEC has no windows.
  17. src/generic/types.h:253 These comments look like they are from older edits.
  18. src/generic/types.h:1125 The NXT field for name is 0, change f[3] to f[0].
  19. src/generic/types.h:1373 The second field, has o, n instead of o+c, n-c.
  20. src/generic/types.h:1379 The missing field could be field 0 or 1, so something like NAME(&f[n!=0]).
  21. src/generic/types.h:1482 No gateway has no rdata octets, there needs to be break; after this case, it cannot fall through to allow a name.
  22. src/generic/wks.h:75 These comments are leftovers from edits.
  23. src/generic/wks.h:95 Parentheses are wrongly placed and variable digit contains the ascii character, not the char - '0' value. Line 251 has working assignment to digit.
  24. src/zone.c:270 Since file->name is NULL, and file->path is not_a_file, it attempts to free not_a_file after string parse.
  25. tests/svcb.c:164 The comment says length 9, but it has length 5.
  26. tests/time.c:62 RRSIG timestamps are compared with serial arithmetic, and that means the lower 32bits of the timestamp has to be stored. A date with a large value should have its lower 32bits as the timestamp. In RFC 4034 3.1.5 it says serial number arithmetic is used. And in RFC 4035 5.3.1 it says that the values are compared with the current time. They must be within 68 years of the current time to be considered valid. The parser should not reject other dates, but should store the lower 32bits.
  27. tests/time.c:97 size is too small by roughly the length of the tests[i].timestamp string.

This finished the review, the code looks nice! The speed up in zone parsing is very nice for large datasets.

wcawijngaards commented 6 months ago

Nr. 44. is now fixed by the commit that fixes the comparison to use file->path. So it no longer attempts to free not_a_file.

k0ekk0ek commented 5 months ago

Nr 32. not entirely correct. It only enters the loop if key <= highest_key (so at least one key has been read). If it then detects the key is duplicate, the message is correct because there's indeed a duplicate key. I check for automatically mandatory keys after parsing the list.

wcawijngaards commented 5 months ago

You are right because of that it works to detect key 0, because there is always at least one key already read. There is then still the other parts about not always finding equal keys, or about the difference in code lines: line 601 if (key < smaller_key), line 695 if (key < smaller_key) and line 965 if (key <= smaller_key).

wcawijngaards commented 5 months ago

That was just fixed in the commit just now, the difference in code lines.

k0ekk0ek commented 5 months ago

I've updated the documentation about "reading past the buffer" in #169. It's not nearly enough to properly explain, but it basically comes down to the scanner guaranteeing only complete tokens are returned. We really don't want to concern the parser with having to deal with partial tokens or anything like that. The input is also always padded with 64-bytes. Combining those two facts, a lot of parsers trust that while they read a character that is part of the contiguous set of characters the next one is either part of that set or a delimiter. For example, if I'm reading an escaped sequence, as long as I'm reading digits it cannot be a delimiter (either for quoted or contiguous) and since the input is guaranteed to be padded I'm save in reading the data. It may be worthwile changing this for the fallback implementation eventually though (see #174)