k0ekk0ek commented 1 year ago

64 discards the secondary index (specifically for length) that was previously used for parsing domain names (a data type that occurs quite a lot in zone data :sweat_smile:). With that change I focused on scanning for (described on #30) characters that are not part of the contiguous set. Since some of the characters overlap we need 2`pshufb` + 2`cmpeq` + `or` (the way it is now, #30 describes filtering the input), we then follow-up with doing a `cmpeq('\\')` and a `cmpeq('"')` (this is all somewhat modeled after the string parsing in simdjson btw). Reading up on #65, I also read Wojciech Muła's SIMD-ized faster parse of IPv4 addresses article. In there he uses a clever trick to find digits.

const __m128i ascii0 = _mm_set1_epi8(-128 + '0');
const __m128i rangedigits = _mm_set1_epi8(-128 + ('9' - '0' + 1));

const __m128i t1 = _mm_sub_epi8(input, ascii0);
const __m128i t2 = _mm_cmplt_epi8(t1, rangedigits);

Maybe we can use a range check too to parse domain names faster. Most names use the Preferred name syntax and rarely do names contain escape sequences. If an escape sequence is encountered, the algorithm stops at that character, processes the single escaped character and picks-up from the first character after the sequence.

And instead of checking for \ separately we can replace that with checking if the block contains something interesting to start with?

k0ekk0ek commented 1 year ago

Rules regarding domain names and behavior of other implementations is now documented in FORMAT.md (#71), so that we at least have an idea of the constraints are.

k0ekk0ek commented 1 year ago

Checking for delimiters can be as simple as outlined on #6 (specific comment). The real speed-up will be in replacing the dots by the label lengths. Currently that is a scalar loop using bit manipulation instructions. I've been contemplating if there's a convenient way to write them out in bulk. So, first do vectorized classification for delimiters, backslashes (escape sequences) and dots.

Then, something like broadcasting the current length to a register, add 0-15 using simd, get the relevant numbers next to each other and use something like hsub (vpcompress would sure help, but maybe there's another way? Perhaps PEXT? 64-bits is enough to hold 0-15 values for a 16-byte vector...). We'd need to save the last position in case the label spans blocks (very likely).

@lemire, @aqrit, I know #28 is still in progress, but once that's done (great work btw, very much appreciated!), this is probably a very interesting issue to dive into? This format is so common and yet no results when searching for optimized parsing... It goes without saying, but only if you're interested and can spare the time.

lemire commented 1 year ago

I will have a look.

aqrit commented 1 year ago

I have thoughts on emulating vpcompressb. However, I doubt that anything would beat a simdjson style bitmask_to_offset scalar loop here...?

The general problem with PEXT is that it isn't available pre-avx2 and is very slow (~250 cycles) on AMD before Zen3.

lemire commented 1 year ago

If you require AVX-512 then you can safely use pext.

k0ekk0ek commented 1 year ago

Possibly, but if we require AVX-512 then just using compress would be easier? @aqrit, indeed, it's likely the algorithm cannot be made faster, but its worth taking a second look? Ideally we can speed it up using SSE/AVX, but even if there's only a speedup for AVX-512, that'd be very interesting. It may be possible to use a trick like the one @lemire used for IPv4 conversion (generate hash from the bitmask), but my initial thought is there are too much possibilities and the table would get too big? Benchmarking is probably a good way to start. This was my first stab, so having a feel for the overall speed is probably the way to get started. At least, that's what I intend to do once I get some other things done.

lemire commented 1 year ago

@k0ekk0ek It would be helpful to define precisely exactly what the problem. I thought I understood, but now I am very confused.

Can you describe a benchmark? That is, describe (at least with examples) what the input is, and what the desired output is.

This was my first stab, so having a feel for the overall speed is probably the way to get started.

Just so we are clear... this is the function we are talking about, right?

zone_nonnull_all
static zone_really_inline int32_t parse_name(
  zone_parser_t *parser,
  const zone_type_info_t *type,
  const zone_field_info_t *field,
  const token_t *token)
{
  int32_t r;
  size_t n = 0;
  uint8_t *o = &parser->rdata->octets[parser->rdata->length];

  if (zone_likely(token->code == CONTIGUOUS)) {
    // a freestanding "@" denotes the current origin
    if (token->data[0] == '@' && !is_contiguous((uint8_t)token->data[1]))
      goto relative;
    r = scan_contiguous_name(parser, type, field, token, o, &n);
    if (r == 0)
      return (void)(parser->rdata->length += n), ZONE_NAME;
    if (r < 0)
      return r;
  } else if (token->code == QUOTED) {
    r = scan_quoted_name(parser, type, field, token, o, &n);
    if (r == 0)
      return (void)(parser->rdata->length += n), ZONE_NAME;
    if (r < 0)
      return r;
  } else {
    return have_string(parser, type, field, token);
  }

relative:
  if (n > 255 - parser->file->origin.length)
    SYNTAX_ERROR(parser, "Invalid %s in %s", NAME(field), TNAME(type));
  memcpy(o+n, parser->file->origin.octets, parser->file->origin.length);
  parser->rdata->length += n + parser->file->origin.length;
  return ZONE_NAME;
}

Presumably, the hot path is scan_contiguous_name? It calls scan_name. I was expecting scan_name to parse a domain name, but it appears to be parsing an ipv4 sequence?

k0ekk0ek commented 1 year ago

@lemire, indeed, that's the function. Domain names are usually non-quoted (I haven't seen any instance where they're quoted, but it's allowed according to the specification).

The input would be domain names (99.99% of the time it will just be valid host names, but we need to handle all inputs). Domain names may contain any octet between 0-255, while host names are limited to [0-9a-zA-Z-]. For testing/benchmarking we could use the Cisco Umbrella Popularity List and as a workaround do not require the trailing dot (root). e.g. the browser will not require foo.bar to have a trailing dot for user friendliness, but zone data does or the name is treated as relative and the origin will be appended.

The output would be wire format. e.g. foo.bar. would be 3foo3bar0 on the wire (numbers are actual length not ASCII numbers).

The goal is to convert the dots to lengths of the labels (e.g. foo.bar. has three labels of: foo, bar and <root>). The copying is currently modeled after just string copying in simdjson. So, load a vector, write it out and check for delimiter and backslashes. I doubt we can speed that part up. However, there is a scalar loop that handles replacing the dots by label lengths, we may be able to optimize that further(?)

My thinking is to write out the lengths out as much as possible using the same _mm_storeu_epi8. Getting the dotmask is as easy as _mm_cmpeq_epi8 on . (dot). If we have a vector containing the absolute lengths and use compress we can calculate the label lengths? e.g. for foo.bar. the vector would be [0, 1, 2, 3, 4, 5, 6, 7]. If we're able to isolate the lengths at the dots and get them next to each other, in this case [3, 6], we can subtract using SIMD and end-up with [3, 3], if we can then get those back to their original position, we can write them out using SIMD(?)

I should note that this is just an idea, I have not tested this yet. I thought I'd share to see if others people have ideas....

lemire commented 1 year ago

@k0ekk0ek Very helpful.

aqrit commented 1 year ago

the table would get too big?

For SSE4.1:

A table to cover all combinations 16 bytes would require 1 MiB.

~~In theory, If a dot can never be next to another dot then table size could be 0.5 MiB (but that still leaves the challenge of generating a table index...). [edit: I figured this wrong... it would be smaller than 0.5 MiB]~~

edit2: If no dots are adjacent then 16-bytes would have 2584 combinations. Which would work out to a table size of ~40KiB. Each byte position is assigned a weight, weights are summed to get the combination index. Not sure if this would work out better than a pack..?

A table that covers all combinations of 8 bytes would require ~440 bytes. So the trick is to fixup the last length in the previous 8 bytes somehow... However, this might still end up competitive...

k0ekk0ek commented 1 year ago

With domain names having dots next to each other is illegal. Each fully qualified domain name has exactly one null-label (the root, or trailing dot, which is often implicit in URLs). Smaller tables could work, but like you say we'd have to account for the first block in the second block, which basically comes down to subtracting the number of dots so far(?) Or, adding the number of dot's so far so we can use the same table(?) Of course, if we decide to do it in multiple stages, it's easier to port to AVX2(?) A fully qualified domain name will exceed 16 bytes quite quickly (looking at the zone data for .com).

aqrit commented 1 year ago

For SSE4.1,

Previously, I was thinking we'd want the lengths pre-computed and pre-positioned. However, doing both a compress and an expand operations, I (now) think we'd need 1 KiB... Obviously, the compress table could be smaller, but I haven't look into how to shrink the expand table, yet. Also this would be ~20 instructions to do 16 bytes of input?

Edit: An expand operation could be based on a prefix-sum so we'd probably not use a table. So we could use a small ~256 byte table for compress.

pseudocode:

const __m128i idx = _mm_set_epi8(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
const __m128i lobyte_mask = _mm_set1_epi16(0x00FF);

__m128i v = _mm_loadu_si128(src);

__m128i is_dot = _mm_cmpeq_epi8(v, _mm_set1_epi8(0x2E));

// if (hi_byte is a dot) then lo_byte++
__m128i offsets = _mm_and_si128(_mm_sub_epi8(idx, _mm_srli_epi16(is_dot, 8)), lobyte_mask);

__m128i packed = _mm_packus_epi16(is_dot, offsets);

uint32_t shuf_id = (uint32_t)_mm_movemask_epi8(packed) & 0x7F;

// gather ... 1 KiB lookup table
_mm_shuffle_epi8(packed, _mm_loadl_epi64(table[shuf_id])) // compress offsets

// delta encode (shift and subtract)
__m128i len = todo...

// expand... gather to undo the previous gather...
(table is packed into the hi_nibble of the compress table indices)
todo...

// insert lengths into dot positions of source string
len = _mm_unpacklo_epi8(...) // expand each byte to fill a word
v = _mm_blendv_epi8(v, len, is_dot);

lemire commented 1 year ago

I have a prototype. I have not looked at @aqrit code yet.

https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/08/pending

My results show some promise...

name_to_dnswire_simd           :   0.92 GB/s   97.9 Ma/s  10.22 ns/d   3.19 GHz  32.64 c/d  84.02 i/d    3.5 c/b   8.98 i/b   2.57 i/c 
name_to_dnswire                :   0.51 GB/s   54.0 Ma/s  18.52 ns/d   3.19 GHz  59.14 c/d  116.61 i/d    6.3 c/b  12.46 i/b   1.97 i/c

My approach is not something I would propose, but it works... with some limitations.

lemire commented 1 year ago

Here is my code....


static inline __m128i left_shift_bytes(__m128i x, int count) {
  // We would like to shift by count bytes, but it cannot be done directly
  // without immediates
  __m128i p1 = _mm_sll_epi64(x, _mm_cvtsi64_si128(count * 8));
  __m128i p2 = _mm_srl_epi64(_mm_unpacklo_epi64(_mm_setzero_si128(), x),
                             _mm_cvtsi64_si128(64 - count * 8));
  return _mm_or_si128(p1, p2);
}

// This version processes at most 15 bytes from the input. A fallback would
// be necessary to use such code in production. TODO.
size_t name_to_dnswire_simd(const char *src, uint8_t *dst) {
  const char *srcinit = src;
  // Each label may contain from 1 to 63 octets. The empty label is
  // reserved for the root node and when fully qualified is expressed
  // as the empty label terminated by a dot.
  // The full domain name may not exceed a total length of 253 ASCII characters
  // in its textual representation.
  //
  // It is likely that many name fit under 16 bytes, however.

  // We do vectorized classification to validate the content.
  // We want to allow 0x30 to 0x39 (digits)
  // The hyphen 0x2d.
  // The dot 0x2e.
  // The lower-cased letters 0x61-0x6f (a-o), 0x70-0x7a (p-z)
  // The upper-cased letters 0x41-0x4f (A-O), 0x50-0x5a (P-Z)

  const char DIGIT = 0x01;
  const char HYPHENDOT = 0x02;
  const char LETTERAO = 0x04;
  const char LETTERPZ = 0x08;
  static int8_t zero_masks[32] = {0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
                                  0,  0,  0,  0,  0,  -1, -1, -1, -1, -1, -1,
                                  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
  __m128i highnibbles =
      _mm_setr_epi8(0, 0, HYPHENDOT, DIGIT, LETTERAO, LETTERPZ, LETTERAO,
                    LETTERPZ, 0, 0, 0, 0, 0, 0, 0, 0);
  __m128i lownibbles =
      _mm_setr_epi8(LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ, LETTERAO, LETTERAO,
                    LETTERAO | HYPHENDOT, LETTERAO | HYPHENDOT, LETTERAO);
  // we always insert a fake '.' initially.
  __m128i input = _mm_loadu_si128((const __m128i *)src);
  input = _mm_alignr_epi8(input, _mm_set1_epi8('.'), 15);
  src -= 1; // we pretend that we are pointing at the virtual '.'.

  // We could possibly 'upper case everything' if we wanted to.
  //   __m128i inputlc = _mm_or_si128(input, _mm_set1_epi8(0x20));
  __m128i low = _mm_shuffle_epi8(
      lownibbles,
      input); // no need for _mm_and_si128(input,_mm_set1_epi8(0xf)) because
              // if high bit is set, there is no match
  __m128i high = _mm_shuffle_epi8(
      highnibbles, _mm_and_si128(_mm_srli_epi64(input, 4), _mm_set1_epi8(0xf)));
  __m128i classified =
      _mm_cmpeq_epi8(_mm_and_si128(low, high), _mm_setzero_si128());
  // m cannot be zero!!!
  unsigned m = (unsigned)_mm_movemask_epi8(
      classified); // should be 1 wherever we have a match.
  uint16_t length = (uint16_t)__builtin_ctz((unsigned int)m);
  src += length;
  __m128i zero_mask = _mm_loadu_si128((__m128i *)(zero_masks + 16 - length));
  // masking with '.'
  input = _mm_blendv_epi8(input, _mm_set1_epi8('.'), zero_mask);
  //
  __m128i dots = _mm_cmpeq_epi8(input, _mm_set1_epi8('.'));

  unsigned int mask = (unsigned)_mm_movemask_epi8(dots);
  __m128i sequential =
      _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
  const __m128i dotscounts = _mm_and_si128(dots, sequential);

  // We proceed to compute a shuffle mask that brings all counters/dots
  // together. We could do it with a single large table (2**16 * 128 bytes)
  // or 8 MB. Instead, we work from a 2 kB table and do more computation.
  mask = mask ^ 0xFFFF; // negate
  unsigned mask1 = (unsigned)(mask & 0xFF);
  int pop = 8 - __builtin_popcount(mask1);
  unsigned mask2 = (mask >> 8);
  __m128i m1 = _mm_loadl_epi64((const __m128i *)(thintable_epi8 + mask1));
  __m128i m2 = _mm_loadl_epi64((const __m128i *)(thintable_epi8 + mask2));
  __m128i m2add = _mm_add_epi8(m2, _mm_set1_epi8(8));
  __m128i m2shifted = left_shift_bytes(m2add, pop);
  __m128i shufmask = _mm_or_si128(m2shifted, m1);
  // The shuffle mask has been computed.
  // We also need the *reverse* mask which we compute with a prefix sum !
  __m128i dotones = _mm_and_si128(dots, _mm_set1_epi8(1));
  dotones = _mm_add_epi8(dotones,
                         _mm_alignr_epi8(dotones, _mm_setzero_si128(), 16 - 1));
  dotones = _mm_add_epi8(dotones,
                         _mm_alignr_epi8(dotones, _mm_setzero_si128(), 16 - 2));
  dotones = _mm_add_epi8(dotones,
                         _mm_alignr_epi8(dotones, _mm_setzero_si128(), 16 - 4));
  dotones = _mm_add_epi8(dotones,
                         _mm_alignr_epi8(dotones, _mm_setzero_si128(), 16 - 8));
  dotones = _mm_sub_epi8(dotones, _mm_set1_epi8(1));
  // Ok, dotones contains the reverse shuffle mask
  // Pheeewww... This was a lot of work.
  const __m128i packed_dotscounts = _mm_shuffle_epi8(dotscounts, shufmask);
  // Need to subtract the counters.
  // If there is an overflow, then we had two successive dots, we should error:
  // TODO.
  __m128i diffed_packed_dotscounts =
      _mm_sub_epi8(_mm_alignr_epi8(_mm_setzero_si128(), packed_dotscounts, 1),
                   packed_dotscounts);
  // need to subtract one to the counters.
  diffed_packed_dotscounts =
      _mm_sub_epi8(diffed_packed_dotscounts, _mm_set1_epi8(1));
  // send it back...
  __m128i magic = _mm_shuffle_epi8(diffed_packed_dotscounts, dotones);
  // shift it
  __m128i marked_input = _mm_blendv_epi8(input, magic, dots);
  _mm_storeu_si128((__m128i *)dst, marked_input);
  // dst += 16;
  return (size_t)(src - srcinit);
}

It only processes the first 15 bytes of the input.

lemire commented 1 year ago

(Please note that this code is not something I seriously consider, I am only stating that it works and it is faster than a scalar implementation.)

lemire commented 1 year ago

My current thoughts are that a compress/expand routine is probably not great for SSE. This would work well with AVX-512, but not so well here. I am going to try another design.

lemire commented 1 year ago

Ok. I have an approach without compress/expand. No table needed. It is 2.7 times faster than the scalar approach. It is also relatively simple.

name_to_dnswire_simd_fast      :   1.35 GB/s  144.7 Ma/s   6.91 ns/d   3.19 GHz  22.08 c/d  59.02 i/d    2.4 c/b   6.31 i/b   2.67 i/c 
name_to_dnswire_simd           :   0.92 GB/s   98.1 Ma/s  10.19 ns/d   3.19 GHz  32.55 c/d  84.02 i/d    3.5 c/b   8.98 i/b   2.58 i/c 
name_to_dnswire                :   0.50 GB/s   53.0 Ma/s  18.86 ns/d   3.19 GHz  60.23 c/d  116.61 i/d    6.4 c/b  12.46 i/b   1.94 i/c

The code is as follows... No doubt it can be made better...


size_t name_to_dnswire_simd_fast(const char *src, uint8_t *dst) {
  const char *srcinit = src;
  // Each label may contain from 1 to 63 octets. The empty label is
  // reserved for the root node and when fully qualified is expressed
  // as the empty label terminated by a dot.
  // The full domain name may not exceed a total length of 253 ASCII characters
  // in its textual representation.
  //
  // It is likely that many name fit under 16 bytes, however.

  // We do vectorized classification to validate the content.
  // We want to allow 0x30 to 0x39 (digits)
  // The hyphen 0x2d.
  // The dot 0x2e.
  // The lower-cased letters 0x61-0x6f (a-o), 0x70-0x7a (p-z)
  // The upper-cased letters 0x41-0x4f (A-O), 0x50-0x5a (P-Z)

  const char DIGIT = 0x01;
  const char HYPHENDOT = 0x02;
  const char LETTERAO = 0x04;
  const char LETTERPZ = 0x08;
  static int8_t zero_masks[32] = {0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
                                  0,  0,  0,  0,  0,  -1, -1, -1, -1, -1, -1,
                                  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
  __m128i highnibbles =
      _mm_setr_epi8(0, 0, HYPHENDOT, DIGIT, LETTERAO, LETTERPZ, LETTERAO,
                    LETTERPZ, 0, 0, 0, 0, 0, 0, 0, 0);
  __m128i lownibbles =
      _mm_setr_epi8(LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ, LETTERAO, LETTERAO,
                    LETTERAO | HYPHENDOT, LETTERAO | HYPHENDOT, LETTERAO);
  // we always insert a fake '.' initially.
  __m128i input = _mm_loadu_si128((const __m128i *)src);
  input = _mm_alignr_epi8(input, _mm_set1_epi8('.'), 15);
  src -= 1; // we pretend that we are pointing at the virtual '.'.

  // We could possibly 'upper case everything' if we wanted to.
  //   __m128i inputlc = _mm_or_si128(input, _mm_set1_epi8(0x20));
  __m128i low = _mm_shuffle_epi8(
      lownibbles,
      input); // no need for _mm_and_si128(input,_mm_set1_epi8(0xf)) because
              // if high bit is set, there is no match
  __m128i high = _mm_shuffle_epi8(
      highnibbles, _mm_and_si128(_mm_srli_epi64(input, 4), _mm_set1_epi8(0xf)));
  __m128i classified =
      _mm_cmpeq_epi8(_mm_and_si128(low, high), _mm_setzero_si128());
  // m cannot be zero!!!
  unsigned m = (unsigned)_mm_movemask_epi8(
      classified); // should be 1 wherever we have a match.
  uint16_t length = (uint16_t)__builtin_ctz((unsigned int)m);
  src += length;
  __m128i zero_mask = _mm_loadu_si128((__m128i *)(zero_masks + 16 - length));
  // masking with '.'
  input = _mm_blendv_epi8(input, _mm_set1_epi8('.'), zero_mask);
  //
  __m128i dots = _mm_cmpeq_epi8(input, _mm_set1_epi8('.'));

  __m128i sequential =
      _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
  __m128i dotscounts = _mm_and_si128(dots, sequential);

  __m128i marked = dots;
  dotscounts = _mm_blendv_epi8(
      _mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 1), dotscounts, marked);
  marked =
      _mm_or_si128(marked, _mm_alignr_epi8(_mm_setzero_si128(), marked, 1));

  dotscounts = _mm_blendv_epi8(
      _mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 2), dotscounts, marked);

  marked =
      _mm_or_si128(marked, _mm_alignr_epi8(_mm_setzero_si128(), marked, 2));
  dotscounts = _mm_blendv_epi8(
      _mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 4), dotscounts, marked);
  marked =
      _mm_or_si128(marked, _mm_alignr_epi8(_mm_setzero_si128(), marked, 4));
  dotscounts = _mm_blendv_epi8(
      _mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 8), dotscounts, marked);

  __m128i next = _mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 1);
  dotscounts = _mm_sub_epi8(next, dotscounts);

  // need to subtract one to the counters.
  dotscounts = _mm_sub_epi8(dotscounts, _mm_set1_epi8(1));

  // shift it
  __m128i marked_input = _mm_blendv_epi8(input, dotscounts, dots);
  _mm_storeu_si128((__m128i *)dst, marked_input);
  // dst += 16;
  return (size_t)(src - srcinit);
}

lemire commented 1 year ago

Ok. I am now more than 3x faster...

name_to_dnswire_simd_fast      :   1.62 GB/s  172.9 Ma/s   5.78 ns/d   3.20 GHz  18.48 c/d  52.02 i/d    2.0 c/b   5.56 i/b   2.82 i/c 
name_to_dnswire_simd           :   0.92 GB/s   98.1 Ma/s  10.19 ns/d   3.19 GHz  32.55 c/d  84.02 i/d    3.5 c/b   8.98 i/b   2.58 i/c 
name_to_dnswire                :   0.50 GB/s   53.8 Ma/s  18.57 ns/d   3.19 GHz  59.31 c/d  116.61 i/d    6.3 c/b  12.46 i/b   1.97 i/c

Code:

size_t name_to_dnswire_simd_fast(const char *src, uint8_t *dst) {
  const char *srcinit = src;
  // Each label may contain from 1 to 63 octets. The empty label is
  // reserved for the root node and when fully qualified is expressed
  // as the empty label terminated by a dot.
  // The full domain name may not exceed a total length of 253 ASCII characters
  // in its textual representation.
  //
  // It is likely that many name fit under 16 bytes, however.

  // We do vectorized classification to validate the content.
  // We want to allow 0x30 to 0x39 (digits)
  // The hyphen 0x2d.
  // The dot 0x2e.
  // The lower-cased letters 0x61-0x6f (a-o), 0x70-0x7a (p-z)
  // The upper-cased letters 0x41-0x4f (A-O), 0x50-0x5a (P-Z)

  const char DIGIT = 0x01;
  const char HYPHENDOT = 0x02;
  const char LETTERAO = 0x04;
  const char LETTERPZ = 0x08;
  static int8_t zero_masks[32] = {0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
                                  0,  0,  0,  0,  0,  -1, -1, -1, -1, -1, -1,
                                  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
  __m128i highnibbles =
      _mm_setr_epi8(0, 0, HYPHENDOT, DIGIT, LETTERAO, LETTERPZ, LETTERAO,
                    LETTERPZ, 0, 0, 0, 0, 0, 0, 0, 0);
  __m128i lownibbles =
      _mm_setr_epi8(LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
                    LETTERAO | LETTERPZ, LETTERAO, LETTERAO,
                    LETTERAO | HYPHENDOT, LETTERAO | HYPHENDOT, LETTERAO);
  // we always insert a fake '.' initially.
  __m128i input = _mm_loadu_si128((const __m128i *)src);
  input = _mm_alignr_epi8(input, _mm_set1_epi8('.'), 15);
  src -= 1; // we pretend that we are pointing at the virtual '.'.

  // We could possibly 'upper case everything' if we wanted to.
  //   __m128i inputlc = _mm_or_si128(input, _mm_set1_epi8(0x20));
  __m128i low = _mm_shuffle_epi8(
      lownibbles,
      input); // no need for _mm_and_si128(input,_mm_set1_epi8(0xf)) because
              // if high bit is set, there is no match
  __m128i high = _mm_shuffle_epi8(
      highnibbles, _mm_and_si128(_mm_srli_epi64(input, 4), _mm_set1_epi8(0xf)));
  __m128i classified =
      _mm_cmpeq_epi8(_mm_and_si128(low, high), _mm_setzero_si128());
  // m cannot be zero!!!
  unsigned m = (unsigned)_mm_movemask_epi8(
      classified); // should be 1 wherever we have a match.
  uint16_t length = (uint16_t)__builtin_ctz((unsigned int)m);
  src += length;
  __m128i zero_mask = _mm_loadu_si128((__m128i *)(zero_masks + 16 - length));
  // masking with '.'
  input = _mm_blendv_epi8(input, _mm_set1_epi8('.'), zero_mask);
  //
  __m128i dots = _mm_cmpeq_epi8(input, _mm_set1_epi8('.'));

  __m128i sequential =
      _mm_setr_epi8(-128, -127, -126, -125, -124, -123, -122, -121, -120, -119,
                    -118, -117, -116, -115, -114, -113);
  __m128i dotscounts = _mm_and_si128(dots, sequential);

  dotscounts =
      _mm_blendv_epi8(_mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 1),
                      dotscounts, dotscounts);

  dotscounts =
      _mm_blendv_epi8(_mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 2),
                      dotscounts, dotscounts);

  dotscounts =
      _mm_blendv_epi8(_mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 4),
                      dotscounts, dotscounts);

  dotscounts =
      _mm_blendv_epi8(_mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 8),
                      dotscounts, dotscounts);

  __m128i next = _mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 1);
  dotscounts = _mm_sub_epi8(next, dotscounts);

  // need to subtract one to the counters.
  dotscounts = _mm_sub_epi8(dotscounts, _mm_set1_epi8(1));

  // shift it
  __m128i marked_input = _mm_blendv_epi8(input, dotscounts, dots);
  _mm_storeu_si128((__m128i *)dst, marked_input);
  // dst += 16;
  return (size_t)(src - srcinit);
}

Possible assembly (GCC):

name_to_dnswire_simd_fast(char const*, unsigned char*):
        movdqu  xmm1, XMMWORD PTR [rdi]
        mov     edx, 16
        palignr xmm1, XMMWORD PTR .LC0[rip], 15
        movdqa  xmm0, XMMWORD PTR .LC1[rip]
        movdqa  xmm2, XMMWORD PTR .LC3[rip]
        movdqa  xmm3, xmm1
        movdqa  xmm5, xmm1
        pshufb  xmm0, xmm1
        psrlq   xmm3, 4
        pxor    xmm1, xmm1
        pand    xmm3, XMMWORD PTR .LC2[rip]
        pshufb  xmm2, xmm3
        pand    xmm0, xmm2
        pxor    xmm2, xmm2
        pcmpeqb xmm0, xmm2
        movdqa  xmm3, XMMWORD PTR .LC4[rip]
        movdqa  xmm2, xmm1
        movdqa  xmm4, xmm1
        pmovmskb        eax, xmm0
        bsf     eax, eax
        sub     rdx, rax
        sub     rax, 1
        movdqu  xmm0, XMMWORD PTR name_to_dnswire_simd_fast(char const*, unsigned char*)::zero_masks[rdx]
        pblendvb        xmm5, xmm3, xmm0
        movdqa  xmm0, XMMWORD PTR .LC5[rip]
        pcmpeqb xmm3, xmm5
        pand    xmm0, xmm3
        palignr xmm2, xmm0, 1
        pblendvb        xmm2, xmm0, xmm0
        movdqa  xmm0, xmm2
        palignr xmm4, xmm2, 2
        pblendvb        xmm4, xmm2, xmm0
        movdqa  xmm2, xmm1
        movdqa  xmm0, xmm4
        palignr xmm2, xmm4, 4
        pblendvb        xmm2, xmm4, xmm0
        movdqa  xmm4, xmm1
        palignr xmm4, xmm2, 8
        movdqa  xmm0, xmm2
        movdqa  xmm6, xmm4
        pblendvb        xmm6, xmm2, xmm0
        pcmpeqd xmm2, xmm2
        movdqa  xmm0, xmm3
        palignr xmm1, xmm6, 1
        paddb   xmm1, xmm2
        psubb   xmm1, xmm6
        pblendvb        xmm5, xmm1, xmm0
        movups  XMMWORD PTR [rsi], xmm5
        ret

Possible assembly (LLVM):

name_to_dnswire_simd_fast(char const*, unsigned char*):     # @name_to_dnswire_simd_fast(char const*, unsigned char*)
        movdqu  xmm4, xmmword ptr [rdi]
        palignr xmm4, xmmword ptr [rip + .LCPI0_0], 15 # xmm4 = mem[15],xmm4[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14]
        movdqa  xmm0, xmmword ptr [rip + .LCPI0_1] # xmm0 = [9,13,13,13,13,13,13,13,13,13,12,4,4,6,6,4]
        pshufb  xmm0, xmm4
        movdqa  xmm1, xmm4
        psrlq   xmm1, 4
        pand    xmm1, xmmword ptr [rip + .LCPI0_2]
        movdqa  xmm2, xmmword ptr [rip + .LCPI0_3] # xmm2 = [0,0,2,1,4,8,4,8,0,0,0,0,0,0,0,0]
        pshufb  xmm2, xmm1
        pand    xmm2, xmm0
        pxor    xmm0, xmm0
        pcmpeqb xmm0, xmm2
        pmovmskb        eax, xmm0
        rep       bsf ecx, eax
        lea     rax, [rip + name_to_dnswire_simd_fast(char const*, unsigned char*)::zero_masks]
        sub     rax, rcx
        movups  xmm0, xmmword ptr [rax + 16]
        movdqa  xmm1, xmmword ptr [rip + .LCPI0_4] # xmm1 = [46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46]
        pblendvb        xmm4, xmm1, xmm0
        pcmpeqb xmm1, xmm4
        movdqa  xmm3, xmmword ptr [rip + .LCPI0_5] # xmm3 = [128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143]
        pand    xmm3, xmm1
        movdqa  xmm2, xmm3
        psrldq  xmm2, 1                         # xmm2 = xmm2[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],zero
        movdqa  xmm0, xmm1
        pblendvb        xmm2, xmm3, xmm0
        movdqa  xmm3, xmm2
        psrldq  xmm3, 2                         # xmm3 = xmm3[2,3,4,5,6,7,8,9,10,11,12,13,14,15],zero,zero
        movdqa  xmm0, xmm2
        pblendvb        xmm3, xmm2, xmm0
        movdqa  xmm2, xmm3
        psrldq  xmm2, 4                         # xmm2 = xmm2[4,5,6,7,8,9,10,11,12,13,14,15],zero,zero,zero,zero
        movdqa  xmm0, xmm3
        pblendvb        xmm2, xmm3, xmm0
        movdqa  xmm3, xmm2
        psrldq  xmm3, 8                         # xmm3 = xmm3[8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero
        movdqa  xmm0, xmm2
        pblendvb        xmm3, xmm2, xmm0
        pcmpeqd xmm2, xmm2
        pxor    xmm2, xmm3
        psrldq  xmm3, 1                         # xmm3 = xmm3[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],zero
        paddb   xmm2, xmm3
        movdqa  xmm0, xmm1
        pblendvb        xmm4, xmm2, xmm0
        movdqu  xmmword ptr [rsi], xmm4
        mov     rax, rdi
        not     rax
        add     rax, rdi
        add     rax, rcx
        ret

lemire commented 1 year ago

My benchmark is limited to small inputs (like google.com) so being 3x faster is actually good.

I think that this routine could be made general, and ported to AVX.

Comments invited.

aqrit commented 1 year ago

I like it. Maybe use _mm_min_epi8 instead of _mm_blendv_epi8 ?

lemire commented 1 year ago

Min would work, yes. It appears to have better performance on paper.

lemire commented 1 year ago

Ok. Min is faster...

name_to_dnswire_simd_fast      :   1.82 GB/s  194.1 Ma/s   5.15 ns/d   3.20 GHz  16.47 c/d  52.02 i/d    1.8 c/b   5.56 i/b   3.16 i/c 
name_to_dnswire_simd           :   0.92 GB/s   98.2 Ma/s  10.19 ns/d   3.19 GHz  32.54 c/d  84.02 i/d    3.5 c/b   8.98 i/b   2.58 i/c 
name_to_dnswire                :   0.50 GB/s   52.9 Ma/s  18.89 ns/d   3.19 GHz  60.33 c/d  116.61 i/d    6.4 c/b  12.46 i/b   1.93 i/c

I am not going to repost the code, it is basically the same thing but some blendv are replaced by min instructions. It does not change the instruction count, but we increase the number of instructions per cycle.

The data source for the benchmark are these guys...

std::vector<std::string> database = {
    "google.com",  "live.com",    "netflix.com", "office.com",  "bing.com",
    "apple.com",   "office.net",  "skype.com",   "netflix.net", "msn.com",
    "lencr.org",   "icloud.com",  "youtube.com", "windows.net", "aaplimg.com",
    "c.lencr.org", "outlook.com", "windows.com", "msedge.net",  "yahoo.com",
    "ytimg.com",   "amazon.com",  "pki.goog",    "fbcdn.net",   "azure.com",
    "adnxs.com",   "demdex.net",  "gvt2.com",    "gvt1.com",    "sentry.io",
    "c.bing.com",  "ggpht.com",   "sfx.ms",      "adsrvr.org",  "adobe.com",
    "g.live.com",  "ntp.org",     "criteo.com",  "e2ro.com",    "licdn.com",
};

So if we can be highly efficient over such short strings, I am certain that we can be fast over more general strings (the longer the string, the faster you can go generally).

So now it is a matter of making the code more robust and including AVX2.

Let me restate that the current code assumes that the input fits in 15 characters... handling the more general case is not super hard... but not entirely trivial either.

k0ekk0ek commented 1 year ago

Nice! I'm going to study the code and come back with technical comments, but at this point it's probably also worth sharing some of the intricacies of zone data(?)

Domain names can consist of both a contiguous set of characters and quoted strings. Quoted strings are very uncommon.
Domain names may contain any octet 0-255, the sum of length octets determines the total length.
- Most likely the domain names are also valid hostnames, but common characters are 0-9, a-z, A-Z, -, and _ (latter is not a valid hostname character but used in e.g. SRV RRs).
To include characters that have special meaning, they must be escaped using a backslash.
- Valid escape sequences are \000 or \X. e.g. a ` (dot) can be included as\046or.`.
Domain names, especially in RDATA maybe not the owner, will quickly exceed 15 characters.
Lookup of domain names is case-insensitive, but casing is best retained.
- I'd like to introduce an option in the parser for case normalization eventually.

(background information, for completeness sake, feel free to skip)

An RR (resource record) is expressed as <owner> [CLASS and/or TTL] <rrtype> <RDATA>, e.g. www.example.com. A 192.0.2.1.

Zone files consist of a sequence of RRs. The presentation format (what we parse), is used to express them in plain text. The format is most frequently used for defining zones, but is used in other scenarios too. A zone contains authoritative data for a subtree of the DNS. What that data is like depends on the zone. e.g. .com contains mostly so-called glue data. e.g. NS records that indicate what the authoritative source for example.com is. The example.com zone will contain mostly non-glue data, e.g. the A record for www.example.com.

For brevity sake, users may put $ORIGIN directives in the file. Non-complete, or relative, domain names are automatically postfixed with the $ORIGIN. For example, to specify an NS record for example.com. one might write:

example.com. NS ns1.example.net.
example.com. NS ns2.example.net.

Or:

$ORIGIN com.
example NS ns1.example.net.
example NS ns2.example.net.

Or (owner is copied if line starts with space):

$ORIGIN com.
example NS ns1.example.net.
        NS ns2.example.net.

The latest .com and .se I have (freely accessible, see Reference data in CONTRIBUTING.md) seem to use fully qualified domain names to specify the owner and domain names in RDATA. An older .com I have uses relative names for the owner.

lemire commented 1 year ago

I will be extending the code to go over 15 bytes, obviously.

My recommendation is to provide a fast path that handles the common and simple cases and to use a fallback for the general case.

k0ekk0ek commented 1 year ago

The patch below improves performance for me and allows us to match everything but the delimiters:

diff --git a/2023/08/pending/src/name_to_dnswire.c b/2023/08/pending/src/name_to_dnswire.c
index 04c55e8..03c1f13 100644
--- a/2023/08/pending/src/name_to_dnswire.c
+++ b/2023/08/pending/src/name_to_dnswire.c
@@ -274,30 +274,11 @@ size_t name_to_dnswire_simd_fast(const char *src, uint8_t *dst) {
   // It is likely that many name sfits under 16 bytes, however.

   // We do vectorized classification to validate the content.
-  // We want to allow 0x30 to 0x39 (digits)
-  // The hyphen 0x2d.
-  // The dot 0x2e.
-  // The lower-cased letters 0x61-0x6f (a-o), 0x70-0x7a (p-z)
-  // The upper-cased letters 0x41-0x4f (A-O), 0x50-0x5a (P-Z)
+  // We want to allow everything but unescaped delimiters (see below)

-  const char DIGIT = 0x01;
-  const char HYPHENDOT = 0x02;
-  const char LETTERAO = 0x04;
-  const char LETTERPZ = 0x08;
   static int8_t zero_masks[32] = {0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
                                   0,  0,  0,  0,  0,  -1, -1, -1, -1, -1, -1,
                                   -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
-  __m128i highnibbles =
-      _mm_setr_epi8(0, 0, HYPHENDOT, DIGIT, LETTERAO, LETTERPZ, LETTERAO,
-                    LETTERPZ, 0, 0, 0, 0, 0, 0, 0, 0);
-  __m128i lownibbles =
-      _mm_setr_epi8(LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
-                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
-                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
-                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
-                    LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
-                    LETTERAO | LETTERPZ, LETTERAO, LETTERAO,
-                    LETTERAO | HYPHENDOT, LETTERAO | HYPHENDOT, LETTERAO);
   // we always insert a fake '.' initially.
   __m128i input = _mm_loadu_si128((const __m128i *)src);
   input = _mm_alignr_epi8(input, _mm_set1_epi8('.'), 15);
@@ -305,14 +286,30 @@ size_t name_to_dnswire_simd_fast(const char *src, uint8_t *dst) {

   // We could possibly 'upper case everything' if we wanted to.
   //   __m128i inputlc = _mm_or_si128(input, _mm_set1_epi8(0x20));
-  __m128i low = _mm_shuffle_epi8(
-      lownibbles,
-      input); // no need for _mm_and_si128(input,_mm_set1_epi8(0xf)) because
-              // if high bit is set, there is no match
-  __m128i high = _mm_shuffle_epi8(
-      highnibbles, _mm_and_si128(_mm_srli_epi64(input, 4), _mm_set1_epi8(0xf)));
-  __m128i classified =
-      _mm_cmpeq_epi8(_mm_and_si128(low, high), _mm_setzero_si128());
+
+  // Delimiters for strings consisting of a contiguous set of characters
+  // 0x00        :       end-of-file : 0b0000_0000
+  // 0x20        :             space : 0b0010_0000
+  // 0x22        :             quote : 0b0010_0010
+  // 0x28        :  left parenthesis : 0b0010_1000
+  // 0x29        : right parenthesis : 0b0010_1001
+  // 0x09        :               tab : 0b0000_1001
+  // 0x0a        :         line feed : 0b0000_1010
+  // 0x3b        :         semicolon : 0b0011_1011
+  // 0x0d        :   carriage return : 0b0000_1101
+  const __m128i delimiters = _mm_setr_epi8(
+    0x00, 0x00, 0x22, 0x00, 0x00, 0x00, 0x00, 0x00,
+    0x28, 0x09, 0x0a, 0x3b, 0x00, 0x0d, 0x00, 0x00);
+
+  const __m128i mask = _mm_setr_epi8(
+    -33, -1, -1, -1, -1, -1, -1, -1, -1, -33, -1, -1, -1, -1, -1, -1);
+
+  // construct delimiter pattern for comparison
+  __m128i pattern = _mm_shuffle_epi8(delimiters, input);
+  // clear 5th bit (match 0x20 with 0x00, match 0x29 using 0x09)
+  __m128i inputc = _mm_and_si128(input, _mm_shuffle_epi8(mask, input));
+
+  __m128i classified = _mm_cmpeq_epi8(inputc, pattern);
   // m cannot be zero!!!
   unsigned m = (unsigned)_mm_movemask_epi8(
       classified); // should be 1 wherever we have a match.

lemire commented 1 year ago

The patch below improves performance for me and allows us to match everything but the delimiters:

Sure. Let us go with that.

lemire commented 1 year ago

@k0ekk0ek I almost have the general version (over possibly very large inputs) working. It is going to be fast.

I am still stuck with at least one bug, but it is a small one.

Moving to AVX should help too.

k0ekk0ek commented 1 year ago

Sounds good @lemire! I saw the code is available in your repo, I'll study it asap to see if I can help out.

lemire commented 1 year ago

No need. I need to fix it up first. I just need time.

lemire commented 1 year ago

It works now over general inputs...

$ ./benchmark top-1m.csv 
loaded 1000000 names
average length 23.9394 bytes/name
name_to_dnswire_simd           :   1.58 GB/s   66.0 Ma/s  15.14 ns/d   3.19 GHz  48.26 c/d  109.09 i/d    2.0 c/b   4.56 i/b   2.26 i/c 
name_to_dnswire                :   0.51 GB/s   21.5 Ma/s  46.57 ns/d   3.19 GHz  148.42 c/d  242.74 i/d    6.2 c/b  10.14 i/b   1.64 i/c

k0ekk0ek commented 1 year ago

Finally found time to take a decent look at this. Makes sense, starting from the back. Simplifies the state problem a lot. I love it, great stuff @lemire!

I only did minimum testing, but the returned length is off-by-one and null-labels are accepted. e.g. if I input foo..bar, I get <3>foo<0><3>bar, which should be flagged as an invalid input. For zone data, we cannot just suffix a 0 as not having a dot there means the name is relative, but that's a quick fix.

It's probably worth keeping the current implementation in there for reference? At least, it'd be good to see how it stacks up with different iterations.

Lastly, and this is really a note to self, but it may be worth reintroducing the length with the token. For basically every input we're now first determining the length and only then can we proceed to do the actual operations we want. Before, name parsing was modeled after string parsing and we did a best effort thing. So, copy, then see if we need to add less than , not having a length kinda made sense. But now that @lemire has (very likely) cracked faster parsing of names, I should really take another look. I'm sure storing both indexes in the same tape is not going to work, but it may work to keep two tapes (simdjson jargon). So, one for the start, one for the end. We can do all logic based on the start index and make sure there's always a terminating index (we guarantee that now too).

lemire commented 1 year ago

It's probably worth keeping the current implementation in there for reference? At least, it'd be good to see how it stacks up with different iterations.

Would you produce a pull request?

k0ekk0ek commented 1 year ago

The newer algorithm is faster, even with real-world data, see PR#81.

aqrit commented 1 year ago

@k0ekk0ek

The patch below improves performance for me and allows us to match everything but the delimiters

That detects 0x80..0xFF as delimiters, is that intentional? If that is not wanted then try:

    const __m128i dot = _mm_set1_epi8(0x2E);
    const __m128i table = _mm_setr_epi8(
        0x20, 0x00, 0x22, 0x0D, 0x0A, 0x00, 0x00, 0x09,
        0x28, 0x29, 0x00, 0x3B, 0x00, 0x00, 0x00, 0x00
    );

    __m128i key = _mm_max_epu8(_mm_xor_si128(dot, v), v);
    __m128i is_delimiter = _mm_cmpeq_epi8(_mm_shuffle_epi8(table, key), v);

It adds 1 instruction to the critical path.

lemire commented 1 year ago

Blog post: Coding of domain names to wire format at gigabytes per second.

Code: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/08/09 (new location)

The code does include a draft of @k0ekk0ek 's implementation, but it is not discussed in the blog post.

k0ekk0ek commented 1 year ago

Thanks @aqrit! The spec merely states all delimiters must be quoted, as a result we treat all non-delimiter bytes as valid input. In other words, it's a bug :sweat_smile:

Thanks @lemire! I liked the post and it shows there's ways to improve, even if it's not perfect yet. Hopefully readers will have additional suggestions(?)

I'll work on this as soon as I get a chance (my availability for the next two weeks is minimal). Your input and suggestions are much appreciated, as always.

k0ekk0ek commented 1 year ago

I'll try to see if I can efficiently get the length back. If it's feasible I'll create a new ticket outlining the idea. I'll also try to integrate @lemire's algorithm (work on detection internal null-labels for a bit too). @lemire, one of the reactions to your post outlines another idea, did you have a chance to work on it?

lemire commented 1 year ago

@k0ekk0ek Kendall was making a reference to what I call the prefix-minimum approach. It is faster, but it does not include validation. His approach could potentially save a few instructions, maybe, on the prefix-minimum approach, but it won't get you validation, which I expect you want to have. It is also weirder code, harder to maintain. I would discourage you from considering it.

I recommend the simdjson-like approach (i.e., name_to_dnswire_idx_avx): https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/a03cf57214b7b3f736d0870b1f1fa4923bd71df2/2023/08/09/src/name_to_dnswire.c#L623

It is fast and it includes full validation. And it is maintainable. My code is not the nicest in the world, but it is not bad even as is.

k0ekk0ek commented 1 year ago

@lemire, thank you for clarifying Kendall's approach. Interesting name_to_dnswire_idx_avx is very similar to the current code, but somewhat faster. Especially when benchmarking with the top1m list (1.79GB/s vs 1.34GB/s). I think that's mainly due to elimination of branches. I'll give it a good look.

EDIT: Ah, yes, illegal sequences like .. (null labels) are caught using a simple shift+and. That's a good way to check for that particular input error once vs once every label.

lemire commented 1 year ago

@k0ekk0ek Yes, it is similar and uses much of your own code. But I think it reflects the fact that it might be the right overall design.

k0ekk0ek commented 1 year ago

I think it's the right approach indeed. I'll see if I can clean it up a little and issue a PR against your repo so we have benchmarks.

lemire commented 1 year ago

Fantastic!

k0ekk0ek commented 1 year ago

This is what I have so far (issue a PR too). It works sort-off the same as @lemire's recommendation, but it's easier to integrate handling of escape sequences (i.e. \X and \DDD). It's still somewhat slower than name_to_dnswire_idx_avx though (and not yet complete). But posting it here early to see if @lemire or @aqrit have ideas on how to improve further.

#define likely(params) __builtin_expect(!!(params), 1)
#define unlikely(params) __builtin_expect(!!(params), 0)

// simplified version of name_to_dnswire_idx_avx
size_t name_to_dnswire_loop(const char *src, uint8_t *dst)
{
  const char *text = src;
  uint8_t *octets = dst, *wire = octets + 1;
  uint64_t label = 0;

  octets[label] = 0;

  // real world domain names quickly exceed 16 octets (www.example.com is
  // encoded as 3www7example3com0, or 18 octets), but rarely exceed 32
  // octets. encode in 32-byte blocks.
  __m256i input = _mm256_loadu_si256((const __m256i *)text);
  _mm256_storeu_si256((__m256i *)wire, input);

  const __m256i dot = _mm256_set1_epi8('.');

  uint64_t delimiter = delimiter_mask_avx(input);
  uint64_t dots = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(input, dot));

  uint64_t limit = delimiter | (1llu << 32);
  uint64_t length = _tzcnt_u64(limit);

  if (unlikely(dots & 1llu)) // root (single ".") is a special case
    return length == 1;

  // FIXME: account for escape sequences

  dots &= (1llu << length) - 1;
  text += length;
  wire += length;
  // check for null labels, i.e. ".."
  if (unlikely(dots & (dots >> 1)))
    return 0;

  if (dots) {
    uint64_t base, count = _tzcnt_u64(dots);
    dots &= dots - 1;
    octets[label] = (uint8_t)count;
    label += count + 1;

    while (dots) {
      base = count;
      count = _tzcnt_u64(dots);
      const uint64_t diff = count - base;
      dots &= dots - 1;
      octets[label] = (uint8_t)(diff - 1);
      label += diff;
    }

    octets[label] = (uint8_t)((length - count) - 1);
  } else {
    octets[label] = (uint8_t)length;
  }

  if (likely(delimiter))
    return length + 1;

  // labels in domain names are limited to 63 octets. track length octets
  // (dots) in 64-bit wide bitmap. shift by length of block last copied to
  // detect null-labels and labels exceeding 63 octets (zero)
  uint64_t labels = (dots << (64 - length)) | ((1llu << 63) >> length);

  do {
    input = _mm256_loadu_si256((const __m256i *)text);
    _mm256_storeu_si256((__m256i *)wire, input);

    delimiter = delimiter_mask_avx(input);
    dots = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(input, dot));

    limit = delimiter | (1llu << 32);
    length = _tzcnt_u64(limit);

    // FIXME: account for escape sequences
    dots &= (1llu << length) - 1;
    text += length;
    wire += length;
    labels = (dots << (64 - length)) | (labels >> length);
    // check for null labels, i.e. ".."
    if (unlikely(labels & (labels >> 1)))
      return 0;

    if (dots) {
      uint64_t base, count = _tzcnt_u64(dots);
      dots &= dots - 1;
      octets[label] += (uint8_t)count;
      label += count + 1;

      while (dots) {
        base = count;
        count = _tzcnt_u64(dots);
        const uint64_t diff = count - base;
        dots &= dots - 1;
        octets[label] = (uint8_t)(diff - 1);
        label += diff;
      }

      octets[label] = (uint8_t)((length - count) - 1);
    } else {
      // check if label exceeds 63 octets
      if (!labels)
        return 0;
      octets[label] += (uint8_t)length;
    }
  } while (!delimiter);

  return (size_t)(wire - dst);
}

EDIT: I should note that the 32-byte block is a guestimate, 64-byte blocks work better for top-1m.csv, but I suspect 32-bytes is better overall. It's a hunch, I've not measured.

lemire commented 1 year ago

Reproducing here the gist of the results... (name_to_dnswire_idx_avx is from my blog post)

name_to_dnswire_idx_avx        :   1.73 GB/s   72.2 Ma/s  13.85 ns/d   3.19 GHz  44.14 c/d  77.31 i/d    1.8 c/b   3.23 i/b   1.75 i/c 
name_to_dnswire_loop           :   1.44 GB/s   60.3 Ma/s  16.59 ns/d   3.19 GHz  52.87 c/d  92.20 i/d    2.2 c/b   3.85 i/b   1.74 i/c

So your approach use about 15 extra instruction per parsed input. Overall, it is a performance difference of 20%.

It probably does not matter. I think you can go with your approach and it should serve you well. I don't think you should be anxious about a 20% difference in a microbenchmark.

k0ekk0ek commented 1 year ago

@lemire, I updated my code to give it a slight boost and did a quick test with all owners from a .se zone (PR issued) in that scenario it seems faster (to my surprise :sweat_smile: ). And, like I stated on the PR, the prefix-minimum is actually slower than any of the simdjson-like approaches for just benchmark and benchmark se.csv, confirming that it is the right overall design(?). I'll get the code integrated into simdzone. Hopefully it'll give us another nice boost.

lemire commented 1 year ago

@k0ekk0ek Would you give me access to this .se data file? I'd like to have an archive of it so that if better designs come up, we can test it out on this alternative dataset.

k0ekk0ek commented 1 year ago

@lemire, you can obtain a current copy via zone transfer. From the looks of it, the data is very similar to the one I have, meaning lots of DNSSEC related data and the usual glue RRs. CONTRIBUTING.md in this repository also links to the Centralized Zone Data Service (CZDS), which allows you to download zone data for many more TLDs (.com included). For my test, I simply took the owner (first field) and prepended it with a number and a comma, i.e. 1,<owner>.

lemire commented 1 year ago

Good! Noted.

NLnetLabs / simdzone

Faster parsing of domain names #66