Open k0ekk0ek opened 1 year ago
@k0ekk0ek Are spaces allowed within the codes? That's a big limiting factor pre-AVX-512.
Sort-of, but I don't think they're an issue(?) Spaces (among others) are delimiters within zone files so we get a separate token for each block of characters (we require this data to be specified as contiguous, so no spaces inside the token). However, we will need to track state because there may be multiple tokens (at least for RRSIG (presentation format)).
An example of how this occurs in zone data:
host.example.com. 86400 IN RRSIG A 5 3 86400 20030322173103 (
20030220173103 2642 example.com.
oJB1W6WNGv+ldvQ3WDG0MQkg5IEhjRip8WTr
PYGv07h108dUKGMeDPKijVCHX3DDKdfb+v6o
B9wfuh3DTJXUAfI/M0zmO/zz8bW0Rznl8O3t
GNazPwQKkRN20XPXV6nwwfoXmJQbsLNrLfkG
J5D6fwFm8nN+6pBzeDQfsS3Ap3o= )
Looking at .se
they cut the signature up into blocks of 56 characters (without (
... )
, so one line), while .com
seems to keep it as one set of contiguous characters.
This is an important problem. Node.js has a related issue.
See https://github.com/nodejs/performance/issues/128
I want to work on solving this issue as soon as I can.
Sounds good @lemire! The client9 decoders are probably good to have as fallback(?) At least the base64 one got adopted and modified by @aklomp. The same can be done for base16 (hex). At least, I am planning to do so. The stream decoding interface fits simdzone nicely. Something I noticed in the modified version is that whole blocks need to be written. So the base64 AVX2 version requires at least 45 bytes of input (8 bytes for the fallback) to ensure proper output. Since simdzone guarantees enough padding in input and output buffers, we can perhaps let that requirement go. That may make things speedier/easier(?) Of course, this may or may not apply to Node.js. I'm currently focused on getting the SVCB RRTYPE done, after that testing and api cleanup to get to a first release, but if you need me to look at something, I'd be happy to help!
The main challenge I am concerned with are the spaces in the input… The overflows can be handled in some way but I agree that simdzone can benefit of faster routines than other systems.
Of course, this may or may not apply to Node.js
I am not sure how much of a difference this makes. It is an empirical issue and it is easy to implement both padded and strict versions.
I/we'd have to test. Adding the faster base64 decoder made a big difference for the .se
zone. Going by that example, any TLD zone that has widespread DNSSEC usage (such as .nl
) will benefit. A really simple test would be to replace consecutive dec_loop_generic_32_inner
calls in dec_loop_generic_32
by one (or more) SIMD function replacements. Also note that we don't have to account for spaces in the input. That part is handled automatically because they're separate tokens (base64 is to be written as <contiguous>
not <quoted>
, except for svcb parameters, but there we simply do not allow spaces), so we can quite easily select the right decoder based on token length.
I'd have to study the library a bit better, but the loop currently selects for rounds 8, 4, 2 and 1. As 8x4 = 32 and 4x4 = 16, we could simply add AVX2 and SSE versions there.
I'll be working on this (starting 'now'). My current 'vision' is to make it configurable so one can explore the usage space.
Sounds good :+1:
So I think I solved the problem, although maybe not for simdzone directly.
The way the problem appears in simdzone is somewhat non-standard and I wanted to work on a more generic problem. So I have worked on the WHATWG forgiving-base64 problem: you must decode base64 in a string but there might be arbitrary ASCII spaces all over the string. It is part of the latest release of simdutf.
We have currently integrating it in Node.js with decent results: https://github.com/nodejs/node/pull/52428 It passes all the tests in simdutf and in Node.js, so it is pretty solid and it is fast.
Whether it applies to simdzone as is is trickier because simdzone has the pattern where the location of the white spaces are known before you start parsing.
So maybe it is worth experimenting.
Thanks for the update @lemire. Looked at the numbers, that's another impressive boost!
(sorry for the overlong/overcomplete reply)
Indeed, simdzone tokenizes based on whitespace (amongst other delimiters), so situation is a little different. The work may still be very useful. There are quite some RR TYPEs where the RDATA section contains base64 data. e.g. OPENPGPKEY
, DNSKEY
, RRSIG
, etc. While it may be possible to pop all tokens of the tape and use the work to decode those sequences, simdzone does not read the entire file at once, but in blocks. Popping a token may lead to the parser flushing the buffer and proceding to read the next section. The scanner merely guarantees the complete token is available until the next one is read. Of course, it's entirely possible to alter tape operation for these types of sequences. Whether that's worth the effort is something to investigate. Another complexity is the possible uses of parentheses to wrap newlines. For example, this is a valid RRSIG
RR (RFC4034 section 3.2):
host.example.com. 86400 IN RRSIG A 5 3 86400 20030322173103 (
20030220173103 2642 example.com.
oJB1W6WNGv+ldvQ3WDG0MQkg5IEhjRip8WTr
PYGv07h108dUKGMeDPKijVCHX3DDKdfb+v6o
B9wfuh3DTJXUAfI/M0zmO/zz8bW0Rznl8O3t
GNazPwQKkRN20XPXV6nwwfoXmJQbsLNrLfkG
J5D6fwFm8nN+6pBzeDQfsS3Ap3o= )
However, the RDATA section may use parentheses, without nesting, anywhere. i.e. this is valid too:
host.example.com. 86400 IN RRSIG A 5 3 86400 20030322173103 (
20030220173103 2642 example.com.
oJB1W6WNGv+ldvQ3WDG0MQkg5IEhjRip8WTr
) ( PYGv07h108dUKGMeDPKijVCHX3DDKdfb+v6o
) ( B9wfuh3DTJXUAfI/M0zmO/zz8bW0Rznl8O3t
) ( GNazPwQKkRN20XPXV6nwwfoXmJQbsLNrLfkG
) ( J5D6fwFm8nN+6pBzeDQfsS3Ap3o= )
Of course, no one ever writes/generates the latter, but it is something to consider.
Base64 encoding is used quite a lot. i.e. in
RRSIG
andDNSKEY
records. Wojciech Muła and Daniel Lemire wrote the paper "Faster Base64 Encoding and Decoding Using AVX2 Instructions" in which they outline use of SIMD instructions to improve performance. The current implementation is very basic and improving it is extremely likely to result in a significant performance boost. More research is available too and even improving the current scalar implementation is a big gain.