Byte utilities: reduce duplication, use optimized technique

paulmillr commented 1 month ago

While browsing single-file-evm.js, i've noticed there are at least a bunch of duplicate hexToBytes and concatBytes methods. While duplication is not a big deal, it could be avoided.

But the issue is about optimization. noble exposes hexToBytes in its code. I see ethereumjs implemented custom version using Regular Expressions, which are 2x slower by my benchmarks. On larger inputs, the difference is 3x.

Overall, noble utilities could be used in a following way:

import { hexToBytes as unprefixedHexToBytes } from '@noble/hashes/utils';
function hexToBytes(hex) {
  if (!hex.startsWith('0x')) throw new Error('hex string must start with 0x');
  return unprefixedHexToBytes(hex.slice(2));
}

Let me know if you want it to be exposed from ethereum-cryptography, which utils are already using.

acolytec3 commented 1 month ago

Let me do some benchmarking of our version vs noble. Definitely open to it if it proves out that yours is faster (since we do boatloads of hex <-> bytes conversions. I think the regex in ours is just to check for non hex characters so could potentially be removed if it was the sole source of drag.

acolytec3 commented 1 month ago

Victory - noble

acolytec3 commented 1 month ago

I added a second test for bytes2Hex.

Victory - noble

acolytec3 commented 1 month ago

Here's the benchmark script (using the vitest benchmark feature) if anybody else wants to run it for comparison.

import { bytesToHex as nobleB2H, hexToBytes as nobleH2B } from '@noble/curves/abstract/utils'
import { bench, describe } from 'vitest'

import { bytesToHex, randomBytes, unprefixedHexToBytes } from '../src/bytes.js'

describe('h2b benchmarks', () => {
  bench('noble', () => {
    nobleH2B('0123456789abcdef')
  })
  bench('ethjs', () => {
    unprefixedHexToBytes('0123456789abcdef')
  })
})

describe('b2h benchmarks', () => {
  const bytes = randomBytes(32)
  bench('noble', () => {
    nobleB2H(bytes)
  })
  bench('ethjs', () => {
    bytesToHex(bytes)
  })
})

acolytec3 commented 1 month ago

I will say that we do intentionally duplicate the bytes conversion utilities inside our rlp package because we want that to be an entirely standalone package with no dependencies. Maybe we could borrow your variants for there so they're faster.

Beyond that, we already import ethereum-cryptography in util (where our other bytes2Hex and hex2Bytes functions live) so we've already implicitly got noble/curves there so easy to add it as a direct dependency I guess.

@holgerd77 What do you think? We'd get looks like a 1.5-2x speed improvement using the noble versions of these functions which could be hugely beneficial given the variety of places we do these bytes/hex conversions.

acolytec3 commented 1 month ago

@paulmillr I looked a little further and it looks like ethereum-cryptography already re-exports the bytesToHex/hexToBytes functions from @noble/hashes. I'm assuming this is the same version of the utilites as in noble/curves? If so, I think we can probably safely just re-export them for use in our own libraries.

paulmillr commented 1 month ago

The exported util from jsec is not the same. It allows 0x, but it doesn’t validate for 0x presence.

So there will need to be added a direct noble reexport.

jochem-brouwer commented 1 month ago

Just dropping this here, we added regex check to be sure that the hex bytes were properly formatted, see: https://github.com/ethereumjs/ethereumjs-monorepo/pull/3185

We can add this for optimization, but then only in areas where we are 100% sure that the string is correctly formatted.

jochem-brouwer commented 1 month ago

Oh, interesting, it looks like the noble method actually does the hex check inside the method. In that case we should likely change!

paulmillr commented 1 month ago

@acolytec3 I agree that rlp should duplicate h2b code if the goal is no-unnecessary-deps

paulmillr commented 1 month ago

Maybe just copy-paste this code until the next release of eth-crypto (which would be in october)

// Needs `isBytes()` function

// We use optimized technique to convert hex string to byte array
const asciis = { _0: 48, _9: 57, _A: 65, _F: 70, _a: 97, _f: 102 } as const;
function asciiToBase16(char: number): number | undefined {
  if (char >= asciis._0 && char <= asciis._9) return char - asciis._0;
  if (char >= asciis._A && char <= asciis._F) return char - (asciis._A - 10);
  if (char >= asciis._a && char <= asciis._f) return char - (asciis._a - 10);
  return;
}

/**
 * @example hexToBytesUnprefixed('cafe0123') // Uint8Array.from([0xca, 0xfe, 0x01, 0x23])
 */
export function hexToBytesUnprefixed(hex: string): Uint8Array {
  if (typeof hex !== 'string') throw new Error('hex string expected, got ' + typeof hex);
  const hl = hex.length;
  const al = hl / 2;
  if (hl % 2) throw new Error('padded hex string expected, got unpadded hex of length ' + hl);
  const array = new Uint8Array(al);
  for (let ai = 0, hi = 0; ai < al; ai++, hi += 2) {
    const n1 = asciiToBase16(hex.charCodeAt(hi));
    const n2 = asciiToBase16(hex.charCodeAt(hi + 1));
    if (n1 === undefined || n2 === undefined) {
      const char = hex[hi] + hex[hi + 1];
      throw new Error('hex string expected, got non-hex character "' + char + '" at index ' + hi);
    }
    array[ai] = n1 * 16 + n2;
  }
  return array;
}

// Array where index 0xf0 (240) is mapped to string 'f0'
const hexes = /* @__PURE__ */ Array.from({ length: 256 }, (_, i) =>
  i.toString(16).padStart(2, '0')
);
/**
 * @example bytesToHex(Uint8Array.from([0xca, 0xfe, 0x01, 0x23])) // 'cafe0123'
 */
export function bytesToHex(bytes: Uint8Array): string {
  if (!isBytes(bytes)) throw new Error('bytes expected');
  // pre-caching improves the speed 6x
  let hex = '';
  for (let i = 0; i < bytes.length; i++) {
    hex += hexes[bytes[i]];
  }
  return hex;
}

holgerd77 commented 1 month ago

Ok, yeah, these are impressive (benchmark) results! 🤩

Using Pauls Versions

I would be strongly in favor that we use Paul's code here, but I would also really like to stay in our own convention with always-prefixed-hex, we put so much effort into that to align this.

So I would (also) suggest that we use the versions from ethereum-cryptography/@noble/curves in Util but then wrap this in our own methods and re-export, so that we can keep the names and eventually add additional hex prefix checks if neceesary.

Benchmarks Integration

Great that your on this "benchmark trip", can you directly add this to Util? (so at least the EthereumJS ones, but maybe also with Noble depending on the dependency situation) Bit more written out description would also be nice. 🙂

It would be great if we generally establish a common location for new (vitest) benchmarks.

I would cautiously think we might even want to put this directly under test? This is so strongly related (one might also combine functionality here), also this higher level benchmarks folders feel so very "off" (maybe it's also because they are so "far away" from the other folders due to alphabetical sorting?) and one tends to forget about these.

And in the test folder, this is a natural working directory anyhow, so bigger chances to stumble upon the benchmarks every now and then! 😄

So I would suggest: test/bench

Does that make sense? Open for other suggestions though!

acolytec3 commented 1 month ago

Benchmarks are tricky once we merge the PR because we no longer have the internal version of the code once we switch to using Noble stuff internally.

holgerd77 commented 1 month ago

Benchmarks are tricky once we merge the PR because we no longer have the internal version of the code once we switch to using Noble stuff internally.

Just take only the version we have I would suggest (even if a bit trivial)

jochem-brouwer commented 1 month ago

@acolytec3 is this closed by https://github.com/ethereumjs/ethereumjs-monorepo/pull/3698 ?

acolytec3 commented 1 month ago

Yes

ethereumjs / ethereumjs-monorepo

Byte utilities: reduce duplication, use optimized technique #3693

Using Pauls Versions

Benchmarks Integration