sigmaSd opened 2 months ago

These functions use `atob` and `btoa` internally, and they OOM easily with large strings. Maybe they should instead use `node:buffer` (`Buffer.from(input).toString("base64")` / `Buffer.from(input, "base64")`).
We wouldn't use `node:buffer`, in order to maintain browser compatibility. Either way, any PRs that improve the performance of these functions are welcome.
Only `atob` is being used, not `btoa`. Also, it's not being used in the `btoa(String.fromCharCode(...bytes))` manner that I typically associate with call stack size issues. Do you mean to say you are actually running out of heap memory? How large is "large"?
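(For reference, a minimal sketch of the pattern being alluded to, hypothetical code not taken from std: spreading a large byte array into `String.fromCharCode` passes one argument per byte, which can exceed the engine's argument limit.)

```ts
const bytes = new Uint8Array(1_000_000);
// Each byte becomes a separate function argument, so large inputs throw
// "RangeError: Maximum call stack size exceeded" in most engines:
// btoa(String.fromCharCode(...bytes));
```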
```ts
import { encodeBase64 } from "jsr:@std/encoding/base64";

encodeBase64(Deno.readFileSync(Deno.execPath()));
```
This crashes with an out-of-memory error. If I use `node:buffer` instead, it finishes in a couple of milliseconds.
It seems that the memory exhaustion is likely coming from all of the string appending. I imagine that this causes a memory allocation with every new character that is appended. I have no idea if this change makes it as fast as Node (probably not), but at the very least it eliminates unnecessary memory allocations.
>>> NOT TESTED <<<

```ts
import { validateBinaryLike } from "./_validate_binary_like.ts"; // std's internal helper (path may differ)

const decoder = new TextDecoder();

// The 64 Base64 alphabet characters as byte values
// ('A'-'Z', 'a'-'z', '0'-'9', '+', '/').
const base64abc = Uint8Array.from([
  65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
  80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 97, 98, 99, 100,
  101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
  113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 48, 49,
  50, 51, 52, 53, 54, 55, 56, 57, 43, 47,
]);

export function encodeBase64(data: ArrayBuffer | Uint8Array | string): string {
  // CREDIT: https://gist.github.com/enepomnyaschih/72c423f727d395eeaa09697058238727
  const uint8 = validateBinaryLike(data);
  // Preallocate the exact output size: 4 output bytes per 3 input bytes.
  const result = new Uint8Array(Math.floor((uint8.length + 2) / 3) * 4);
  let i;
  let j = 0;
  const l = uint8.length;
  for (i = 2; i < l; i += 3) {
    result[j++] = base64abc[uint8[i - 2] >> 2];
    result[j++] = base64abc[((uint8[i - 2] & 0x03) << 4) | (uint8[i - 1] >> 4)];
    result[j++] = base64abc[((uint8[i - 1] & 0x0f) << 2) | (uint8[i] >> 6)];
    result[j++] = base64abc[uint8[i] & 0x3f];
  }
  if (i === l + 1) {
    // 1 octet yet to write
    result[j++] = base64abc[uint8[i - 2] >> 2];
    result[j++] = base64abc[(uint8[i - 2] & 0x03) << 4];
    result[j++] = 61; // '='
    result[j++] = 61; // '='
  }
  if (i === l) {
    // 2 octets yet to write
    result[j++] = base64abc[uint8[i - 2] >> 2];
    result[j++] = base64abc[((uint8[i - 2] & 0x03) << 4) | (uint8[i - 1] >> 4)];
    result[j++] = base64abc[(uint8[i - 1] & 0x0f) << 2];
    result[j++] = 61; // '='
  }
  return decoder.decode(result);
}
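```

A quick sanity check one might run on small inputs (hypothetical snippet, comparing against `btoa` on a short well-known string):

```ts
const sample = new TextEncoder().encode("hello world");
console.log(encodeBase64(sample)); // "aGVsbG8gd29ybGQ="
console.log(encodeBase64(sample) === btoa("hello world")); // true
```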
What about running the Node code if we're in a server context, and the current version (or an optimized one) if we detect we're in the browser? Also, is it OK for std to use Wasm libraries? In that case, even the browser branch could be made efficient. A rough sketch of the branching idea is below.
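(A minimal sketch of the kind of branching being suggested, hypothetical and not std's actual code, assuming `node:buffer` resolves in the server runtime and falling back to the pure-JS path elsewhere:)

```ts
import { encodeBase64 as encodeBase64Js } from "jsr:@std/encoding/base64";

// Hypothetical: use node:buffer when a server runtime is detected,
// otherwise fall back to the pure-JS implementation.
export async function encodeBase64Fast(data: Uint8Array): Promise<string> {
  if ("Deno" in globalThis || "process" in globalThis) {
    const { Buffer } = await import("node:buffer");
    return Buffer.from(data).toString("base64");
  }
  return encodeBase64Js(data); // browser fallback
}
```

(Note the dynamic import makes the function async, which is itself an API change; that trade-off is part of why environment branching is contentious.)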
On large buffers, it seems like Node's `Buffer.toString("base64")` is about 6-7 times faster than the code I provided.
```ts
import { encodeBase64 } from "jsr:@std/encoding/base64";

encodeBase64(Deno.readFileSync(Deno.execPath()));
```

@FergoTheGreat Can you check that the above snippet passes with your change? If it passes, it would be far better than the current state.
Personally, I think you should just use Wasm for this; it's the best use case for Wasm (pure math work), and I imagine it's going to be faster than any optimized JS version. https://docs.rs/base64/latest/base64/ should work well. Also, can you confirm whether it's possible or a good idea to use Node APIs conditionally when we detect we're not in the browser?
> ```ts
> import { encodeBase64 } from "jsr:@std/encoding/base64";
>
> encodeBase64(Deno.readFileSync(Deno.execPath()));
> ```
>
> @FergoTheGreat Can you check that the above snippet passes with your change? If it passes, it would be far better than the current state.
With that change, the only limiting factor is that `decoder.decode` cannot produce a string with a length larger than 2**29 - 24, meaning that it will fail for buffers larger than 402,653,166 bytes.
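(For clarity, the byte figure follows from the string cap, given 4 Base64 characters per 3 input bytes:)

```ts
const maxStringLength = 2 ** 29 - 24; // 536,870,888 (V8's maximum string length)
const maxInputBytes = (maxStringLength / 4) * 3; // 402,653,166 input bytes
```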
> Also, can you confirm whether it's possible or a good idea to use Node APIs conditionally when we detect we're not in the browser?

I'm not in favor of changing the implementation completely based on the environment.
Looking at the Deno implementation, `atob` and `Buffer.from(string, "base64")` are backed by very similar ops: https://github.com/denoland/deno/blob/f460188e583f00144000aa0d8ade08218d47c3c1/ext/web/lib.rs#L131-L144. I'm not sure switching to `Buffer.from` would boost the performance significantly.
Another point is that Base64 encoding and decoding are now planned as a language feature (it seems to already be at Stage 3): https://github.com/tc39/proposal-arraybuffer-base64. Waiting for this to land in the CLI would also be an option.
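(For reference, a sketch of what the proposed API would look like once it ships; method names are per the proposal's README and are not yet available in stable runtimes at the time of writing:)

```ts
const bytes = new TextEncoder().encode("hello");
const b64 = bytes.toBase64(); // "aGVsbG8="
const roundTrip = Uint8Array.fromBase64(b64); // back to the original bytes
```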
Here is a benchmark:

```ts
import {
  decodeBase64 as decodeBase64Wasm,
  encodeBase64 as encodeBase64Wasm,
} from "jsr:@sigma/rust-base64";
import {
  decodeBase64 as decodeBase64Std,
  encodeBase64 as encodeBase64Std,
} from "jsr:@std/encoding";
import { Buffer } from "node:buffer";

const length = Math.pow(2, 20);
const input = "a".repeat(length);
const encodedInput = new TextEncoder().encode(input);
const base64Input = encodeBase64Wasm(encodedInput);
const encodedBase64Input = new TextEncoder().encode(base64Input);

Deno.bench("encodeBase64Wasm", { group: "encoding" }, () => {
  encodeBase64Wasm(encodedInput);
});

Deno.bench("node:buffer:encode", { group: "encoding" }, () => {
  Buffer.from(input).toString("base64");
});

Deno.bench("encodeBase64Std", { group: "encoding" }, () => {
  encodeBase64Std(input);
});

Deno.bench("btoa", { group: "encoding", baseline: true }, () => {
  btoa(base64Input);
});

Deno.bench("decodeBase64Wasm", { group: "decoding" }, () => {
  decodeBase64Wasm(encodedBase64Input);
});

Deno.bench("node:buffer:decode", { group: "decoding" }, () => {
  Buffer.from(base64Input, "base64");
});

Deno.bench("decodeBase64Std", { group: "decoding" }, () => {
  decodeBase64Std(base64Input);
});

Deno.bench("atob", { group: "decoding", baseline: true }, () => {
  atob(base64Input);
});
```
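(If reproducing: save the snippet to a file and run it with `deno bench <file>.ts`; absolute numbers will of course vary by machine.)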
It seems like the only severe outlier is std's `encodeBase64`. I also tried @FergoTheGreat's implementation, and indeed it doesn't crash with large buffers; in my benchmark it's on par with `btoa`.
It seems @FergoTheGreat's implementation is quite the improvement