Multithreading in Node.js: Using Atomics for Safe Shared Memory Operations #8

Open innocces opened 1 week ago

innocces commented 1 week ago

The original blog info

subject   content
title     Multithreading in Node.js: Using Atomics for Safe Shared Memory Operations
url       blog url
author    Pavel Romanov

innocces commented 1 week ago

Node.js developers have grown comfortable with JavaScript running on a single thread. Even after worker_threads introduced multiple threads, you can still feel pretty safe.

However, things change when multiple threads start sharing resources. Multithreaded programming is one of the most challenging topics in all of software engineering.

Thankfully, JavaScript provides a built-in abstraction to mitigate the problem of shared resources across multiple threads. This mechanism is called Atomics.

In this article, you'll learn what shared resources look like in Node.js and how the Atomics API helps us prevent race conditions.

Shared memory between multiple threads

Let's start with understanding what transferable objects are.

Transferable objects are objects that can be transferred from one execution context to another without holding on to resources from the original context.

An execution context is a place where JavaScript code can be executed. To make it easier to understand, let's assume that an execution context is equal to a worker thread because each thread is indeed a separate execution context.

For example, ArrayBuffer is a transferable object. It consists of 2 parts: the raw allocated memory and a JavaScript handle to this memory. You can read the article about Buffers in JavaScript to learn more about this topic.

Whenever we pass an ArrayBuffer from the main thread to a worker thread, the worker gets its own JavaScript object. If the buffer is copied, the worker also gets its own memory. If the buffer is transferred, the memory moves to the worker and the original buffer is detached. Either way, the two threads can never access the same object reference or the same underlying memory at the same time.
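The difference between copying and transferring an ArrayBuffer can be sketched as follows. This is only an illustration: it uses the global structuredClone (available in recent Node.js versions) as a single-threaded stand-in for passing the buffer to a worker, and the variable names are made up for the example rather than taken from the original article.

```javascript
// Copy: the clone owns separate memory, so writes to it do not reach the original.
const original = new ArrayBuffer(4);
new Uint8Array(original)[0] = 42;
const copy = structuredClone(original);
new Uint8Array(copy)[0] = 7;
const originalFirstByte = new Uint8Array(original)[0];

// Transfer: the memory moves to the new handle and the source is detached.
const source = new ArrayBuffer(4);
const moved = structuredClone(source, { transfer: [source] });
const sourceLengthAfterTransfer = source.byteLength;
const movedLength = moved.byteLength;

console.log({ originalFirstByte, sourceLengthAfterTransfer, movedLength });
// { originalFirstByte: 42, sourceLengthAfterTransfer: 0, movedLength: 4 }
```

In both cases only one side can use a given block of memory at a time, which is why a plain ArrayBuffer never produces a data race.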

The only way to share memory between different threads is to use SharedArrayBuffer.

As the name suggests, it is designed to be shared. We consider this buffer to be a non-transferable object. If you pass a SharedArrayBuffer from the main thread to a worker thread, only the JavaScript object gets recreated, but the memory region that it refers to is the same.

While SharedArrayBuffer is a unique and powerful API, it comes with a cost.
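A small sketch of that behavior, under the assumption that posting the buffer through a MessageChannel in a single thread is a fair stand-in for handing it to a worker (both go through structured cloning). The variable names are illustrative and not part of the original article.

```javascript
import { MessageChannel, receiveMessageOnPort } from 'node:worker_threads';

// Sender side: allocate one byte of shared memory.
const shared = new SharedArrayBuffer(1);

// Post the buffer through a message port.
const { port1, port2 } = new MessageChannel();
port1.postMessage(shared);

// Receiver side: read the queued message synchronously.
const { message: received } = receiveMessageOnPort(port2);

// A write through the received handle is visible through the sender's handle,
// because both handles refer to the same memory region.
new Int8Array(received)[0] = 5;
const seenBySender = new Int8Array(shared)[0];

console.log({ seenBySender }); // { seenBySender: 5 }
port1.close();
```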

As Uncle Ben told us: with great power comes great responsibility.

When we share resources between multiple threads, we expose ourselves to a whole new world of nasty race conditions.

Race conditions for shared resources

It would be easier to understand what I'm talking about with a particular example.

import { Worker, isMainThread } from 'node:worker_threads';

if (isMainThread) {
  new Worker(import.meta.filename);
  new Worker(import.meta.filename);
} else {
  // worker code
}

We're using the same file to run the main thread and the worker threads. The block under the isMainThread condition is executed only in the main thread. You might also notice import.meta.filename: it is the ES module alternative to the __filename variable, available since Node 20.11.0. Next, we introduce a shared resource and an operation on it.

import { Worker, isMainThread, workerData, threadId } from 'node:worker_threads';

if (isMainThread) {
  const buffer = new SharedArrayBuffer(1);
  new Worker(import.meta.filename, { workerData: buffer });
  new Worker(import.meta.filename, { workerData: buffer });
} else {
  const typedArray = new Int8Array(workerData);
  typedArray[0] = threadId;
  console.dir({ threadId, value: typedArray[0] });
}

We pass the SharedArrayBuffer to each of the workers as workerData. Each worker sets the first element of the buffer to its own thread ID and then logs that element.

One of the workers will have an ID equal to 1 and the other an ID equal to 2. Before reading any further, what do you expect to see in the output when this code runs?

Here are the possible results.

# Result type 1
{ threadId: 1, value: 2 }
{ threadId: 2, value: 2 }

# Result type 2
{ threadId: 1, value: 1 }
{ threadId: 2, value: 1 }

# Result type 3
{ threadId: 1, value: 1 }
{ threadId: 2, value: 2 }

Did you notice it? Why on earth do we have cases where the value is the same for both threads? From the standpoint of a single-threaded program, each worker should print its own ID every time.

Even if we ran this code asynchronously in a single thread, the only thing that could possibly differ is the order in which the results are printed, not the printed values themselves.

What happens here is that one thread assigns its value right between these two lines of the other thread:

  typedArray[0] = threadId;

  // the other thread sneaks in right here and assigns its value

  console.dir({ threadId, value: typedArray[0] });

It goes like this:

  1. The first thread assigns a value to the shared buffer.

  2. The second thread assigns a value to the shared buffer.

  3. The first thread prints the result to the console.

  4. The second thread prints the result to the console.
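The four steps can be replayed deterministically in a single thread. This is a simulation rather than a real race: FIRST_ID and SECOND_ID are hypothetical stand-ins for the two workers' threadId values, and the statements are simply written in the order of the interleaving described above.

```javascript
// Hypothetical stand-ins for the threadId of each worker.
const FIRST_ID = 1;
const SECOND_ID = 2;

const typedArray = new Int8Array(new SharedArrayBuffer(1));

typedArray[0] = FIRST_ID;  // 1. the first thread writes its ID
typedArray[0] = SECOND_ID; // 2. the second thread overwrites it

// 3. the first thread reads the buffer and prints
const printedByFirst = { threadId: FIRST_ID, value: typedArray[0] };
// 4. the second thread reads the buffer and prints
const printedBySecond = { threadId: SECOND_ID, value: typedArray[0] };

console.dir(printedByFirst);  // { threadId: 1, value: 2 }
console.dir(printedBySecond); // { threadId: 2, value: 2 }
```

Swapping steps 1 and 2 gives the case where both workers print 1. Running each worker's write and print back to back gives the case where every worker prints its own ID.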

As you can see, it is easy to run into a race condition with as little as 10 lines of code when we have shared resources and multiple threads. That's why we need a mechanism that can make sure that one worker is not interrupting the workflow of another worker. The Atomics API was created exactly for this purpose.

Atomics

I want to emphasize that using Atomics is the only possible way to be 100% sure that you're not running into race conditions when dealing with multiple threads and shared resources between them.

The main purpose of Atomics is to make sure that a single operation is performed as a single, uninterruptible unit. In other words, it ensures that no other worker can get in the middle of an operation that is currently executing and do its own work there, like we've seen before.
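To illustrate what "a single, uninterruptible unit" means for a read-modify-write operation, here is a hedged sketch of a shared counter. It uses Atomics.add, which the article itself does not show. WORKERS, INCREMENTS, and runWorker are hypothetical names, and the worker body is passed as a string with eval: true purely to keep the sketch in one file.

```javascript
import { Worker } from 'node:worker_threads';

const WORKERS = 4;
const INCREMENTS = 10_000;

// Worker body: increment the shared counter atomically, INCREMENTS times.
const workerSource = `
  const { workerData } = require('node:worker_threads');
  const counter = new Int32Array(workerData.buffer);
  for (let i = 0; i < workerData.increments; i++) {
    Atomics.add(counter, 0, 1); // read, add, and write as one indivisible step
  }
`;

// Start one worker and resolve when it exits.
function runWorker(buffer) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { buffer, increments: INCREMENTS },
    });
    worker.once('error', reject);
    worker.once('exit', resolve);
  });
}

const buffer = new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT);
await Promise.all(Array.from({ length: WORKERS }, () => runWorker(buffer)));

const total = Atomics.load(new Int32Array(buffer), 0);
console.log({ total, expected: WORKERS * INCREMENTS });
```

Because every increment is atomic, no update can be lost and total always equals WORKERS * INCREMENTS. Replacing Atomics.add with counter[0]++ removes that guarantee.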

Let's rewrite the example with race conditions using Atomics.

import { Worker, isMainThread, workerData, threadId } from 'node:worker_threads';

if (isMainThread) {
  const buffer = new SharedArrayBuffer(1);
  new Worker(import.meta.filename, { workerData: buffer });
  new Worker(import.meta.filename, { workerData: buffer });
} else {
  const typedArray = new Int8Array(workerData);
  const value = Atomics.store(typedArray, 0, threadId);
  console.dir({ threadId, value });
}

We changed two things: how we save the value and how we read the saved value. With Atomics, the store function covers both: it writes the value atomically and returns the value it wrote, so there is no separate read that another thread could get ahead of.

When you run this code, you won't see a case where both threads have the same value. They are always different.

{ threadId: 1, value: 1 }
{ threadId: 2, value: 2 }

{ threadId: 2, value: 2 }
{ threadId: 1, value: 1 }
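It helps to look at what Atomics.store actually returns. The following single-threaded sketch (variable names are illustrative) shows that the return value is the value passed in, not a fresh read of memory. The out-of-range case makes the difference visible, since 300 does not fit into an Int8Array element.

```javascript
const typedArray = new Int8Array(new SharedArrayBuffer(1));

// store hands back the value it was asked to write.
const returned = Atomics.store(typedArray, 0, 2);
// load reads whatever is currently in memory.
const loaded = Atomics.load(typedArray, 0);
console.log({ returned, loaded }); // { returned: 2, loaded: 2 }

// 300 is out of range for Int8, so memory keeps the wrapped value,
// while store still returns the number it was given.
const returnedBig = Atomics.store(typedArray, 0, 300);
const loadedBig = Atomics.load(typedArray, 0);
console.log({ returnedBig, loadedBig }); // { returnedBig: 300, loadedBig: 44 }
```

So in the rewritten example each worker prints the value it just wrote, which is why the two outputs can never match.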

We could use 2 operations instead of 1: store and load.

const typedArray = new Int8Array(workerData);
Atomics.store(typedArray, 0, threadId);
const value = Atomics.load(typedArray, 0);
console.dir({ threadId, value });

However, this approach is still prone to race conditions. The whole point of using Atomics is to make our operations atomic.

In this case, we want 2 operations to be executed as a single atomic operation: saving a value and reading it back. When we use the store and load functions, we're actually doing 2 separate atomic operations, not 1.

That's why it is still possible to run into a race condition where code from one worker gets in between the store and load calls of another worker.

Atomics has more than just these 2 functions. In the following article (https://pavel-romanov.com/building-semaphore-and-mutex-in-nodejs), we'll cover how to use more of them to build our own semaphore and mutex to make working with shared resources even more convenient.
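When a check and a write have to happen together, one option is a single call that does both. The sketch below uses Atomics.compareExchange, which is part of the standard Atomics API but is not used in the original article. The tryClaim helper and the UNCLAIMED constant are hypothetical, and the example runs in one thread to keep the result deterministic.

```javascript
const UNCLAIMED = 0;
const slot = new Int8Array(new SharedArrayBuffer(1)); // starts at 0

// Hypothetical helper: write `id` only if the slot is still unclaimed.
function tryClaim(view, id) {
  // compareExchange returns the value it found. The write happens only
  // when that value equals UNCLAIMED, all in one atomic step.
  return Atomics.compareExchange(view, 0, UNCLAIMED, id) === UNCLAIMED;
}

const firstClaim = tryClaim(slot, 1);  // the slot was free
const secondClaim = tryClaim(slot, 2); // the slot is already taken
const owner = Atomics.load(slot, 0);

console.log({ firstClaim, secondClaim, owner });
// { firstClaim: true, secondClaim: false, owner: 1 }
```

With real workers the winner depends on timing, but exactly one claim succeeds, because no other thread can act between the comparison and the write.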

Conclusion

Node.js is all fun and good when there is only a single thread. If you introduce multiple threads and shared resources on top of it, you get an environment where race conditions are inevitable.

There is only one mechanism in JavaScript that allows you to mitigate these problems and avoid race conditions: Atomics.

The idea of Atomics is to have operations that execute as a single unit that cannot be interrupted from the outside.

Thanks to this design, we can be sure that whenever we use an Atomics function, no other thread can get in the middle of that operation.
