A question about the dataset construction

SonicSim selects the starting and endpoint positions of the sound source in a 3D space and then uses Habitat-sim's path-finding API to generate a navigable trajectory. The trajectory points are used to construct a smooth interpolation path and calculate the corresponding interpolation weights, enabling realistic spatial movement effects during audio synthesis. Assuming the source audio signal is $\mathbf{s}(t)$ with duration $T$, and the room impulse responses (RIRs) $\mathbf{h}_j^c(t)$ describe the transmission characteristics from each position $j$ to the receiver, where $j = 1, 2, \dots, N$ represents $N$ discrete positions and $c$ is the audio channel index, the process is as follows:

Convolution at Discrete Positions:
The source signal is convolved with the RIR $\mathbf{h}_j^c(t)$ at each position $j$, resulting in the convolution response: $\mathbf{y}_j^c(t) = \mathbf{s}(t) * \mathbf{h}_j^c(t),$ where $*$ denotes the convolution operation.
Linear Interpolation Between Positions:
Since moving sound sources transition smoothly between discrete positions, interpolation between the convolution results of adjacent positions is necessary. We introduce a linear interpolation weight $\alpha(t)$, which indicates the degree of transition between positions $j$ and $j+1$. The weight $\alpha(t)$ ranges from $0$ (fully at position $j$) to $1$ (fully at position $j+1$). The interpolation weights are calculated based on the Euclidean distance between the moving source's current position and the neighboring positions: $\alpha(t) = \frac{d_j - \text{dist}(\mathbf{r}_j, \mathbf{r}_t)}{d_j - d_{j+1}},$ where $d_j$ and $d_{j+1}$ are the spatial distances between positions $j$ and $j+1$, and $\text{dist}(\mathbf{r}_j, \mathbf{r}_t)$ is the distance between the receiver's current position $\mathbf{r}_t$ and position $j$.
Weighted Average of Convolution Results:
At each time step $t$, the interpolated audio signal is computed as the weighted average of the convolution results at adjacent positions: $\mathbf{y}(t) = (1 - \alpha(t)) \cdot \mathbf{y}_{\mathbf{i}(t)}^c(t) + \alpha(t) \cdot \mathbf{y}_{\mathbf{i}(t)+1}^c(t),$ where $\mathbf{i}(t)$ denotes the position index at time $t$, $\mathbf{y}_{\mathbf{i}(t)}^c(t)$ and $\mathbf{y}_{\mathbf{i}(t)+1}^c(t)$ are the convolution results at positions $j$ and $j+1$, respectively, and $\alpha(t)$ is the interpolation weight.
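
To make the three steps above concrete, here is a minimal NumPy sketch of how they could be read. It is not taken from the SonicSim code base. The function name `render_moving_source` and the array layout are assumptions made for illustration, and the position index $\mathbf{i}(t)$ and weight $\alpha(t)$ are assumed to be precomputed per sample (for example from the interpolated trajectory) rather than derived from the distance formula.

```python
import numpy as np


def render_moving_source(s, rirs, idx, alpha):
    """Crossfade per-position convolutions of a mono source (illustrative sketch).

    s:     (T,) source signal s(t)
    rirs:  (N, C, L) RIRs h_j^c(t) for N positions and C channels
    idx:   (T,) integer position index i(t), each value in [0, N-2]
    alpha: (T,) interpolation weight alpha(t), each value in [0, 1]
    Returns an array of shape (C, T).
    """
    s = np.asarray(s, dtype=float)
    rirs = np.asarray(rirs, dtype=float)
    idx = np.asarray(idx, dtype=int)
    alpha = np.asarray(alpha, dtype=float)

    T = s.shape[0]
    N, C, _ = rirs.shape

    # Step 1: y_j^c = s * h_j^c for every position and channel,
    # truncated to the source duration T (an assumption of this sketch).
    y = np.empty((N, C, T))
    for j in range(N):
        for c in range(C):
            y[j, c] = np.convolve(s, rirs[j, c])[:T]

    # Steps 2-3: at each sample t, pick the two neighbouring convolution
    # results and blend them with weights (1 - alpha) and alpha.
    t = np.arange(T)
    y_lo = y[idx, :, t]      # shape (T, C): y_{i(t)}^c(t)
    y_hi = y[idx + 1, :, t]  # shape (T, C): y_{i(t)+1}^c(t)
    out = (1.0 - alpha)[:, None] * y_lo + alpha[:, None] * y_hi
    return out.T
```

With two single-tap RIRs of gain 1 and 3, a constant source, and `alpha` moving from 0 to 1, the output moves linearly from 1 to 3, which matches the weighted-average formula. The sketch convolves the full signal with every RIR, so its cost grows with $N$; it is meant to show the formula, not an efficient implementation.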

JusperLee / SonicSim

A question about the dataset construction #6