JusperLee / SonicSim

http://cslikai.cn/SonicSim/
Creative Commons Attribution Share Alike 4.0 International

A question about the dataset construction #6

Open lvmuchun opened 6 days ago

lvmuchun commented 6 days ago

Hi,

How do you generate the moving sound source, and how are the interpolation weights calculated? I can't find a description of this anywhere. Thanks

JusperLee commented 2 days ago

SonicSim selects the start and end positions of the sound source in 3D space and then uses Habitat-sim's path-finding API to generate a navigable trajectory. The trajectory points are used to construct a smooth interpolation path and to calculate the corresponding interpolation weights, which produces realistic spatial movement during audio synthesis. Assume the source audio signal is $\mathbf{s}(t)$ with duration $T$, and the room impulse responses (RIRs) $\mathbf{h}_j^c(t)$ describe the transmission characteristics from each position $j$ to the receiver, where $j = 1, 2, \dots, N$ indexes the $N$ discrete positions and $c$ is the audio channel index. The process is as follows:

  1. Convolution at Discrete Positions:
    The source signal is convolved with the RIR $\mathbf{h}_j^c(t)$ at each position $j$, resulting in the convolution response: $\mathbf{y}_j^c(t) = \mathbf{s}(t) * \mathbf{h}_j^c(t),$ where $*$ denotes the convolution operation.

  2. Linear Interpolation Between Positions:
    Since a moving sound source transitions smoothly between discrete positions, the convolution results of adjacent positions must be interpolated. We introduce a linear interpolation weight $\alpha(t)$, which indicates the degree of transition between positions $j$ and $j+1$. The weight $\alpha(t)$ ranges from $0$ (fully at position $j$) to $1$ (fully at position $j+1$). The interpolation weights are calculated from the Euclidean distance between the moving source's current position and the neighboring positions: $\alpha(t) = \frac{d_j - \text{dist}(\mathbf{r}_j, \mathbf{r}_t)}{d_j - d_{j+1}},$ where $d_j$ and $d_{j+1}$ are the spatial distances between positions $j$ and $j+1$, and $\text{dist}(\mathbf{r}_j, \mathbf{r}_t)$ is the distance between the moving source's current position $\mathbf{r}_t$ and position $j$.

  3. Weighted Average of Convolution Results:
    At each time step $t$, the interpolated audio signal is computed as the weighted average of the convolution results at adjacent positions: $\mathbf{y}^c(t) = (1 - \alpha(t)) \cdot \mathbf{y}_{\mathbf{i}(t)}^c(t) + \alpha(t) \cdot \mathbf{y}_{\mathbf{i}(t)+1}^c(t),$ where $\mathbf{i}(t)$ denotes the position index at time $t$, $\mathbf{y}_{\mathbf{i}(t)}^c(t)$ and $\mathbf{y}_{\mathbf{i}(t)+1}^c(t)$ are the convolution results at positions $j = \mathbf{i}(t)$ and $j+1$, respectively, and $\alpha(t)$ is the interpolation weight.
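
The three steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not SonicSim's actual code: the function names (`convolve_at_positions`, `interpolation_weights`, `render_moving_source`) and array layouts are hypothetical, the source is assumed to move at constant speed along the straight segments between trajectory points, and $\alpha(t)$ is taken as the fraction of the current segment already travelled, which is one plausible reading of the distance-based weight described in step 2.

```python
import numpy as np


def convolve_at_positions(source, rirs):
    """Step 1: convolve a mono source (T,) with RIRs of shape (N, C, L).

    Returns y with shape (N, C, T + L - 1), i.e. y[j, c] = s * h_j^c.
    """
    n_pos, n_ch, rir_len = rirs.shape
    out = np.zeros((n_pos, n_ch, len(source) + rir_len - 1))
    for j in range(n_pos):
        for c in range(n_ch):
            out[j, c] = np.convolve(source, rirs[j, c])
    return out


def interpolation_weights(positions, n_samples):
    """Step 2: per-sample segment index i(t) and weight alpha(t).

    Assumptions (not from the original post): constant speed along the
    polyline through `positions` (N, 3), and distinct consecutive points.
    """
    seg_len = np.linalg.norm(np.diff(positions, axis=0), axis=1)  # (N-1,)
    cum = np.concatenate([[0.0], np.cumsum(seg_len)])  # path length at each point
    travelled = np.linspace(0.0, cum[-1], n_samples)   # path length at each sample
    idx = np.searchsorted(cum, travelled, side="right") - 1
    idx = np.clip(idx, 0, len(positions) - 2)          # keep idx + 1 in range
    alpha = (travelled - cum[idx]) / seg_len[idx]      # 0 at point j, 1 at point j+1
    return idx, alpha


def render_moving_source(source, rirs, positions):
    """Step 3: blend the convolution results of adjacent positions per sample."""
    y = convolve_at_positions(source, rirs)            # (N, C, T')
    n_samples = y.shape[-1]
    idx, alpha = interpolation_weights(positions, n_samples)
    t = np.arange(n_samples)
    y_j = y[idx, :, t].T        # (C, T'): response at position i(t)
    y_j1 = y[idx + 1, :, t].T   # (C, T'): response at position i(t) + 1
    return (1.0 - alpha) * y_j + alpha * y_j1
```

Calling `render_moving_source(source, rirs, positions)` with `rirs` of shape `(N, C, L)` and `positions` of shape `(N, 3)` returns a `(C, T + L - 1)` array. Convolving the full signal at every position is simple but costly for long trajectories; a practical implementation would likely process the audio in blocks.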