LCAV / pyroomacoustics

Pyroomacoustics is a package for audio signal processing for indoor applications. It was developed as a fast prototyping platform for beamforming algorithms in indoor scenarios.
https://pyroomacoustics.readthedocs.io
MIT License

Implementation of the MUSIC algorithm #269

Closed · AvioGG closed this issue 1 year ago

AvioGG commented 1 year ago

Hi fakufaku, first of all, thanks for the great library.

I am currently using the multiple signal classification (MUSIC) algorithm for DoA estimation, and it works pretty well for both narrowband and wideband signals.

However, after looking into the source code, I have a few questions regarding the implementation. The code snippet below is taken from the doa/music.py file. The main thing I am unsure about is the tensor denoted by X, which is defined as the STFT of the input signal with shape (M x n_fft/2+1 x num_frames), where M is the number of microphones.

Why do we need this tensor X in the first place and why do we take the autocorrelation of STFT frames?

The MUSIC algorithm as described in the original paper (and everywhere else, as far as I know) does not use any frequency representation as input. It only computes the autocorrelation matrix in the time domain and uses the steering vectors, which are known from the geometry of the setup, followed by an eigenvalue decomposition.

def _process(self, X):
    """
    Perform MUSIC for given frame in order to estimate steered response
    spectrum.
    """
    # compute steered response
    self.Pssl = np.zeros((self.num_freq, self.grid.n_points))
    C_hat = self._compute_correlation_matricesvec(X)
    # subspace decomposition
    Es, En, ws, wn = self._subspace_decomposition(C_hat[None, ...])
    # compute spatial spectrum
    identity = np.zeros((self.num_freq, self.M, self.M))
    identity[:, list(np.arange(self.M)), list(np.arange(self.M))] = 1
    cross = identity - np.matmul(Es, np.moveaxis(np.conjugate(Es), -1, -2))
    self.Pssl = self._compute_spatial_spectrumvec(cross)
    if self.frequency_normalization:
        self._apply_frequency_normalization()
    self.grid.set_values(np.squeeze(np.sum(self.Pssl, axis=1) / self.num_freq))

# vectorized version
def _compute_correlation_matricesvec(self, X):
    # reorder X so that the axes are (time frames, frequency bins, microphones)
    X = np.transpose(X, axes=[2, 1, 0])
    # select frequency bins
    X = X[..., list(self.freq_bins), :]
    # outer product of each STFT frame with itself, per selected frequency bin
    C_hat = np.matmul(X[..., None], np.conjugate(X[..., None, :]))
    # average over the time frames to obtain the spatial covariance estimate
    C_hat = np.mean(C_hat, axis=0)
    return C_hat

# vectorized version
def _subspace_decomposition(self, R):
    # eigenvalue decomposition!
    # This method is specialized for Hermitian symmetric matrices,
    # which is the case since R is a covariance matrix
    w, v = np.linalg.eigh(R)

    # numpy.linalg.eigh returns the eigenvalues (and eigenvectors) in
    # ascending order, so there is no need to sort. The signal subspace
    # comprises the leading eigenvalues; the noise subspace takes the rest.

    Es = v[..., -self.num_src :]
    ws = w[..., -self.num_src :]
    En = v[..., : -self.num_src]
    wn = w[..., : -self.num_src]

    return (Es, En, ws, wn)

def _compute_spatial_spectrumvec(self, cross):
    mod_vec = np.transpose(
        np.array(self.mode_vec[self.freq_bins, :, :]), axes=[2, 0, 1]
    )
    # denom has shape (grid points, frequency bins): a^H (I - Es Es^H) a
    denom = np.matmul(
        np.conjugate(mod_vec[..., None, :]), np.matmul(cross, mod_vec[..., None])
    )
    return 1.0 / abs(denom[..., 0, 0])
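
If I read _compute_spatial_spectrumvec correctly, what it evaluates for each selected frequency bin and each grid direction is the standard MUSIC pseudospectrum,

$$ \hat{P}(\theta, f) = \frac{1}{\left| \mathbf{a}(\theta, f)^{H}\,\big(\mathbf{I} - \mathbf{E}_s \mathbf{E}_s^{H}\big)\,\mathbf{a}(\theta, f) \right|}, $$

where $\mathbf{a}(\theta, f)$ is the steering (mode) vector and $\mathbf{E}_s$ spans the signal subspace; _process then averages this over the selected bins.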

After looking into various sources on MUSIC, including Wikipedia and the original paper, I see no FFT operations or spectrogram calculations. In addition, I have implemented the MUSIC algorithm myself for a uniform linear array following the description in the paper, and it works properly.
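
For reference, the narrowband, time-domain formulation I have in mind boils down to something like the following sketch for a ULA (all parameters are arbitrary placeholders and the Hilbert transform is just one convenient way to get a complex narrowband signal; this is not my actual code):

import numpy as np
from scipy.signal import hilbert

# --- arbitrary placeholder parameters ---
fs = 16000        # sampling rate [Hz]
f0 = 1000.0       # narrowband source frequency [Hz]
c = 343.0         # speed of sound [m/s]
M = 8             # number of microphones
d = 0.05          # inter-microphone spacing [m]
num_src = 1

# simulate a narrowband source hitting the ULA from 40 degrees (angle from the array axis)
theta_true = np.deg2rad(40.0)
t = np.arange(fs) / fs
delays = d * np.arange(M) * np.cos(theta_true) / c
x = np.array([np.cos(2 * np.pi * f0 * (t - tau)) for tau in delays])
x += 0.01 * np.random.randn(*x.shape)

# analytic signal, so the fractional inter-channel delays become phase shifts
xa = hilbert(x, axis=1)

# time-domain spatial covariance and its eigendecomposition
R = xa @ xa.conj().T / xa.shape[1]
w, v = np.linalg.eigh(R)          # eigenvalues in ascending order
En = v[:, : M - num_src]          # noise subspace

# scan the known steering vectors over candidate directions
angles = np.deg2rad(np.arange(0.0, 181.0))
k = 2 * np.pi * f0 / c
A = np.exp(-1j * k * d * np.outer(np.arange(M), np.cos(angles)))
P = 1.0 / np.linalg.norm(En.conj().T @ A, axis=0) ** 2   # MUSIC pseudospectrum
print("estimated DoA:", np.rad2deg(angles[np.argmax(P)]))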

So my questions are essentially the ones above: why do we need the STFT tensor X in the first place, and why do we take the autocorrelation of the STFT frames rather than of the time-domain signal?

fakufaku commented 1 year ago

Hi @AvioGG , that's a great question.

  1. The thing about the original MUSIC paper, Wikipedia, etc., is that they all assume that the signal is narrowband. While this assumption is usually respected in radio applications (radar, etc.), it does not hold for speech.
  2. In addition, speech signals are real-valued and sampled, so it is difficult to exploit fractional delays between the channels directly in the time domain. In the frequency domain, a fractional delay becomes a phase difference between channels, which admits fractional values. The fractional delay between channels can be represented by a short filter; after going to the frequency domain, this becomes a simple point-wise multiplication by a complex exponential whose exponent depends on the frequency band, microphone location, and source direction (written out just below).
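
In symbols (standard far-field notation, nothing specific to the code): a relative delay $\tau_m(\theta)$ at microphone $m$ for a source in direction $\theta$ shows up in the subband centered at frequency $f$ as a multiplication of that channel's STFT by

$$ a_m(\theta, f) = e^{-j 2 \pi f \tau_m(\theta)}, $$

which is exactly the per-bin steering (mode) vector entry used by the DOA algorithms.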

Thus, in practice, the real-valued input signal is first transformed to the STFT domain, where the narrowband assumption holds within each subband. The DOA algorithm (MUSIC, SRP, etc.) is then run per frequency band, and the results are combined.
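
In pyroomacoustics, the typical usage looks roughly like this minimal sketch (the room, array geometry, STFT size, and frequency range are just example values, in the spirit of the DOA examples in the docs):

import numpy as np
import pyroomacoustics as pra

fs = 16000
nfft = 256

# simulate a short anechoic recording with an 8-mic circular array (example geometry)
room = pra.ShoeBox([6, 5], fs=fs, max_order=0)
room.add_source([4.0, 3.5], signal=np.random.randn(fs))
mic_positions = pra.circular_2D_array(center=[2.0, 2.0], M=8, phi0=0.0, radius=0.05)
room.add_microphone_array(pra.MicrophoneArray(mic_positions, fs))
room.simulate()

# STFT of each channel -> X with shape (M, nfft // 2 + 1, num_frames)
X = np.array(
    [
        pra.transform.stft.analysis(sig, nfft, nfft // 2).T
        for sig in room.mic_array.signals
    ]
)

# MUSIC runs independently in each selected frequency bin; the per-bin
# pseudospectra are then averaged into a single spatial spectrum
doa = pra.doa.MUSIC(mic_positions, fs, nfft, c=343.0, num_src=1)
doa.locate_sources(X, freq_range=[300.0, 3500.0])
print("estimated azimuth [deg]:", np.rad2deg(doa.azimuth_recon))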

For a good reference on the practical implementation of these algorithms, see I. Tashev, "Sound Capture and Processing", 2009.

AvioGG commented 1 year ago

Thanks, the implementation makes much more sense now. Also, huge thanks for the book you mentioned; Chapters 5 and 6 are indeed what I was looking for, both for the implementation and the general theory.

fakufaku commented 1 year ago

Fantastic! Happy to help! Please close the issue if that solved your problem! 😄