DavidDiazGuerra / icoDOA

Code repository for the paper Direction of Arrival Estimation of Sound Sources Using Icosahedral CNNs
GNU Affero General Public License v3.0

Optimal training strategy and array shape #2

Closed JuanFMontesinos closed 1 year ago

JuanFMontesinos commented 1 year ago

Hi, thanks again for the paper. From the paper it's clear that values of r > 2 won't improve the model. The training strategy is then defined in the paper as:

Similar to the curriculum learning [47] strategy employed in [27], we keep fixed the SNR of the
simulations to 30 dB during the first 25 epochs and then we employed uniformly distributed random values from 5 dB to
30 dB in the following epochs.

Since we are expecting to train the model with different dry sounds instead of speech,

  1. I was wondering how easy it is to get the model to converge.

  2. How did you end up choosing epoch 25 to reduce the SNR (had the model reached a plateau...?)

  3. Also, how easy is it to find a proper learning rate?

  4. Is it possible to move the array while training?

  5. Do you think it would generalize in that case? The points chosen to make the trajectories are supposed to be totally random, but this can eventually violate the far-field assumption of the beamforming algorithm.

  6. Does it work well despite that?

  7. In general, what is your experience with finding optimal hyperparameters?

Lastly, I've seen there are several array shapes coded.

  1. Does the code actually work with all of them and custom ones?

Sorry for asking so many questions :) I used numbers (despite being a bit rude) to sort the content a little.

Best, Juan

DavidDiazGuerra commented 1 year ago

Hi Juan,

Don't worry about the numbers, I agree they're the easiest way to order the questions. I'll try to answer all of them:

  1. I was wondering how easy it is to get the model to converge.

In the beginning, I had some convergence problems but after adding the layer normalization it converged without issues.

  2. How did you end up choosing epoch 25 to reduce the SNR (had the model reached a plateau...?)

The choice of epoch 25 to decrease the SNR was fairly arbitrary and I don't remember the reasons right now; it could probably be optimized.
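For reference, the schedule quoted from the paper can be sketched roughly as follows. This is only an illustration, not the repository's code: the function name `sample_snr_db`, its parameters, and the zero-based epoch counting are assumptions.

```python
import random


def sample_snr_db(epoch, warmup_epochs=25, fixed_snr_db=30.0,
                  min_snr_db=5.0, max_snr_db=30.0, rng=random):
    """Return the SNR in dB for one simulated scene at the given epoch.

    Assumes zero-based epoch counting, so epochs 0..24 are the "first 25".
    """
    if epoch < warmup_epochs:
        # Warm-up phase: every scene uses the same, high SNR.
        return fixed_snr_db
    # Afterwards: draw a new SNR uniformly at random for each scene.
    return rng.uniform(min_snr_db, max_snr_db)


for epoch in (0, 24, 25, 40):
    print(epoch, round(sample_snr_db(epoch), 1))
```

Making `warmup_epochs` a parameter keeps the switch point easy to tune, for example by setting it to the epoch where the validation loss flattens.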

  3. Also, how easy is it to find a proper learning rate?

I did not conduct many experiments on this. I think I just tried increasing it by a factor of 10 and decreasing it by a factor of 10, and both worked worse than the current one, so I just left it. Again, this can probably be optimized.
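A coarse sweep of this kind (one value a factor of 10 below and one a factor of 10 above a baseline) could be sketched as below. The names `lr_candidates`, `pick_best_lr`, and `toy_validation_error` are hypothetical, and the baseline of `1e-4` is only a placeholder, not the learning rate used in the paper.

```python
import math


def lr_candidates(base_lr, factor=10.0, steps=1):
    """Return learning rates spaced by `factor` around `base_lr`, smallest first."""
    return [base_lr * factor ** k for k in range(-steps, steps + 1)]


def pick_best_lr(base_lr, evaluate, factor=10.0, steps=1):
    """Evaluate each candidate and return the one with the lowest error.

    `evaluate` is a user-supplied callable mapping a learning rate to a
    validation error (lower is better), e.g. a short training run.
    """
    scores = {lr: evaluate(lr) for lr in lr_candidates(base_lr, factor, steps)}
    return min(scores, key=scores.get)


def toy_validation_error(lr):
    # Stand-in for a real training run: pretends 1e-4 is the best value.
    return abs(math.log10(lr) + 4.0)


best = pick_best_lr(1e-4, toy_validation_error)
print(lr_candidates(1e-4), best)
```

Increasing `steps` or lowering `factor` gives a finer search at the cost of more training runs.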

  4. Is it possible to move the array while training?
  5. Do you think it would generalize in that case?

Moving the array during the training scenes should be possible, but it might be difficult to implement since gpuRIR, the library I used to simulate the room acoustics, is designed to simulate moving sources with static arrays. However, I wouldn't expect the movement of the array to affect the SRP-PHAT maps very differently from the movement of the sources (especially in single-source scenarios), so the current training probably generalizes well to the movement of the array.
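One way to see the geometric part of this argument: the direction of arrival of the direct path depends only on the source position relative to the array, so translating the array by some vector gives the same DOA as translating the source by the opposite vector. The sketch below illustrates just that. The function `doa` and the example coordinates are hypothetical, and the sketch ignores reverberation and array rotation, which a real simulation would not.

```python
import math


def doa(source_pos, array_pos):
    """Return (azimuth, elevation) in radians of the source seen from the array.

    Only the direct path is considered, and the array is assumed to keep
    its orientation (translation only, no rotation).
    """
    dx, dy, dz = (s - a for s, a in zip(source_pos, array_pos))
    azimuth = math.atan2(dy, dx)
    elevation = math.atan2(dz, math.hypot(dx, dy))
    return azimuth, elevation


source = (3.0, 2.0, 1.5)
array = (1.0, 1.0, 1.0)

# Array moved by +0.5 m along x, source fixed.
moved_array = doa(source, (1.5, 1.0, 1.0))
# Source moved by -0.5 m along x, array fixed.
moved_source = doa((2.5, 2.0, 1.5), array)

print(moved_array, moved_source)
```

The reflections do differ between the two cases, since the room walls stay fixed, so this equivalence only covers the direct-path component.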

The points chosen to make the trajectories are supposed to be totally random, but this can eventually violate the far-field assumption of the beamforming algorithm.

  6. Does it work well despite that?

There are no distance limits in the simulation of the training scenes, so some of them might indeed be near-field and lead to wrong SRP-PHAT maps. I hadn't thought too much about this, but it might be interesting to analyze whether the distance from the source to the array affects the localization accuracy of the trained models.
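Such an analysis could start by grouping the test errors by source-to-array distance and comparing the per-bin averages. The sketch below assumes you already have, for each test frame, the distance in meters and the angular error in degrees. The function name `mean_error_by_distance`, the bin width, and the sample values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_error_by_distance(distances_m, errors_deg, bin_width_m=0.5):
    """Average the angular errors within fixed-width distance bins.

    Returns a dict mapping (bin_start_m, bin_end_m) to the mean error in
    degrees, ordered from the nearest bin to the farthest.
    """
    bins = defaultdict(list)
    for distance, error in zip(distances_m, errors_deg):
        # Index of the bin this distance falls into.
        bins[int(distance // bin_width_m)].append(error)
    return {
        (index * bin_width_m, (index + 1) * bin_width_m): mean(errors)
        for index, errors in sorted(bins.items())
    }


# Made-up values, only to show the call.
summary = mean_error_by_distance([0.2, 0.4, 1.2], [10, 20, 5])
for (start, end), error in summary.items():
    print(f"{start:.1f}-{end:.1f} m: {error:.1f} deg")
```

If the nearest bins show clearly larger errors than the rest, that would suggest the near-field scenes are hurting localization.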

  7. In general, what is your experience with finding optimal hyperparameters?

To be honest, I didn't conduct many experiments to optimize the hyperparameters of the model, so there's certainly room for optimization there. Right now the most I can say is that you can reduce the number of convolution channels and still get quite competitive results; these are the results I got in some experiments I did after publishing the paper:

[Image: results with reduced numbers of convolution channels]

Lastly, I've seen there are several array shapes coded.

  1. Does the code actually work with all of them and custom ones?

I haven't tested it but I think it should.

And just a final comment about this:

From the paper it's clear that values of r > 2 won't improve the model.

This is definitely true for the array I used in the paper, but with different arrays, especially bigger ones, it is possible that higher-resolution maps actually contain additional information that could lead to better localization accuracy.

Best regards, David

JuanFMontesinos commented 1 year ago

Perfect! Thank you for kindly replying to all my questions :)

And again, congrats on the paper!