cvqluu / simple_diarizer

Simplified diarization pipeline using some pretrained models - audio file to diarized segments in a few lines of code
GNU General Public License v3.0

Clustering kwargs exposed #14

Open andrewmackie opened 1 year ago

andrewmackie commented 1 year ago

I have exposed the kwargs for all of the sklearn-based clustering algorithms so that they can be called from cluster_SC(), cluster_AHC(), Diarizer.diarize() and the command line.

All kwargs accepted by the sklearn algorithms should be available. I noted that you have default values for some kwargs and have retained those.

I haven't done comprehensive testing. I won't be offended if you want to change the way it is implemented.

FYI, the reason I did this is that the 'arpack' eigen solver in sklearn.cluster.SpectralClustering falls over when attempting to cluster a large number (>2k) of embeddings. Using the 'lobpcg' eigen solver appears to address this problem, but the eigen_solver kwarg could not previously be set from Diarizer.diarize(). Now it can.
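
To illustrate the idea, here is a minimal sketch of forwarding caller-supplied kwargs to sklearn while keeping defaults. The names `DEFAULT_SC_KWARGS`, `merge_clustering_kwargs` and `cluster_spectral` are hypothetical and are not the functions in this PR, and the `affinity="precomputed"` default is an assumption about how the similarity matrix is passed in. Only `SpectralClustering` and its `n_clusters`, `affinity` and `eigen_solver` parameters come from sklearn itself.

```python
# Hypothetical defaults; the repository's real defaults may differ.
DEFAULT_SC_KWARGS = {"affinity": "precomputed"}


def merge_clustering_kwargs(overrides, defaults=DEFAULT_SC_KWARGS):
    """Return a new dict in which caller-supplied kwargs override the defaults."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


def cluster_spectral(similarity_matrix, n_clusters, **kwargs):
    """Sketch of a spectral clustering wrapper that exposes sklearn's kwargs.

    Assumes similarity_matrix is a square, precomputed affinity matrix.
    """
    # Imported lazily so the kwarg handling above works without sklearn installed.
    from sklearn.cluster import SpectralClustering

    params = merge_clustering_kwargs(kwargs)
    model = SpectralClustering(n_clusters=n_clusters, **params)
    return model.fit_predict(similarity_matrix)
```

With a wrapper like this, a call such as `cluster_spectral(sim, n_clusters=4, eigen_solver="lobpcg")` would select the 'lobpcg' solver without changing any other default.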

andrewmackie commented 1 year ago

I've realised that when kwargs are passed from the command line, all of the values will be received as strings, so some will need to be converted.

The most thorough method of doing this would probably be to:

  1. create a dictionary which contains the types of the known clustering kwargs and convert the values into those types, and
  2. guess the type of any unknown kwargs, e.g. foo=True -> {'foo': True}
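
A rough sketch of those two steps, assuming the kwargs arrive as `key=value` strings. The function names and the contents of `KNOWN_KWARG_TYPES` are illustrative only: the table lists a handful of sklearn clustering parameters and is not exhaustive. Unknown kwargs are guessed with `ast.literal_eval`, falling back to the raw string.

```python
import ast

# Illustrative subset of known sklearn clustering kwargs and their types.
KNOWN_KWARG_TYPES = {
    "n_init": int,
    "n_neighbors": int,
    "gamma": float,
    "distance_threshold": float,
    "eigen_solver": str,
    "assign_labels": str,
    "linkage": str,
}


def guess_type(text):
    """Guess a Python value from a string, e.g. 'True' -> True, '3' -> 3."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a Python literal (e.g. 'lobpcg'), so keep it as a string.
        return text


def parse_cli_kwargs(pairs):
    """Convert ['key=value', ...] strings into a typed kwargs dict."""
    kwargs = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        converter = KNOWN_KWARG_TYPES.get(key, guess_type)
        kwargs[key] = converter(value)
    return kwargs


print(parse_cli_kwargs(["eigen_solver=lobpcg", "n_init=20", "foo=True"]))
```

Note that `bool` is deliberately not used as a converter, because `bool("False")` is `True`; boolean kwargs would need their own parser or can go through `guess_type`.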

Please let me know if you would like me to do this (I'm very happy for you to do it as well).