MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/

Using pre-trained ivector extractors to extract features for each audio file (one speaker) #725

Open santideleon opened 7 months ago

santideleon commented 7 months ago

I have been looking for a way to use a pre-trained ivector extractor to generate features for audio files, which I would later use for a binary classification task across all speakers. I have 4 audio files per speaker (each with a different emotional affect) and would like to create a set of features for each of these files.
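
To make the goal concrete, here is a rough sketch of the downstream step I have in mind, assuming I can get one ivector per recording; the file names, labels, and the saved `.npy` format are just placeholders (not anything a particular tool produces by default), and numpy and scikit-learn are assumed to be available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical: one ivector per recording, saved as .npy after extraction.
files = ["spk1_neutral.npy", "spk1_happy.npy", "spk2_neutral.npy", "spk2_happy.npy"]
labels = [0, 1, 0, 1]  # binary target per recording

X = np.stack([np.load(f) for f in files])  # shape: (n_recordings, ivector_dim)
y = np.array(labels)

# Simple binary classifier over the per-recording ivectors
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=2)
print("cross-validated accuracy:", scores.mean())
```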

I was originally pointed towards Kaldi, but I am unable to set it up on my cluster and need something that runs in Python and doesn't require sudo (which rules out pykaldi). Is it possible to do this with MFA?

Additionally, does training the ivector extractor work in MFA? I would like to try training it either on my own data or on a Spanish speech corpus (the language of the recordings). If it does work, is there an example you can point me to? My recordings are .wav files.

Also, am I able to append my own features to the MFCCs used to train the ivectors? My understanding is that this can be done by concatenating my features frame by frame on the same time axis (for example, jitter or shimmer per frame). Would this be possible with MFA?
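
To clarify what I mean by concatenating on the same timeframe, here is a rough sketch; librosa is assumed to be available, the file name is made up, and the jitter/shimmer values are random placeholders standing in for real per-frame measurements:

```python
import numpy as np
import librosa

y, sr = librosa.load("speaker1_take1.wav", sr=16000)  # hypothetical file name

# 13 MFCCs per frame, 25 ms window / 10 ms hop (typical Kaldi-style settings)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

n_frames = mfcc.shape[1]
# Placeholder per-frame jitter and shimmer aligned to the same hop
jitter = np.random.rand(1, n_frames)
shimmer = np.random.rand(1, n_frames)

# Frame-by-frame concatenation: (13 + 2) features per frame
combined = np.vstack([mfcc, jitter, shimmer])
print(combined.shape)  # (15, n_frames)
```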

I am also open to any other suggestions or libraries. @mmcauliffe seems to be an expert on Python implementations of Kaldi.

Thank you for your time and help.