DBD-research-group / BirdSet

A benchmark dataset collection for bird sound classification
https://huggingface.co/datasets/DBD-research-group/BirdSet
BSD 3-Clause "New" or "Revised" License

Explore Baseline Models #35

Closed: lurauch closed this issue 9 months ago

lurauch commented 11 months ago

Models for spectrograms:

  1. ConvNeXT: A pure convolutional model (ConvNet) inspired by the design of Vision Transformers that claims to outperform them (a loading sketch follows this list). (https://huggingface.co/docs/transformers/model_doc/convnext#:~:text=ConvNeXT%20is%20a%20pure%20convolutional,art%20image)

  2. Swin Transformer: A hierarchical vision transformer that employs shifted windows to compute representations. (https://huggingface.co/docs/transformers/model_doc/swin)

  3. Audio Spectrogram Transformer (AST): Applies a Vision Transformer to audio by turning it into an image (spectrogram) and obtains state-of-the-art results for audio classification. (https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer#:~:text=In%20this%20paper%2C%20we%20answer,50%2C%20and%2098.1%25)

  4. ResNet: A convolutional neural network that employs residual connections to facilitate training much deeper networks, up to an unprecedented 1,000 layers. (https://huggingface.co/docs/transformers/model_doc/resnet#:~:text=ResNet%20Overview,5%E2%80%9D)

  5. EfficientNet: A model designed to uniformly scale network width, depth, and resolution using a compound coefficient, achieving state-of-the-art image-classification accuracy while being smaller and faster than previous models. (https://huggingface.co/docs/transformers/model_doc/efficientnet#:~:text=EfficientNet%20Overview,the%20paper%20is%20the%20following)
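A minimal sketch of the spectrogram route (not BirdSet code): convert a waveform to a log-mel spectrogram with torchaudio and feed it to a vision backbone such as ConvNeXT. The checkpoint, label count, sampling rate, and spectrogram parameters below are placeholders, not values from this issue.

```python
import torch
import torchaudio
from transformers import ConvNextForImageClassification

NUM_CLASSES = 21          # placeholder label count
SAMPLE_RATE = 32_000      # assumed sampling rate

# Vision backbone with a fresh classification head for bird classes.
model = ConvNextForImageClassification.from_pretrained(
    "facebook/convnext-tiny-224",     # assumed checkpoint
    num_labels=NUM_CLASSES,
    ignore_mismatched_sizes=True,     # replace the ImageNet head
)

# Waveform -> log-mel spectrogram "image".
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=320, n_mels=128
)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, SAMPLE_RATE * 5)            # placeholder: 5 s of audio
spec = to_db(to_mel(waveform))                        # (1, 128, frames)
pixel_values = spec.unsqueeze(0).repeat(1, 3, 1, 1)   # (batch, 3, 128, frames)

with torch.no_grad():
    logits = model(pixel_values=pixel_values).logits  # (batch, NUM_CLASSES)
```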

Models for waveforms:

  1. Wav2Vec2: Learns powerful representations from speech audio by solving a contrastive task; the model can then be fine-tuned on transcribed speech. (https://huggingface.co/docs/transformers/model_doc/wav2vec2#:~:text=Wav2Vec2%20Overview,on%20transcribed%20speech%20can). There is also one Wav2Vec2 model already trained on bird sounds (https://huggingface.co/Saads/bird_classification_model#:~:text=,Training%20procedure%20Training%20hyperparameters) --> Uses its own Wav2Vec2FeatureExtractor (a loading sketch follows after this list)

  2. Wav2Vec2-Conformer: Follows the same architecture as Wav2Vec2, but replaces the attention block with a Conformer block (https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer) --> Uses AutoFeatureExtractor

  3. Hubert: An approach for self-supervised speech representation learning that utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. (https://huggingface.co/docs/transformers/model_doc/hubert#:~:text=Hubert%20Overview,by%20three%20unique%20problems) --> Uses AutoFeatureExtractor

  4. (Whisper): Pre-trained model for automatic speech recognition (ASR) and speech translation, trained on a large dataset and capable of generalizing to many datasets and domains without the need for fine-tuning. (https://huggingface.co/docs/transformers/model_doc/whisper ) --> WhisperFeatureExtractor extracts mel-filter bank features from raw speech

  5. UniSpeech: Learns speech representations from both unlabeled and labeled data, conducting supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning in a multi-task manner. (https://huggingface.co/docs/transformers/model_doc/unispeech#:~:text=UniSpeech%20is%20a%20speech%20model,to%20be%20decoded%20using%20Wav2Vec2CTCTokenizer) --> Uses AutoFeatureExtractor

Other possible models for waveforms: Data2VecAudio, SEW, SEW-D, UniSpeechSat, WavLM (https://huggingface.co/docs/transformers/tasks/audio_classification#:~:text=The%20task%20illustrated%20in%20this,Conformer%E3%80%91%2C%20%E3%80%90114%E2%80%A0WavLM%E3%80%91%2C%20%E3%80%90115%E2%80%A0Whisper%E3%80%91)
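A minimal sketch of the waveform route (assumed checkpoint and label count, not from this issue): classify a raw waveform with Wav2Vec2 via AutoFeatureExtractor. The same load-extract-forward pattern applies to Wav2Vec2-Conformer, Hubert, and UniSpeech.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

NUM_CLASSES = 21                      # placeholder label count
ckpt = "facebook/wav2vec2-base"       # assumed pre-trained checkpoint

feature_extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2ForSequenceClassification.from_pretrained(ckpt, num_labels=NUM_CLASSES)

# Placeholder waveform: 5 s of 16 kHz audio (Wav2Vec2 expects 16 kHz input).
waveform = np.random.randn(16_000 * 5).astype(np.float32)

inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # (batch, NUM_CLASSES)
predicted_class = int(logits.argmax(dim=-1))
```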

lurauch commented 10 months ago

Paper: Towards Learning Universal Audio Representations

reheinrich commented 9 months ago

Code & models pre-trained on AudioSet available:

Code & models pre-trained on audio datasets other than AudioSet available:

Only code available, but pre-trained model weights on audio are NOT available: