DBD-research-group / BirdSet

A benchmark dataset collection for bird sound classification
https://huggingface.co/datasets/DBD-research-group/BirdSet
BSD 3-Clause "New" or "Revised" License

Explore Baseline Models #35

Closed: lurauch closed this issue 9 months ago

lurauch commented 11 months ago

Models for spectrograms:

  1. ConvNeXT: A pure convolutional model (ConvNet) inspired by the design of Vision Transformers that claims to outperform them (a loading sketch follows this list). (https://huggingface.co/docs/transformers/model_doc/convnext#:~:text=ConvNeXT%20is%20a%20pure%20convolutional,art%20image)

  2. Swin Transformer: A hierarchical vision transformer that employs shifted windows to compute representations. (https://huggingface.co/docs/transformers/model_doc/swin)

  3. Audio Spectrogram Transformer (AST): Applies a Vision Transformer to audio by turning it into an image (spectrogram) and obtains state-of-the-art results for audio classification. (https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer#:~:text=In%20this%20paper%2C%20we%20answer,50%2C%20and%2098.1%25)

  4. ResNet: A convolutional neural network that employs residual connections to facilitate training much deeper networks, up to an unprecedented 1,000 layers. (https://huggingface.co/docs/transformers/model_doc/resnet#:~:text=ResNet%20Overview,5%E2%80%9D)

  5. EfficientNet: A model designed to uniformly scale network width, depth, and resolution using a compound coefficient, achieving state-of-the-art image-classification accuracy while being smaller and faster than previous models. (https://huggingface.co/docs/transformers/model_doc/efficientnet#:~:text=EfficientNet%20Overview,the%20paper%20is%20the%20following)
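A minimal sketch of the spectrogram route (not BirdSet code): convert a waveform to a log-mel spectrogram with torchaudio and feed it to a vision backbone such as ConvNeXT. The checkpoint, label count, sampling rate, and spectrogram parameters below are placeholders, not values from this issue.

```python
import torch
import torchaudio
from transformers import ConvNextForImageClassification

NUM_CLASSES = 21          # placeholder label count
SAMPLE_RATE = 32_000      # assumed sampling rate

# Vision backbone with a fresh classification head for bird classes.
model = ConvNextForImageClassification.from_pretrained(
    "facebook/convnext-tiny-224",     # assumed checkpoint
    num_labels=NUM_CLASSES,
    ignore_mismatched_sizes=True,     # replace the ImageNet head
)

# Waveform -> log-mel spectrogram "image".
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=320, n_mels=128
)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, SAMPLE_RATE * 5)            # placeholder: 5 s of audio
spec = to_db(to_mel(waveform))                        # (1, 128, frames)
pixel_values = spec.unsqueeze(0).repeat(1, 3, 1, 1)   # (batch, 3, 128, frames)

with torch.no_grad():
    logits = model(pixel_values=pixel_values).logits  # (batch, NUM_CLASSES)
```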

Models for waveforms:

  1. Wav2Vec2: Learns powerful representations from speech audio by solving a contrastive task; the model can then be fine-tuned on transcribed speech. (https://huggingface.co/docs/transformers/model_doc/wav2vec2#:~:text=Wav2Vec2%20Overview,on%20transcribed%20speech%20can). There is also one Wav2Vec2 model already trained on bird sounds (https://huggingface.co/Saads/bird_classification_model#:~:text=,Training%20procedure%20Training%20hyperparameters) --> Uses its own Wav2Vec2FeatureExtractor (a loading sketch follows after this list)

  2. Wav2Vec2-Conformer: Follows the same architecture as Wav2Vec2, but replaces the attention block with a Conformer block (https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer) --> Uses AutoFeatureExtractor

  3. Hubert: An approach for self-supervised speech representation learning that utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. (https://huggingface.co/docs/transformers/model_doc/hubert#:~:text=Hubert%20Overview,by%20three%20unique%20problems) --> Uses AutoFeatureExtractor

  4. (Whisper): Pre-trained model for automatic speech recognition (ASR) and speech translation, trained on a large dataset and capable of generalizing to many datasets and domains without the need for fine-tuning. (https://huggingface.co/docs/transformers/model_doc/whisper ) --> WhisperFeatureExtractor extracts mel-filter bank features from raw speech

  5. UniSpeech: Learns speech representations from both unlabeled and labeled data, conducting supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning in a multi-task manner. (https://huggingface.co/docs/transformers/model_doc/unispeech#:~:text=UniSpeech%20is%20a%20speech%20model,to%20be%20decoded%20using%20Wav2Vec2CTCTokenizer) --> Uses AutoFeatureExtractor

Other possible models for waveforms: Data2VecAudio, SEW, SEW-D, UniSpeechSat, WavLM (https://huggingface.co/docs/transformers/tasks/audio_classification#:~:text=The%20task%20illustrated%20in%20this,Conformer%E3%80%91%2C%20%E3%80%90114%E2%80%A0WavLM%E3%80%91%2C%20%E3%80%90115%E2%80%A0Whisper%E3%80%91)
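A minimal sketch of the waveform route (assumed checkpoint and label count, not from this issue): classify a raw waveform with Wav2Vec2 via AutoFeatureExtractor. The same load-extract-forward pattern applies to Wav2Vec2-Conformer, Hubert, and UniSpeech.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

NUM_CLASSES = 21                      # placeholder label count
ckpt = "facebook/wav2vec2-base"       # assumed pre-trained checkpoint

feature_extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2ForSequenceClassification.from_pretrained(ckpt, num_labels=NUM_CLASSES)

# Placeholder waveform: 5 s of 16 kHz audio (Wav2Vec2 expects 16 kHz input).
waveform = np.random.randn(16_000 * 5).astype(np.float32)

inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # (batch, NUM_CLASSES)
predicted_class = int(logits.argmax(dim=-1))
```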

lurauch commented 10 months ago

Paper: Towards Learning Universal Audio Representations

reheinrich commented 9 months ago

Code & models pre-trained on AudioSet available:

Code & models pre-trained on audio datasets other than AudioSet available:

Only code available, but pre-trained model weights on audio are NOT available: