hbellafkir opened this issue 1 month ago
Hi! If I remember correctly, we tried this approach earlier but abandoned it since it didn't show much impact back then. However, that was during the early stages of our experiments, so it might be worth revisiting. Good suggestion! If we can get an SSL model trained on XCL, it would also provide an interesting opportunity for fine-tuning comparisons.
I'd love to discuss this further; I also remember your AL paper for bird sounds. Feel free to reach out via email if you're interested: lukas.rauch@uni-kassel.de
Thanks for opening this thread @hbellafkir. @lurauch I would also love to know what workflow you would recommend for fine-tuning. Would it require building a BirdSet-like dataset, or downloading model weights and working from another training pipeline? These foundation models trained on XCL seem very promising!
Hey @paulpeyret-biophonia, apologies for the delayed response, I just saw your message.
In BirdSet, there are various ways to approach fine-tuning. One option is to leverage a foundation model with pretrained weights (it does not really matter whether SSL or SL) and fine-tune it on the specific train subsets for each test dataset. However, the issue of domain/covariate shift from focals to soundscapes in the test set is still tricky.
A more practical approach might be to take a slice (or multiple slices for cross-validation) of the test data to create a new, in-domain fine-tuning dataset. This would allow the trained model to adjust better to the test data (we did something like this here).
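A minimal sketch of such a split, purely illustrative (the clip count and the fold loop below are placeholders, not BirdSet API):
# illustrative only: carve in-domain folds out of a soundscape test set
from sklearn.model_selection import KFold
import numpy as np
n_test_clips = 1000  # placeholder for the number of clips in the soundscape test set
clip_indices = np.arange(n_test_clips)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (rest_idx, slice_idx) in enumerate(kf.split(clip_indices)):
    # e.g., fine-tune on the clips in slice_idx and evaluate on rest_idx (or vice versa),
    # then aggregate metrics across the folds
    pass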
If you'd like, you can also contact me via mail - I'd be happy to discuss this in more detail!
Hi @lurauch, thanks for providing this excellent resource! Similar to the requests above, I think it would be useful to provide a notebook or script simply demonstrating how to use one of the provided checkpoints as a starting point for training/fine-tuning.
This might be very simple, for instance just downloading the checkpoint, adding a config that includes a ckpt_path key, and using the built-in train.py script. However, the Hugging Face page doesn't seem to provide a .ckpt file like the local checkpoint used in this config, so I'm not sure how to create the appropriate checkpoint from the Hugging Face resources.
Hey @sammlapp
I see, we should add an additional notebook with a brief guide and focus more on the fine-tuning aspects. A short explanation:
To run an LT experiment, you can use this example configuration with eval.py as follows:
python birdset/eval.py experiment="birdset_neurips24/$EXPERIMENT_PATH"
This setup performs only inference and logit masking using larger models. The HF checkpoint is loaded through module.network.model.checkpoint. You can modify the checkpoint path in the MultilabelModule when preparing the model for training (with or without Hydra).
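For example, assuming the standard Hydra override syntax and that this key is exposed by the experiment config, the checkpoint could be swapped directly on the command line:
python birdset/eval.py experiment="birdset_neurips24/$EXPERIMENT_PATH" module.network.model.checkpoint="DBD-research-group/ConvNeXT-Base-BirdSet-XCL"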
Currently, we don't have specific configurations for pretraining on, e.g., XCL and fine-tuning on specific subsets (this is an interesting direction to explore further). However, you could just load the model checkpoint in DT experiments to fine-tune. Right now, we support only full-model fine-tuning; layer-specific fine-tuning (e.g., last layer only) is in development, as we plan to integrate SSL models soon.
For DT experiments, you can utilize HF pre-trained checkpoints directly, substituting them with our own checkpoints for re-training. This is an example model configuration where you can see that we just use a checkpoint from HF.
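A hedged sketch of that substitution in plain transformers code (the num_labels value is made up for illustration; the rest mirrors the from_pretrained pattern used elsewhere in this thread):
# load the BirdSet XCL checkpoint instead of an ImageNet-pretrained one and
# re-initialize the classification head for a downstream label set
from transformers import ConvNextForImageClassification
model = ConvNextForImageClassification.from_pretrained(
    "DBD-research-group/ConvNeXT-Base-BirdSet-XCL",
    num_labels=21,                 # e.g., number of classes in the downstream subset (illustrative)
    ignore_mismatched_sizes=True,  # replaces the 9736-way head with a freshly initialized one
)
# ...then fine-tune with your usual training loop or Lightning module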
The ckpt files are rather useful for training with Lightning; they capture all model parameters and other snapshots, enabling training resumption. Alternatively, you can, of course, load the model's state_dict directly, bypassing HF if HF compatibility isn't required (this is especially relevant for SSL models, which we load with state_dict manually).
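A rough sketch of that second route, assuming a standard Lightning checkpoint layout (the path is a placeholder, and the "model." prefix handling is an assumption about how the LightningModule names its network attribute):
import torch
from transformers import ConvNextForImageClassification
# placeholder path to a Lightning checkpoint from a BirdSet training run
ckpt = torch.load("path/to/checkpoint.ckpt", map_location="cpu", weights_only=False)
state_dict = ckpt["state_dict"]  # Lightning stores the weights here, next to optimizer state, epoch, etc.
# assumption: the LightningModule keeps the network under an attribute called "model",
# so parameter names carry a "model." prefix that is stripped before loading
state_dict = {k.removeprefix("model."): v for k, v in state_dict.items()}
# any compatible module works as the target; the HF class is just a convenient way to build one
model = ConvNextForImageClassification.from_pretrained(
    "DBD-research-group/ConvNeXT-Base-BirdSet-XCL", ignore_mismatched_sizes=True
)
missing, unexpected = model.load_state_dict(state_dict, strict=False)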
Generally, loading a model checkpoint from HF is straightforward (and right now the intended way), as shown in this example:
print(f"Loading only HF model from {self.checkpoint}")
self.model = ASTForAudioClassification.from_pretrained(
    self.checkpoint,
    num_labels=self.num_classes,
    cache_dir=self.cache_dir,
    ignore_mismatched_sizes=True,
)
EDIT:
If I remember correctly, @raphaelschwinger started to work with more fine-tuning / linear embedding models in BirdSet. Maybe he can add something here :)
Hi all, I am currently working on fine-tuning audio models on BirdSet data. I hope to share results and runnable code next week!
Thanks. It took me a while to realize I could use the classes from the transforms package and their load_pretrained() method to load from Hugging Face. The code examples provided on the Hugging Face page don't work.
I'm trying to use the model directly with PyTorch and have eventually figured out how to run inference, but two pieces of information are missing for the checkpoint files: the class list and the preprocessing settings.
(1) the class list (>9000 classes) of the model does not seem to match the .txt file listing BirdNET classes (~6000 classes). How should I obtain the class list for any HF model checkpoint?
(2) I'm using the default preprocess configuration by initializing BirdSetTransformsWrapper() without arguments. How should it be initialized to ensure that the settings match those expected by any given HF model checkpoint?
from birdset.datamodule.components.transforms import BirdSetTransformsWrapper
import librosa
from transformers import ConvNextForImageClassification
import torch
from glob import glob
from pathlib import Path
model = ConvNextForImageClassification.from_pretrained(
    "DBD-research-group/ConvNeXT-Base-BirdSet-XCL",
    cache_dir=".",
    ignore_mismatched_sizes=True,
)
# how to pass a config file that corresponds to the model checkpoint's preprocessing?
preprocessor = BirdSetTransformsWrapper()
# call transform_values() with batch where batch is dict with key "audio", and items in "audio" are dict with key "array" of audio samples
audio_files = glob("/home/sml161/sample_audio/*.mp3")
all_samples = []
clip_source_files = []
# run inference on each audio file
for audio_file in audio_files:
    samples, sr = librosa.load(audio_file, sr=32000)
    # divide into 5s segments
    clip_samples = 5 * sr
    n_clips = len(samples) // clip_samples
    samples = [
        {"array": samples[i * clip_samples : (i + 1) * clip_samples]}
        for i in range(n_clips)
    ]
    clip_source_files.extend([Path(audio_file)] * n_clips)
    all_samples.extend(samples)
batch = {"audio": all_samples,"labels": [],}
samples, labels = preprocessor.transform_values(batch)
samples.shape #torch.Size([273, 1, 128, 1024])
model.eval()
model.to("cuda")
samples = samples.to("cuda")
labels = labels.to("cuda")
with torch.no_grad():
    outs = model(samples)
outs.logits.shape # torch.Size([273, 9736])
Thanks for your help
Hey @sammlapp
The class list of pre-trained models corresponds to the datasets they were trained on (same indices). To get the class list, you can visit this link on HF or use the following code example:
import datasets
dataset_meta = datasets.load_dataset_builder("dbd-research-group/BirdSet", "XCL")
dataset_meta.info.features["ebird_code"]
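Assuming this feature is a standard datasets ClassLabel (which the .names access relies on), you can also inspect it directly:
classes = dataset_meta.info.features["ebird_code"]
print(classes.num_classes)  # should match the model's output dimension (9736 for the XCL checkpoint)
print(classes.names[:5])    # ordered class list; index i corresponds to logit i
print(classes.int2str(0))   # map a single logit index back to an eBird code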
This is also a good point; we should add this information at least to the models on HF.
You are correct that there is no direct link between the pre-trained model checkpoint and the corresponding preprocessing settings. In light of the current interest, we should make this explicit and not keep it obscured within the experiments section.
You could, for example, use the respective settings from an LT ConvNext experiment and manually adjust the parameters. Here, we use bird_default_multilabel.yaml for the transforms.
As far as I know, you could also initialize the BirdSetTransformsWrapper preprocessor (externally) with Hydra:
import os
import hydra
from omegaconf import DictConfig, OmegaConf
from birdset.datamodule.components.transforms import BirdSetTransformsWrapper
hydra.initialize(config_path="configs") # relative to current working dir
os.environ['PROJECT_ROOT'] = 'root'
# load the experiment config
cfg = hydra.compose(config_name="train", overrides=["experiment=birdset_neurips24/HSN/LT/convnext"])
# get the transform config
transform_cfg = cfg.datamodule.transforms
# to allow del
OmegaConf.set_struct(transform_cfg, False)
# delete background noise since an error could occur when the path has no files (even if it isn't used)
# we should change this
del transform_cfg.waveform_augmentations["background_noise"]
# initialize the transform
preprocessor = hydra.utils.instantiate(transform_cfg)
Thank you, this is very helpful. So it seems there is a one-to-one correspondence between the configs in this folder and the Hugging Face models.
(btw I think there's a typo: I needed ebird_code rather than ebird_codes)
Ok, so for the sake of other users looking to use the pre-trained Hugging Face models, here is a working (if not very elegant) script to run inference with the correct preprocessing and pretrained weights/architecture. (As for fine-tuning, I'm hoping that it will not be too hard to use either the BirdSet configs/scripts or pure PyTorch now that inference is working.)
Note that I copied the entire "configs" folder from the birdset repo to the script's directory.
@lurauch if there is a more elegant/recommended way to use the pre-trained Hugging Face models outside of the BirdSet repo, an example would be great.
import datasets
dataset_meta = datasets.load_dataset_builder("dbd-research-group/BirdSet", "XCL")
classes = dataset_meta.info.features["ebird_code"]
class_list = classes.names
import numpy as np
import pandas as pd
from glob import glob
from pathlib import Path
import os
import hydra
from omegaconf import DictConfig, OmegaConf
# from birdset.datamodule.components.transforms import BirdSetTransformsWrapper
import librosa
from transformers import ConvNextForImageClassification
import torch
from pathlib import Path
## create the preprocessor ##
dataset_meta = datasets.load_dataset_builder("dbd-research-group/BirdSet", "XCL")
classes = dataset_meta.info.features["ebird_code"]
class_list = np.array(classes.names)
# note that I copied the entire "configs" folder from the birdset repo to the current working directory
hydra.initialize(config_path="configs") # relative to current working dir
os.environ["PROJECT_ROOT"] = "root"
# load the experiment config
cfg = hydra.compose(
    config_name="train", overrides=["experiment=birdset_neurips24/XCL/convnext"]
)
# get the transform config
transform_cfg = cfg.datamodule.transforms
# to allow del
OmegaConf.set_struct(transform_cfg, False)
# delete background noise since an error could occur when the path has no files (even if it isn't used)
# we should change this
del transform_cfg.waveform_augmentations["background_noise"]
# initialize the transform
preprocessor = hydra.utils.instantiate(transform_cfg)
## create model object from pre-trained Hugging Face model
model = ConvNextForImageClassification.from_pretrained(
    "DBD-research-group/ConvNeXT-Base-BirdSet-XCL",
    cache_dir=".",
    ignore_mismatched_sizes=True,
)
# using `preprocessor`:
# call transform_values() with batch where batch is dict with key "audio", and items in "audio" are dict with key "array" of audio samples
audio_files = [
    "~/woth_5s.wav",
]
all_samples = []
clip_source_files = []
# prepare a batch of audio samples
for audio_file in audio_files:
    samples, sr = librosa.load(audio_file, sr=32000)
    # divide into 5s segments
    clip_samples = 5 * sr
    n_clips = len(samples) // clip_samples
    samples = [
        {"array": samples[i * clip_samples : (i + 1) * clip_samples]}
        for i in range(n_clips)
    ]
    clip_source_files.extend([Path(audio_file)] * n_clips)
    all_samples.extend(samples)
batch = {
    "audio": all_samples,
    "labels": torch.zeros(len(all_samples)).unsqueeze(1),
}
# use the preprocessor to create the input tensors (spectrograms) and labels
# in the model's expected format
samples, labels = preprocessor.transform_values(batch)
# run a forward pass on the batch
model.eval()
device = torch.device("cuda:0")
model.to(device)
samples = samples.to(device)
labels = labels.to(device)
with torch.no_grad():
    outs = model(samples)
logits = outs.logits.detach().cpu().numpy()
# check which classes were detected
np.array(class_list)[logits[1] > 0] # correctly detects wood thrush
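As an optional follow-up (not part of the script above, and just one way to organize the output), the logits can be turned into labelled per-clip scores:
# convert multilabel logits to sigmoid scores and label rows/columns for readability
scores = torch.sigmoid(torch.from_numpy(logits)).numpy()
scores_df = pd.DataFrame(scores, columns=class_list, index=[f.name for f in clip_source_files])
print(scores_df.max().sort_values(ascending=False).head(10))  # top-scoring classes across all clips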
Thank you for sharing your code! Would you be okay with us including a version of it in our tutorial notebooks?
Regarding the utilization of pre-trained models: Hugging Face addresses this with a built-in preprocess method for each model. Unfortunately, we found this approach not flexible enough for our case. We should still look into it again. Currently, I don't have a more "elegant" method for utilizing the models, but I'm actively working on something towards fine-tuning in SSL (where I also have to load the SL-trained models for comparison). I'll share an update as soon as I develop a better solution.
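For reference, the generic pattern being referred to looks roughly like the sketch below, using AST as an example (whether the BirdSet checkpoints ship a matching preprocessor config is an open question, and the default extractor settings here are not guaranteed to match BirdSet training):
from transformers import ASTFeatureExtractor
import numpy as np
extractor = ASTFeatureExtractor()  # default AudioSet-style settings
waveform = np.zeros(16000 * 5, dtype=np.float32)  # 5 s of silence as a stand-in input
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
# inputs["input_values"] could then be fed to, e.g., ASTForAudioClassification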
Thanks, yes feel free to add this script or a variant of it. Also, my motivation for getting this running is ultimately to add access to these models to the bioacoustics model zoo, if @lurauch and the other project owners support this idea. I hope that I'll be able to provide a simple API for both inference and fine-tuning of the pretrained models.
The model zoo sounds great! I could definitely work on decoupling the models from this repository to make them more suitable for a general-purpose setting (and utilize parts of the repo for fine-tuning). Perhaps we can discuss it in more detail so I can better understand the requirements to ensure integration with your model zoo?
Sure, just email me to set up a time.
Thanks for sharing your work. Did you try fine-tuning on relevant species after pre-training on XCL?