Open sammlapp opened 2 weeks ago

Unlike the test datasets, XCL seems to have a different structure that causes an error when initializing `BirdSetDataModule`:

results in

Oddly, I don't see anything in the example XCL training scripts that differs from how I am using XCL here. Can you help me understand what I'm doing wrong?
As I mentioned in your other issue, XCM/XCL do not have a test split available, so you need another datamodule. You can see this in the data config or in the experiment configs that use XCM/XCL, e.g.: example 1, example 2
Thank you, it makes sense that XCL doesn't have a test split, but I'm not sure how the configs you linked for training on XCM/XCL resolve the particular error I get above. Does it have to do with the `loaders:` section? https://github.com/DBD-research-group/BirdSet/blob/95921e953fee3cf9c5b395cc9dfb706db0149fcf/configs/experiment/birdset_neurips24/XCL/ast.yaml#L54
Part of my confusion may be because I'm unfamiliar with the deeply nested config structure of this repository. For instance, I don't know how to write a Python script that would use the config file in the examples you linked.
Hey @sammlapp, sorry for the late reply. I will review your comments in more detail at the beginning of next week; I was busy with other tasks the whole week. It might also be helpful to schedule a meeting, as discussing this in person could be more effective :)
Hey @sammlapp, to simplify things and avoid the nested Hydra configurations (which are mainly useful for our internal experiments), you can instead use the high-level configs, or manually import the relevant methods and adjust parameters as needed (listed in the respective experiments). In #266, I provided an example demonstrating how to use Hydra or manually load parameters. Does this make more sense to you now, and what is the current error you are facing?
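Something along these lines should work with Hydra's compose API from a plain script (a rough, untested sketch; the `config_path`, `config_name`, and `datamodule` key are assumptions about the repo layout, so adjust as needed):

```python
# Rough sketch (untested): compose an experiment config from a script with
# Hydra's compose API. config_path/config_name and the "datamodule" key are
# assumptions about the repo layout, not verified values.
from hydra import compose, initialize
from hydra.utils import instantiate

with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="train",  # assumed top-level config name
        overrides=["experiment=birdset_neurips24/XCL/ast"],
    )

dm = instantiate(cfg.datamodule)  # build the datamodule from the composed config
```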
Thanks, I will try using the high-level configs. Still, I am confused by the specific error I mentioned at the top of this issue: if I want to initialize the `BirdSetDataModule` directly from Python code, I cannot find any set of parameters that avoids that specific error.
Ah, still the same error as above. Did you try using `birdset.datamodule.pretrain_datamodule.PretrainDataModule` instead of the `BirdSetDataModule` to load XCL?
So something like this:

```python
from birdset.datamodule.base_datamodule import DatasetConfig
from birdset.datamodule.pretrain_datamodule import PretrainDataModule

dm = PretrainDataModule(
    dataset=DatasetConfig(
        data_dir="/scratch/birdset/XCL",  # specify your data directory!
        hf_path="DBD-research-group/BirdSet",
        hf_name="XCL",
        n_workers=3,
        val_split=None,
        task="multilabel",
        classlimit=500,  # limit of samples per class
        eventlimit=2,  # limit of events extracted from each sample
        sampling_rate=32_000,
        seed=42,
    ),
)

dm.prepare_data()
save_path = dm.disk_save_path
```
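Afterwards you can presumably reload the preprocessed dataset with the standard Hugging Face API instead of re-running `prepare_data()` (a sketch, assuming `disk_save_path` points at a dataset written via `save_to_disk`):

```python
# Reload the preprocessed dataset later without re-running prepare_data()
# (assumes dm.disk_save_path is a dataset written with datasets' save_to_disk).
from datasets import load_from_disk

ds = load_from_disk(save_path)
print(ds)  # inspect splits and columns
```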
This seems to be working, thank you (it's been running for an hour and is currently on the "One-hot-encoding" step). I wasn't aware of the `PretrainDataModule` class. Would it be accurate to say that the `BirdSetDataModule` is for inference or fine-tuning on the test sets and the `PretrainDataModule` is for training?
The preprocessing for XCL should take a while. In your case (with `eventlimit=2` and `classlimit=500`) you should end up with ~1.6 million samples. For our pre-trained models, we used `eventlimit=1` and `classlimit=500` (~700k samples, as far as I remember).
What the one-hot-encoding step does: it converts the `ebird_code_multilabel` column into one-hot labels (this may take some time since we have ~9700 classes, increasing the size of the dataset that is saved to disk). The `PretrainDataModule` is, so far, only for training with the larger datasets (XCM and XCL). Everything else is done with the `BirdSetDataModule`.
Thanks. It got further this time, but it tries to load a 111 GiB array into memory (presumably the one-hot labels; shape 1.5M x 9736), which causes an error. Is there a way to reduce memory usage? In OpenSoundscape I've started using sparse dtypes to avoid massive one-hot arrays, which greatly reduces memory use since the rows typically have only one or a few non-zero values (see the sketch after the traceback).
```
Traceback (most recent call last):
  File "/home/sml161/birdset_download/download_XCL.py", line 95, in <module>
    dm.setup(stage="fit")
  File "/home/sml161/BirdSet/birdset/datamodule/base_datamodule.py", line 362, in setup
    self.train_dataset = self._get_dataset("train")
  File "/home/sml161/BirdSet/birdset/datamodule/pretrain_datamodule.py", line 55, in _get_dataset
    self.train_label_list = dataset["labels"]
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2872, in __getitem__
    return self._getitem(key)
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2857, in _getitem
    formatted_output = format_table(
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 639, in format_table
    return formatter(pa_table, query_type=query_type)
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 405, in __call__
    return self.format_column(pa_table)
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/formatting/np_formatter.py", line 94, in format_column
    column = self.numpy_arrow_extractor().extract_column(pa_table)
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 162, in extract_column
    return self._arrow_array_to_numpy(pa_table[pa_table.column_names[0]])
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 203, in _arrow_array_to_numpy
    return np.array(array, copy=False)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 111. GiB for an array with shape (1528068, 9736) and data type int64
```
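For illustration, roughly what I mean by the sparse approach (a toy sketch, not actual OpenSoundscape code):

```python
# Rough sketch with toy data: store one-hot labels as a SciPy CSR matrix so
# only the non-zero entries are kept, instead of a dense 1528068 x 9736
# int64 array.
import numpy as np
from scipy import sparse

n_classes = 9736
label_lists = [[0], [3, 17], [9735]]  # class indices per sample (toy data)

rows = np.repeat(np.arange(len(label_lists)), [len(l) for l in label_lists])
cols = np.concatenate(label_lists)
one_hot = sparse.csr_matrix(
    (np.ones(len(cols), dtype=np.int8), (rows, cols)),
    shape=(len(label_lists), n_classes),
)
print(one_hot.nnz, "non-zero entries stored")  # memory scales with labels, not classes
```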
We should look into sparse dtypes! I guess we have some things to learn from OpenSoundscape haha

I just pushed a small update that should fix your issue. The `self.train_label_list` is now assigned with `dataset["labels"]` only when a weighted sampler is actually required.
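In other words, the logic is now roughly this (a sketch; the flag name is assumed rather than copied from the commit):

```python
# Sketch of the new behavior (attribute name is hypothetical): only
# materialize the full label matrix when a weighted sampler needs it.
if self.use_weighted_sampler:  # assumed flag name
    self.train_label_list = dataset["labels"]
```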