Open sammlapp opened 2 weeks ago

Unlike the test datasets, XCL seems to have a different structure that causes an error when initializing `BirdSetDataModule`:

results in

Oddly, I don't see anything in the example XCL training scripts that differs from how I am using XCL here. Can you help me understand what I'm doing wrong?
As I mentioned in your other issue, XCM/XCL do not have a test split available, so you need another datamodule. You can see this in the data config or in the experiment configs that use XCM/XCL, e.g.: example 1, example 2
Thank you, it makes sense that XCL doesn't have a test split, but I'm not sure how the configs you linked for training on XCM/XCL resolve the particular error I get above. Does it have to do with the `loaders:` section? https://github.com/DBD-research-group/BirdSet/blob/95921e953fee3cf9c5b395cc9dfb706db0149fcf/configs/experiment/birdset_neurips24/XCL/ast.yaml#L54
Part of my confusion may be because I'm unfamiliar with the deeply nested config structure of this repository. For instance, I don't know how to write a Python script that would use the config file in the examples you linked.
Hey @sammlapp, sorry for the late reply. I will review your comments in more detail at the beginning of next week; I was busy with other tasks the whole week. It might also be helpful to schedule a meeting, as discussing this in person could be more effective :)
Hey @sammlapp, to simplify things and avoid the nested Hydra configurations (which are mainly useful for our internal experiments), you can instead use the high-level configs, or manually import the relevant methods and adjust parameters as needed (listed in the respective experiments). In #266, I provided an example demonstrating how to use Hydra or manually load parameters. Does this make more sense to you now, and what is the current error you are facing?
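Something along these lines should work with Hydra's compose API from a plain script (a rough, untested sketch; the `config_path`, `config_name`, and `datamodule` key are assumptions about the repo layout, so adjust as needed):

```python
# Rough sketch (untested): compose an experiment config from a script with
# Hydra's compose API. config_path/config_name and the "datamodule" key are
# assumptions about the repo layout, not verified values.
from hydra import compose, initialize
from hydra.utils import instantiate

with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="train",  # assumed top-level config name
        overrides=["experiment=birdset_neurips24/XCL/ast"],
    )

dm = instantiate(cfg.datamodule)  # build the datamodule from the composed config
```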
Thanks, I will try using the high-level configs. Still, I am confused by the specific error I mentioned at the top of this issue: if I want to initialize the `BirdSetDataModule` directly from Python code, I cannot find any set of parameters that avoids that specific error.
Ah, still the same error as above. Did you try using `birdset.datamodule.pretrain_datamodule.PretrainDataModule` instead of the `BirdSetDataModule` to load XCL?
So something like this:

```python
from birdset.datamodule.base_datamodule import DatasetConfig
from birdset.datamodule.pretrain_datamodule import PretrainDataModule

dm = PretrainDataModule(
    dataset=DatasetConfig(
        data_dir="/scratch/birdset/XCL",  # specify your data directory!
        hf_path="DBD-research-group/BirdSet",
        hf_name="XCL",
        n_workers=3,
        val_split=None,
        task="multilabel",
        classlimit=500,  # limit of samples per class
        eventlimit=2,  # limit of events extracted from each sample
        sampling_rate=32_000,
        seed=42,
    ),
)

dm.prepare_data()
save_path = dm.disk_save_path
```
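Afterwards you can presumably reload the preprocessed dataset with the standard Hugging Face API instead of re-running `prepare_data()` (a sketch, assuming `disk_save_path` points at a dataset written via `save_to_disk`):

```python
# Reload the preprocessed dataset later without re-running prepare_data()
# (assumes dm.disk_save_path is a dataset written with datasets' save_to_disk).
from datasets import load_from_disk

ds = load_from_disk(save_path)
print(ds)  # inspect splits and columns
```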
This seems to be working, thank you (it's been running for an hour and is currently on the "One-hot-encoding" step). I wasn't aware of the `PretrainDataModule` class. Would it be accurate to say that the `BirdSetDataModule` is for inference or fine-tuning on the test sets and the `PretrainDataModule` is for training?
The preprocessing for XCL should take a while. In your case (with `eventlimit=2` and `classlimit=500`) you should end up with ~1.6 million samples. For our pre-trained models, we used `eventlimit=1` and `classlimit=500` (~700k samples, as far as I remember).
What the one-hot-encoding step does: it converts the `ebird_code_multilabel` column into one-hot labels (this may take some time since we have ~9700 classes, increasing the size of the dataset that is saved to disk). The `PretrainDataModule` is, so far, only for training with the larger datasets (XCM and XCL). Everything else is done with the `BirdSetDataModule`.
Thanks. It got further this time, but it tries to load a 111 GiB array into memory (presumably the one-hot labels; shape 1.5M x 9736), which causes an error. Is there a way to reduce memory usage? In OpenSoundscape I've started using sparse dtypes to avoid massive one-hot arrays, which greatly reduces memory use since the rows typically have only one or a few non-zero values (see the sketch after the traceback).
```
Traceback (most recent call last):
  File "/home/sml161/birdset_download/download_XCL.py", line 95, in <module>
    dm.setup(stage="fit")
  File "/home/sml161/BirdSet/birdset/datamodule/base_datamodule.py", line 362, in setup
    self.train_dataset = self._get_dataset("train")
  File "/home/sml161/BirdSet/birdset/datamodule/pretrain_datamodule.py", line 55, in _get_dataset
    self.train_label_list = dataset["labels"]
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2872, in __getitem__
    return self._getitem(key)
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2857, in _getitem
    formatted_output = format_table(
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 639, in format_table
    return formatter(pa_table, query_type=query_type)
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 405, in __call__
    return self.format_column(pa_table)
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/formatting/np_formatter.py", line 94, in format_column
    column = self.numpy_arrow_extractor().extract_column(pa_table)
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 162, in extract_column
    return self._arrow_array_to_numpy(pa_table[pa_table.column_names[0]])
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 203, in _arrow_array_to_numpy
    return np.array(array, copy=False)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 111. GiB for an array with shape (1528068, 9736) and data type int64
```
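For illustration, roughly what I mean by the sparse approach (a toy sketch, not actual OpenSoundscape code):

```python
# Rough sketch with toy data: store one-hot labels as a SciPy CSR matrix so
# only the non-zero entries are kept, instead of a dense 1528068 x 9736
# int64 array.
import numpy as np
from scipy import sparse

n_classes = 9736
label_lists = [[0], [3, 17], [9735]]  # class indices per sample (toy data)

rows = np.repeat(np.arange(len(label_lists)), [len(l) for l in label_lists])
cols = np.concatenate(label_lists)
one_hot = sparse.csr_matrix(
    (np.ones(len(cols), dtype=np.int8), (rows, cols)),
    shape=(len(label_lists), n_classes),
)
print(one_hot.nnz, "non-zero entries stored")  # memory scales with labels, not classes
```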
We should look into sparse dtypes! I guess we have some things to learn from OpenSoundscape haha

I just pushed a small update that should fix your issue. The `self.train_label_list` is now assigned with `dataset["labels"]` only when a weighted sampler is actually required.
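In other words, the logic is now roughly this (a sketch; the flag name is assumed rather than copied from the commit):

```python
# Sketch of the new behavior (attribute name is hypothetical): only
# materialize the full label matrix when a weighted sampler needs it.
if self.use_weighted_sampler:  # assumed flag name
    self.train_label_list = dataset["labels"]
```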