ChristianBergler / ANIMAL-SPOT

An Animal Independent Deep Learning Framework for Bioacoustic Signal Segmentation and Classification Including a Detailed User-Guide
GNU General Public License v3.0

Dataset splitting not working #3

Closed danuta-w closed 1 year ago

danuta-w commented 1 year ago

Hi,

When setting up training for an ANIMAL-SPOT binary classification, I run into a strange error. The dataset does not seem to be split according to the values specified in main.py. As you can see in the log and error messages below, the training set contains 0 files, whereas the validation and test splits contain the remaining files.

When I run the script multiple times, it is random which of the train, val, or test datasets ends up empty. In every re-run one of the splits contains 0 files, which results in the error below.

What can I do?

Greetings, Danuta

15:18:01|I|Found 6878 audio files for training.
15:18:01|I|Model predict 2 classes
15:18:01|D|Generating /home/scb/scripts/ANIMAL-SPOT/ANIMAL-DATA/val.csv
15:18:01|D|Generating /home/scb/scripts/ANIMAL-SPOT/ANIMAL-DATA/bkp/test.csv
15:18:01|I|Init dataset train...
15:18:01|D|Number of files : 0
15:18:01|D|Init augmentation transforms for time and pitch shift
15:18:01|D|No noise augmentation
15:18:01|D|Init min-max-normalization activated
15:18:01|I|Init dataset val...
15:18:01|D|Number of files : 4184
15:18:01|D|Number of samples in val for noise: 3455
15:18:01|D|Number of samples in val for target: 729
15:18:01|D|Running without augmentation
15:18:01|D|Init min-max-normalization activated
15:18:01|I|Init dataset test...
15:18:01|D|Number of files : 2694
15:18:01|D|Number of samples in test for noise: 2545
15:18:01|D|Number of samples in test for target: 149
15:18:01|D|Running without augmentation
15:18:01|D|Init min-max-normalization activated
Traceback (most recent call last):
  File "/home/scb/scripts/ANIMAL-SPOT/ANIMAL-SPOT//main.py", line 454, in <module>
    dataloaders = {
  File "/home/scb/scripts/ANIMAL-SPOT/ANIMAL-SPOT//main.py", line 455, in <dictcomp>
    split: torch.utils.data.DataLoader(
  File "/home/scb/miniconda3/envs/animalspot/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 351, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
  File "/home/scb/miniconda3/envs/animalspot/lib/python3.10/site-packages/torch/utils/data/sampler.py", line 107, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0
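
For reference, the failure itself is independent of ANIMAL-SPOT: with shuffle enabled, PyTorch's DataLoader builds a RandomSampler, and a data source of length 0 triggers exactly this ValueError at construction time. A minimal sketch (the EmptyDataset class is a hypothetical stand-in for the empty train split, not ANIMAL-SPOT code):

```python
from torch.utils.data import Dataset, DataLoader

class EmptyDataset(Dataset):
    """Stand-in for a train split that received zero files."""
    def __len__(self):
        return 0
    def __getitem__(self, idx):
        raise IndexError(idx)

# shuffle=True makes DataLoader construct a RandomSampler, which rejects
# a data source of length 0 with "num_samples should be a positive integer
# value, but got num_samples=0".
loader = DataLoader(EmptyDataset(), batch_size=16, shuffle=True)
```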
ChristianBergler commented 1 year ago

Hi Danuta, I am not aware of your data structure, but I have a guess about what might have gone wrong. ANIMAL-SPOT internally expects the following filename structure: "label_id_year_tape_startlabeltime_endlabeltime". Based on the "year" and "tape" information it internally builds a set of "recording tapes" from the given data; a recording tape is always the combination of year and tape name.

When ANIMAL-SPOT does the data split (automatically), it makes sure that NONE of the tapes are shared across partitions, in order to avoid "cheating": audio data from the same tape, distributed across training and test, makes it easier for the model, because it has already seen that data during training.

And I think this is your problem. Very likely the number of distinct tapes in your case is small, so ANIMAL-SPOT puts everything into one of the buckets and nothing is left for the remaining ones. If you do not have more distinct tapes and everything comes from, e.g., a single recording, you can also "fool" ANIMAL-SPOT by filling in the "year_tape" information in an artificially random way, to simulate different recording tapes. That should solve your problem.
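
A rough sketch of that renaming (not part of ANIMAL-SPOT; it assumes all audio files sit directly in one directory and already follow the label_id_year_tape_startlabeltime_endlabeltime scheme; DATA_DIR, N_PSEUDO_TAPES, and the regex are illustrative assumptions):

```python
import os
import random
import re

DATA_DIR = "/path/to/ANIMAL-DATA"  # adjust; per-class subfolders need an extra loop
N_PSEUDO_TAPES = 10                # enough distinct pseudo-tapes for a train/val/test split

# Assumed naming scheme: label_id_year_tape_startlabeltime_endlabeltime.ext
pattern = re.compile(
    r"^(?P<label>[^_]+)_(?P<id>[^_]+)_(?P<year>[^_]+)_(?P<tape>[^_]+)_"
    r"(?P<start>[^_]+)_(?P<end>[^_.]+)(?P<ext>\.\w+)$"
)

for name in os.listdir(DATA_DIR):
    m = pattern.match(name)
    if m is None:
        continue  # leave files that do not follow the naming scheme untouched
    fake_tape = random.randrange(N_PSEUDO_TAPES)
    new_name = (
        f"{m['label']}_{m['id']}_{m['year']}_tape{fake_tape:02d}_"
        f"{m['start']}_{m['end']}{m['ext']}"
    )
    os.rename(os.path.join(DATA_DIR, name), os.path.join(DATA_DIR, new_name))
```

Randomizing per file deliberately scatters clips from the same original recording across pseudo-tapes, so some train/test leakage is accepted as the trade-off when only one real tape exists.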

danuta-w commented 1 year ago

Hi Christian,

Thank you for getting back to me so quickly. My data are indeed from long deployments of acoustic tags on penguins. Even though the recordings are saved in shorter chunks, I use a single deployment ID and times relative to the start of the deployment, so that the audio can be synchronized with the data from the tag's other sensors. I will try renaming the files and give the training another go.

Thanks again for your help, Danuta

danuta-w commented 1 year ago

Hi again,

The training works with the randomized tape names. Thanks again!

Best, Danuta