ChristianBergler / ANIMAL-SPOT

An Animal Independent Deep Learning Framework for Bioacoustic Signal Segmentation and Classification Including a Detailed User-Guide
GNU General Public License v3.0
Precision on sound duration for training #9

Closed liofeu closed 1 month ago

liofeu commented 1 month ago

Hello,

In https://github.com/ChristianBergler/ANIMAL-SPOT?tab=readme-ov-file#data-preparation, it is said:

The annotated bioacoustic data samples do not have to be of equal length (variable durations are possible). However, during training a fixed sequence length has to be chosen, which should be close to the average duration of the involved animal-specific target vocalization(s).

I am not sure I understand. Should all the sounds of a particular animal have the same duration during the training phase, or only those of a specific call type from a given animal?

All the best,

ChristianBergler commented 1 month ago

Hello,

During training you have to specify the parameter --sequence_len, which takes a duration in milliseconds (e.g. 1000 means 1 s of audio). Irrespective of how long your training data samples are, the model will "randomly" pick a 1 s audio chunk out of the actual sequence if it is longer than 1 s. If it is shorter, zero-padding is applied. That is why you have to think about a kind of "average call duration". If the calls in your training data differ greatly in duration, e.g. clicks of 2-5 ms compared to pulsed vocalizations of 500 ms up to 5-10 s, I would recommend training two different models.

Other than that, make sure that the calls in your training data contain mostly "animal vocalization" only, so that when the model randomly picks a part of an audio file annotated as animal sound, it really gets animal sound. If this is given, you are fine. The reason is that you want to avoid extracting a fixed-length signal part from an audio file that carries a specific label but does not contain the corresponding signal content (e.g. the file is labeled as noise, but the excerpt actually contains animal sound). I hope this helps.
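To illustrate the crop-or-pad behaviour described above, here is a minimal sketch in plain Python. It is not ANIMAL-SPOT's actual code: the function name `fix_length`, its arguments, and the choice to append the zeros at the end of short clips are assumptions made for illustration only.

```python
import random

def fix_length(samples, sample_rate, sequence_len_ms, rng=None):
    """Return exactly sequence_len_ms of audio (hypothetical helper).

    Longer clips are cropped at a random offset; shorter clips are
    zero-padded. Padding at the end is an assumption of this sketch.
    """
    rng = rng or random.Random()
    target = sample_rate * sequence_len_ms // 1000  # target length in samples
    n = len(samples)
    if n > target:
        # randint is inclusive on both ends, so the crop always fits
        start = rng.randint(0, n - target)
        return samples[start:start + target]
    return samples + [0.0] * (target - n)

sr = 16000
long_clip = [0.1] * (3 * sr)    # 3 s annotated call
short_clip = [0.1] * (sr // 2)  # 0.5 s annotated call
print(len(fix_length(long_clip, sr, 1000)))
print(len(fix_length(short_clip, sr, 1000)))
```

Both calls return 1 s of audio. For the 3 s clip, any 1 s window may be chosen, which is why the annotated file should contain mostly the target vocalization.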

Best, Christian