Refine VAD segmentation in short silences

Currently, the dataset splitter segments data according to the VAD settings, which can produce long segments (longer than 30 s, for example). The postprocessing then cuts these at exactly 30 s, which can place a cut in the middle of speech.

We need to update the splitting so that long segments are cut at a short silence close to the 30 s mark instead.

This can be done either at the level of the data builder (GPU accelerated) or at the level of a trainer transformation.
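As a rough illustration of the idea, here is a minimal NumPy sketch that cuts an over-long segment at the quietest frame within a few seconds before the 30 s limit. It is not code from this repository: the function names (`frame_energy`, `split_at_silence`), the parameters (`search_s`, `frame_s`) and the use of plain frame energy as a stand-in for the VAD's silence decision are all assumptions made for the example.

```python
import numpy as np


def frame_energy(audio: np.ndarray, frame_len: int) -> np.ndarray:
    """Mean squared amplitude of consecutive non-overlapping frames."""
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames.astype(np.float64) ** 2).mean(axis=1)


def split_at_silence(audio, sr, max_len_s=30.0, search_s=5.0, frame_s=0.02):
    """Split a 1-D signal into chunks no longer than max_len_s.

    Hypothetical helper: instead of cutting at exactly max_len_s, it looks
    back over the last search_s seconds before the limit and cuts right
    after the lowest-energy frame found there.
    """
    frame_len = max(1, round(sr * frame_s))
    max_frames = round(max_len_s / frame_s)
    # Keep the search window strictly inside the current chunk.
    search_frames = max(1, min(round(search_s / frame_s), max_frames - 1))
    max_samples = max_frames * frame_len
    energy = frame_energy(audio, frame_len)

    bounds = [0]  # chunk boundaries in samples, always frame aligned
    while len(audio) - bounds[-1] > max_samples:
        start = bounds[-1] // frame_len
        hi = start + max_frames          # hard limit (exclusive), in frames
        lo = hi - search_frames          # start of the search window
        quietest = lo + int(np.argmin(energy[lo:hi]))
        bounds.append((quietest + 1) * frame_len)  # cut after that frame
    bounds.append(len(audio))
    return [audio[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
```

The same search could run in either place mentioned above; a real implementation would probably reuse the VAD's frame-level speech probabilities rather than raw energy, and might fall back to a hard cut when no frame in the window is quiet enough.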

BUTSpeechFIT / huggingface_asr

Refine VAD segmentation in short silences #20