alexandrainst / coral

Danish ASR and TTS models associated with the CoRal project.

Feat/optimise building of asr dataset #79

Closed · saattrupdan closed this 5 months ago

saattrupdan commented 5 months ago

This optimises the building of the ASR dataset, reducing it from the previous ~5 hours to ~1 minute.

Uploading the dataset to the Hub still takes a fair amount of time, though, around 4-8 hours (I haven't finished a full upload yet, so this is based on tqdm estimates).

The following optimisations were made:

  1. Copy the database from the NAS to local disk, which reduces reading data from the database from ~1.5 hours to ~10 seconds. This also removes the need for batched reads, yielding a slight additional speedup (see the first sketch after this list).
  2. When matching metadata with audio files, we now build a filename -> path mapping up front and use dict.get lookups instead of the previous list comprehension. This reduces the matching step from ~1.5-2 hours to ~1 second (see the second sketch after this list).
  3. Perform the dataset splitting in batches, which reduces that step from ~1-2 hours to <1 second.
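To make the first optimisation concrete, here is a minimal sketch, assuming the metadata lives in a SQLite database on the NAS mount (the paths, table name and schema are hypothetical, not taken from the repo):

```python
import shutil
import sqlite3
from pathlib import Path

# Hypothetical paths; the actual NAS mount and database name are not in this PR.
NAS_DB_PATH = Path("/mnt/nas/coral/metadata.db")
LOCAL_DB_PATH = Path("/tmp/metadata.db")

# One up-front copy to local disk; all subsequent reads hit the fast local copy.
shutil.copy2(NAS_DB_PATH, LOCAL_DB_PATH)

with sqlite3.connect(LOCAL_DB_PATH) as connection:
    # With the database on local disk, a single fetchall is fast enough that
    # batched reads are no longer needed.
    rows = connection.execute("SELECT * FROM recordings").fetchall()
```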
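And the second optimisation, sketched with toy data (variable and column names hypothetical): building the filename -> path dictionary once turns each match into an O(1) lookup instead of an O(n) scan, so the overall matching drops from O(n²) to O(n):

```python
from pathlib import Path

# Toy stand-ins for the real data; names are hypothetical.
audio_paths = [Path("/data/audio/rec_001.wav"), Path("/data/audio/rec_002.wav")]
metadata_rows = [{"filename": "rec_001.wav"}, {"filename": "rec_003.wav"}]

# Build the filename -> path mapping once, in O(n).
filename_to_path = {path.name: path for path in audio_paths}

# Old approach, roughly: one O(n) list comprehension per row, O(n^2) overall.
# matches = [p for p in audio_paths if p.name == row["filename"]]

# New approach: one O(1) dict lookup per row; dict.get returns None when a
# metadata row has no matching audio file.
for row in metadata_rows:
    row["audio_path"] = filename_to_path.get(row["filename"])
```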

I tried to optimise the collection of all the audio paths, which currently takes ~1 minute, but none of my attempts worked. This step is responsible for ~99% of the time spent actually building the dataset.
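For reference, the path collection step boils down to a recursive scan of the audio directories, along the lines of the sketch below (the root path and file extension are assumptions); the cost is presumably dominated by filesystem metadata calls, which are hard to speed up from Python:

```python
from pathlib import Path

# Hypothetical audio root; the real directory layout is not shown in this PR.
AUDIO_ROOT = Path("/mnt/nas/coral/audio")

# Recursively collect every WAV file under the audio root. This is the step
# that takes ~1 minute and dominates the (post-optimisation) build time.
audio_paths = list(AUDIO_ROOT.rglob("*.wav"))
```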

Of course, the ultimate bottleneck is the uploading of the data, and I'm not sure how that can be improved. It does involve some I/O, since all the audio files have to be packaged into the Parquet format. The only way to speed this up might be to copy all the audio files to local disk first, but that imposes quite a large disk space requirement (~83 GB).
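For context, if the upload goes through 🤗 datasets' push_to_hub (which packages each split into Parquet shards before uploading), a sketch looks like the following; the repo id and shard size are hypothetical:

```python
from datasets import Dataset, DatasetDict

# Toy example; the real dataset holds the audio and metadata built above.
dataset = DatasetDict({"train": Dataset.from_dict({"text": ["hej", "verden"]})})

# push_to_hub shards each split into Parquet files and uploads them.
# max_shard_size controls shard granularity; requires a valid HF token.
dataset.push_to_hub("alexandrainst/coral-asr", max_shard_size="500MB")
```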

This PR also fixes the sample rate issue: I previously enforced a 16 kHz sampling rate, but we now use the native one, which is 48 kHz.
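Assuming the audio column is handled by 🤗 datasets, the fix roughly amounts to not forcing a resample when casting the audio column; a sketch with a hypothetical column name and file path:

```python
from datasets import Audio, Dataset

dataset = Dataset.from_dict({"audio": ["/data/audio/rec_001.wav"]})

# Previously: forced resampling to 16 kHz on decode.
# dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Now: omit sampling_rate so audio is decoded at its native rate (here 48 kHz).
dataset = dataset.cast_column("audio", Audio())
```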