descriptinc / descript-audio-codec

State-of-the-art audio codec with 90x compression factor. Supports 44.1kHz, 24kHz, and 16kHz mono/stereo audio.
https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a68f814d5
MIT License

Different code sizes when encoding versus when compressing #45

Open marypilataki opened 10 months ago

marypilataki commented 10 months ago

Hello,

Thanks again for the great work.

I am raising this issue because I get a different dimensionality for the codes when using model.encode than when using model.compress. To reproduce this, I used the script you provide under 'Programmatic Usage' in the README file.

For 10 seconds of audio @ 44100 Hz, z has a dimensionality of [1, 1024, 862] and codes has a dimensionality of [1, 9, 862]. These are the quantised continuous representation and the codebook indices, respectively, returned by the quantizer (ResidualVectorQuantize) when calling model.encode.

For the same 10 seconds of audio, z has a dimensionality of [1, 1024, 1152] and codes has a dimensionality of [1, 9, 1152]. These are the quantised continuous representation and the codebook indices, respectively, returned when calling model.compress, before the DAC file is created. It seems like the number of 72-frame chunks differs between the two cases? Am I misunderstanding something here?
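For reference, this is roughly what I ran, adapted from the README's 'Programmatic Usage' example (the input path is a placeholder for my 10-second file):

```python
import dac
from audiotools import AudioSignal

# Load the pretrained 44.1 kHz model, as in the README example
model_path = dac.utils.download(model_type="44khz")
model = dac.DAC.load(model_path)
model.to("cuda")

# 10 seconds of audio at 44100 Hz ("input.wav" is a placeholder)
signal = AudioSignal("input.wav")
signal.to(model.device)

# Case 1: encode the whole signal at once
x = model.preprocess(signal.audio_data, signal.sample_rate)
z, codes, latents, _, _ = model.encode(x)
print(z.shape, codes.shape)    # [1, 1024, 862], [1, 9, 862]

# Case 2: compress, which chunks the signal internally before
# writing the DAC file
dac_file = model.compress(signal.cpu())
print(dac_file.codes.shape)    # [1, 9, 1152]
```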

Thank you!

Mary

mazzzystar commented 3 months ago

I think the difference is caused by the padding here, though I'm not sure where exactly it happens. https://github.com/descriptinc/descript-audio-codec/blob/c7cfc5d2647e26471dc394f95846a0830e7bec34/dac/model/base.py#L188-L214
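To make what I mean concrete, here is a back-of-the-envelope sketch. The hop length of 512 is the 44 kHz model's encoder stride product; the window size is a guess, so it does not reproduce 1152 exactly (the windowing in the linked code is more involved), but it shows how padding each chunk separately inflates the total frame count:

```python
import math

hop_length = 512           # 44 kHz model: encoder strides 2 * 4 * 8 * 8
n_samples = 10 * 44100     # 10 s of audio at 44.1 kHz

# encode(): the whole signal is padded once to a multiple of hop_length
frames_encode = math.ceil(n_samples / hop_length)
print(frames_encode)       # 862, matching the shape reported above

# compress(): the signal is split into windows and each window is
# padded on its own before being encoded (window size is an assumption)
win_samples = 44100
n_chunks = math.ceil(n_samples / win_samples)
frames_per_chunk = math.ceil(win_samples / hop_length)
print(n_chunks * frames_per_chunk)   # larger than frames_encode
```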

stg1205 commented 3 months ago

Same question. I think it is related to the hop size used during compression, but I don't understand why, during decompression, the decoded chunks are concatenated without removing the hop segment.
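I'm not sure either, but a quick sanity check (a sketch along the lines of the README example; "input.wav" is a placeholder) would be to compare sample counts before and after the round trip:

```python
import dac
from audiotools import AudioSignal

# Sanity check (sketch): does compress -> decompress return the original
# number of samples, even though the codes carry extra padded frames?
model = dac.DAC.load(dac.utils.download(model_type="44khz"))
signal = AudioSignal("input.wav")   # placeholder path, 10 s @ 44.1 kHz

dac_file = model.compress(signal)
recon = model.decompress(dac_file)

print(signal.signal_length, recon.signal_length)
# If these match, the padding introduced during chunked compression is
# being stripped somewhere on the decompress side, even though
# dac_file.codes has more frames than model.encode produces.
```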