finejuly / dcase2018_task2_cochlearai

Cochlear.ai submission for dcase2018 task2
MIT License

how melspectrogram is a vector not a matrix #1

Closed creativesh closed 5 years ago

creativesh commented 5 years ago

Hi, I ran your model code in utils/model.py and saw that the output shape of line 67 (melspectrogram) is (?, 64, ?, 1). This is confusing, because mel spectrograms are generally matrices, for example 96 x 64, but your output looks like an image: it has size 1 in the last dimension. Can you please explain?

finejuly commented 5 years ago

Hi creativesh,

In this framework, the melspectrogram is reshaped from (frequency, time) to (frequency, time, filter) so that we can apply 2D convolution to it as if it were an image. A melspectrogram is thus handled as a grayscale image, where frequency and time are treated as height and width (or width and height).
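To illustrate the reshape described above, here is a minimal NumPy sketch. It is not the code from utils/model.py; the sizes (64 mel bands, 96 frames) and the names `mel` and `mel_image` are assumptions chosen for the example. It only shows that adding a trailing axis of size 1 leaves the values untouched.

```python
import numpy as np

# Assumed sizes for illustration: 64 mel bands, 96 time frames.
n_mels, n_frames = 64, 96

# A mel spectrogram as an ordinary matrix: (frequency, time).
mel = np.random.rand(n_mels, n_frames).astype(np.float32)

# Add a trailing axis of size 1: (frequency, time, filter).
# This is the same layout as a grayscale image: (height, width, channels=1).
mel_image = mel[..., np.newaxis]

print(mel.shape)        # (64, 96)
print(mel_image.shape)  # (64, 96, 1)

# The data is unchanged; only the shape has an extra axis.
assert np.array_equal(mel_image[:, :, 0], mel)
```

So the matrix is still there; the extra axis only tells a 2D convolution layer that the input has a single channel.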

creativesh commented 5 years ago

Thank you. So what does each dimension mean? Why is one of the dimensions 1?

finejuly commented 5 years ago

The dimensions are (batch, frequency, time, filter). If your question is about how to use a melspectrogram with 2D convolution, you can check our paper or other related studies, including the baseline system for DCASE 2018 Task 2 (http://dcase.community/challenge2018/task-general-purpose-audio-tagging, https://arxiv.org/abs/1807.09902).
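As a rough sketch of why the last axis exists, the example below runs a naive 2D convolution over a small batch in plain NumPy. It is not the model from this repository: `conv2d_valid`, the batch size of 2, the 96 frames, and the 8 filters are all hypothetical choices for illustration. The "?" entries in (?, 64, ?, 1) are presumably the batch and time sizes, which are not fixed when the graph is built; here they are given concrete values.

```python
import numpy as np

def conv2d_valid(x, kernels):
    """Naive 'valid' 2D convolution (cross-correlation, as in most deep learning libraries).

    x:       (batch, frequency, time, in_filters)
    kernels: (kernel_h, kernel_w, in_filters, out_filters)
    returns: (batch, frequency - kernel_h + 1, time - kernel_w + 1, out_filters)
    """
    batch, height, width, _ = x.shape
    kh, kw, _, out_ch = kernels.shape
    out = np.zeros((batch, height - kh + 1, width - kw + 1, out_ch), dtype=x.dtype)
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw, :]  # (batch, kh, kw, in_filters)
            # Sum over kernel height, kernel width, and input filters.
            out[:, i, j, :] = np.tensordot(patch, kernels, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)

# Two clips, 64 mel bands, 96 frames, 1 input filter (the "grayscale" channel).
batch = rng.random((2, 64, 96, 1), dtype=np.float32)

# Eight hypothetical 3x3 filters applied to the single input channel.
kernels = rng.random((3, 3, 1, 8), dtype=np.float32)

features = conv2d_valid(batch, kernels)
print(batch.shape)     # (2, 64, 96, 1)
print(features.shape)  # (2, 62, 94, 8)
```

The input has 1 in the last axis because a melspectrogram carries one value per (frequency, time) bin. After a convolution layer, that axis holds one feature map per filter, which is why it is labeled "filter" rather than left out.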