bmcfee / ismir2017_chords

ISMIR 2017: structured training for large vocab chord recognition
BSD 2-Clause "Simplified" License

Using the Structured Training approach with a fully convolutional architecture #6

Open cloudscapes opened 4 years ago

cloudscapes commented 4 years ago

Hi Brian,

I've built a fully convolutional UNet architecture for chord recognition. Looking at your code here, I see that you use a GRU for decoding and get your predictions from a TimeDistributed layer that wraps a Dense layer. Is there a way to use your structured training approach with a non-recurrent architecture? By that, I mean getting the root, bass, and pitch predictions out of a fully convolutional architecture, in my case a modified UNet? In case my question seems too general, I'll post my architecture below so you can see exactly what I mean. My input is an HCQT with h = [1, 2, 3].

As you can see, I don't use any recurrent or dense layers (it's a fully convolutional architecture). The MultiResBlocks replace normal grid-based convolutions with an adaptive version and allow the network to adjust its receptive field automatically. I want to see whether this semantic-segmentation architecture can localize simultaneously active pitches in time and is useful for chord recognition. Still, I'm a bit baffled as to how I can make your structured training approach work, since I don't use a recurrent decoder. I can only hope that the structured training approach is not tied to a recurrent architecture?

Also, to extract my features I created a pump just the way you did in your code, i.e.:

```python
p_feature = pumpp.feature.HCQTMag(name='cqt', sr=sr, hop_length=hop_length,
                                  over_sample=3, harmonics=[1, 2, 3],
                                  log=True, conv='tf', n_octaves=6)
p_chord_tag = pumpp.task.ChordTagTransformer(name='chord_tag', sr=sr,
                                             hop_length=hop_length, sparse=True)
p_chord_struct = pumpp.task.ChordTransformer(name='chord_struct', sr=sr,
                                             hop_length=hop_length, sparse=True)
pump = pumpp.Pump(p_feature, p_chord_tag, p_chord_struct)
```

and later on I transformed my audio and jam files using the convert function in your code.
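For reference, this is roughly how I inspect a single transformed track (my own sketch, with placeholder file names, calling pump.transform directly rather than your convert wrapper):

```python
# Sketch only: look at the arrays the pump produces for one (placeholder) track.
data = pump.transform('some_track.wav', 'some_track.jams')

print(data['cqt/mag'].shape)             # HCQT magnitudes: (1, n_frames, n_bins, n_harmonics)
print(data['chord_tag/chord'].shape)     # sparse chord-tag labels, one per frame
print(data['chord_struct/pitch'].shape)  # binary pitch-class targets: (1, n_frames, 12)
print(data['chord_struct/root'].shape)   # sparse root targets
print(data['chord_struct/bass'].shape)   # sparse bass targets
```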

But again, you do:

```python
p0 = K.layers.Dense(len(pump['chord_tag'].vocabulary()), activation='softmax',
                    bias_regularizer=K.regularizers.l2())
p1 = K.layers.TimeDistributed(p0)(rs)
model = K.models.Model(x, p1)
```

and later on, you do

```python
model.fit_generator(gen_train.tuples('cqt/mag', 'chord_tag/chord'), 512, 100,
                    validation_data=gen_val.tuples('cqt/mag', 'chord_tag/chord'),
                    validation_steps=1024,
                    callbacks=[K.callbacks.ModelCheckpoint('/home/bmcfee/working/chords/model_simple_ckpt.pkl',
                                                           save_best_only=True, verbose=1),
                               K.callbacks.ReduceLROnPlateau(patience=5, verbose=1),
                               K.callbacks.EarlyStopping(patience=15, verbose=0)])
```

Can I use model.fit_generator the way you do here for my UNet architecture too, or is chord_tag/chord also tied to a recurrent architecture?

Sorry for the long text, but I'm just very confused about whether I can use pumpp and the chord tag / chord transformers in my case too. The libraries you have written are very clean and work well, and I'd rather use them.

Any help would be highly appreciated.

Looking forward to your answer.

Cheers, H

Conceptual adaptive MultiResUNet:

```python
mresblock1 = MultiResBlock((2, 2), 9, 32, x, block_num=1)
pool1 = MaxPooling2D(pool_size=(2, 2), data_format='channels_first')(mresblock1)
mresblock1 = ResPath((2, 2), 9, 32, 4, mresblock1)

mresblock2 = MultiResBlock((2, 2), 9, 32*2, pool1, block_num=2)
pool2 = MaxPooling2D(pool_size=(2, 2), data_format='channels_first')(mresblock2)
mresblock2 = ResPath((2, 2), 9, 32*2, 3, mresblock2)

mresblock3 = MultiResBlock((2, 2), 9, 32*4, pool2, block_num=3)
pool3 = MaxPooling2D(pool_size=(2, 2), data_format='channels_first')(mresblock3)
mresblock3 = ResPath((2, 2), 9, 32*4, 2, mresblock3)

mresblock4 = MultiResBlock((2, 2), 9, 16*8, pool3, block_num=4)

up5 = concatenate([trans_conv2d(mresblock4, 32*4, 2, 2, strides=(2, 2), padding='same'),
                   mresblock3], axis=1)
mresblock5 = MultiResBlock((2, 2), 9, 32*4, up5, block_num=5)

up6 = concatenate([trans_conv2d(mresblock5, 32*2, 2, 2, strides=(2, 2), padding='same'),
                   mresblock2], axis=1)
mresblock6 = MultiResBlock((2, 2), 9, 32*2, up6, block_num=6)

up7 = concatenate([trans_conv2d(mresblock6, 32*1, 2, 2, strides=(2, 2), padding='same'),
                   mresblock1], axis=1)
mresblock8 = MultiResBlock((2, 2), 9, 32*1, up7, block_num=7)

conv9 = conv2d_bn(mresblock8, len(pump['chord_tag'].vocabulary()), 1, 1, activation='sigmoid')
```

bmcfee commented 4 years ago

Is there a way to use your structured training approach with a non-recurrent architecture? By that, I mean getting the root, bass, and pitch predictions out of a fully convolutional architecture, in my case a modified UNet?

Sure, the structured output idea is totally independent of the processing by which you get to the output layer. I see no reason why a unet wouldn't work for this, but our motivation for recurrence came from two ideas:

  1. Chords do not occupy a fixed amount of time (across all songs), but rather depend on tempo
  2. Not all chord tones need to be present at every frame to register as a chord, e.g. when you have arpeggios or some kind of implied harmony. Convolutional filters will have a hard time modeling this, but bidirectional RNNs could integrate these time variations out to produce a more stable representation.

That said, a UNet would probably work pretty well here. It would at least be interesting to see where the two architectures agree or disagree. The key thing is to make sure that your output layer produces estimates at the same frame rate as your labels (which, in our implementation, was sampled to match the input frame rate).
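For example, one quick way to check this (just a rough sketch, with placeholder file names) is to compare the time dimension of your network's output against the label array the pump produces:

```python
# Rough sanity check: the model should emit one prediction per label frame.
data = pump.transform('some_track.wav', 'some_track.jams')

y_true = data['chord_tag/chord']          # (1, n_frames, ...) chord-tag labels
y_pred = model.predict(data['cqt/mag'])   # whatever your unet emits for this input

assert y_pred.shape[1] == y_true.shape[1], \
    'output frames ({}) do not match label frames ({})'.format(y_pred.shape[1], y_true.shape[1])
```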

cloudscapes commented 4 years ago

Hi Brian, as always, thank you so much for getting back and answering my question.

Sure, the structured output idea is totally independent of the processing by which you get to the output layer.

That's a relief!

It would at least be interesting to see where the two architectures agree or disagree

That's also my goal.

The key thing is to make sure that your output layer produces estimates at the same frame rate as your labels (which, in our implementation, was sampled to match the input frame rate).

In order to compare our results, I've followed your implementation up until the point where the data enters the UNet architecture:

```python
sr = 44100
hop_length = 4096

p_feature = pumpp.feature.HCQTMag(name='cqt', sr=sr, hop_length=hop_length,
                                  over_sample=3, harmonics=[1, 2, 3],
                                  log=True, conv='tf', n_octaves=6)
p_chord_tag = pumpp.task.ChordTagTransformer(name='chord_tag', sr=sr,
                                             hop_length=hop_length, sparse=True)
p_chord_struct = pumpp.task.ChordTransformer(name='chord_struct', sr=sr,
                                             hop_length=hop_length, sparse=True)
pump = pumpp.Pump(p_feature, p_chord_tag, p_chord_struct)
```

and used the "convert" function to transform both my audio data and jam files. 1) Is this what you mean by making sure my output layer produces estimates at the same frame rate as my labels?

Also, I'm going to use your "sample" and "data_generator" functions to feed my data to the UNet architecture, since your implementation is very well written (no need to reinvent the wheel!).

But my grand question is:

2) how am I going to incorporate the structured training approach with the UNet architecture?

You take the output of your bidirectional GRU layer, apply three different TimeDistributed Dense layers to get the pitch-class, root-class, and bass-class predictions, and then concatenate these three layers together. Later on, you do:

```python
p0 = Dense(len(pump['chord_tag'].vocabulary()), activation='softmax',
           bias_regularizer=tf.keras.regularizers.l2())
tag = TimeDistributed(p0, name='chord_tag')(codec)
```

Shall I take the output of my last UNet block, before the usual classification layer, and do:

```python
pc_p = Conv2D(pump.fields['chord_struct/pitch'].shape[1], 1, 1,
              activation='sigmoid', data_format='channels_first')(lastUnetBlock)

root_p = Conv2D(13, 1, 1, activation='softmax', data_format='channels_first')(lastUnetBlock)

bass_p = Conv2D(13, 1, 1, activation='softmax', data_format='channels_first')(lastUnetBlock)

codec = concatenate([lastUnetBlock, pc_p, root_p, bass_p])

p0 = Conv2D(len(pump['chord_tag'].vocabulary()), 1, 1, activation='softmax',
            bias_regularizer=tf.keras.regularizers.l2(),
            data_format='channels_first')(lastUnetBlock)
```

and then 3) concatenate the pitch-class, bass-class, and root-class predictions with the "p0" that I defined above?

Also, along the way, before the data enters your Bi-GRU, you throw away the frequency bins (your Lambda layer) and deal with only the time index.

4) Do I have to do the same thing (after the last UNet block)? Or can I let the frequency bins stay, since I don't use any recurrent layer?

5) Also, do I need to use TimeDistributed(Conv2D(...)), or is it fine if I just use Conv2D?
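To make questions 4) and 5) concrete, this is the kind of shape bookkeeping I have in mind after the last UNet block (purely my own sketch; collapsing frequency with a mean and the axis ordering are guesses on my part):

```python
import tensorflow as tf
from tensorflow.keras.layers import Activation, Conv2D, Lambda, Permute

# Sketch only: with channels_first, the last UNet block has shape (batch, channels, time, freq).
logits = Conv2D(len(pump['chord_tag'].vocabulary()), 1,
                data_format='channels_first')(lastUnetBlock)    # (batch, n_classes, time, freq)
logits = Lambda(lambda t: tf.reduce_mean(t, axis=-1))(logits)   # collapse freq: (batch, n_classes, time)
logits = Permute((2, 1))(logits)                                # (batch, time, n_classes)
tag = Activation('softmax', name='chord_tag')(logits)           # per-frame softmax over chord classes
```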

If I get these last steps right, I can basically start training/testing. Looking forward to your answer/help.

Cheers, H

BTW, if you think seeing my model summary would help, I can send it to you.